HCA Tier 1 metadata mapping to DCP metadata fields

ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.

https://ebi-ait.github.io/hca-ebi-wrangler-central/

Apache License 2.0


HCA Tier 1 metadata mapping to DCP metadata fields #1178

Closed arschat closed 2 months ago

arschat commented 1 year ago

Description of the task:

We are given a list of the Tier 1 metadata that are going to be used in the integration. We are asked to map those metadata to fields in our metadata schema and provide example values for each field.

Here is the drive folder with the spreadsheet https://drive.google.com/drive/folders/1fobiz332ylmPc738dSoLSQEM7TxjYJBF?usp=sharing

Acceptance criteria for the task:

Wranglers have given their feedback, and @arschat has summarised all feedback and replied to the HCA Bionetworks committee with a suggested mapping and some comments and questions.

arschat commented 1 year ago

Notes on differences in fields:

subject_developmental_state: donor_organism.development_stage.ontology_label contains much more detailed information than just prenatal/postnatal.
ethnicity_2: donor_organism.human_specific.ethnicity.ontology_label can include more than one entry (type: array)
sampled_site_condition: a combination of specimen_from_organism.diseases.ontology_label and specimen_from_organism.adjacent_diseases.ontology_label could be used
reference_genome_ensembl_release: in analysis_file.genome_patch_version we collect the patch version of the Genome Reference Consortium. Could be converted to a range of Ensembl release versions based on http://www.ensembl.org/info/website/archives/assembly.html
library_preparation_batch: for plate-based techniques we have cell_suspension.plate_based_sequencing.plate_label; otherwise we do not collect this information
library_sequencing_batch: we don't have a field that captures this information. There is sequence_file.lane_index but it can only provide information about the same process.insdc_experiment.insdc_experiment_accession

Conditional fields

Fields that are not identical but can be easily converted from HCA metadata schema standards to Tier 1

study_PI: if project.contributors.project_role.ontology_label == "Principal Investigator" then project.contributors.name else if project.contributors.corresponding_contributor == "True" then project.contributors.name
sample_collection_relative_timepoint: if multiple specimen_from_organism.biomaterial_core.biomaterial_id have the same donor_organism.biomaterial_core.biomaterial_id then specimen_from_organism.collection_time; if that is not available, specimen_from_organism.biomaterial_core.timecourse.* or donor_organism.biomaterial_core.timecourse.*
age_years: if "-" is in donor_organism.organism_age, the age is a range (see age_range); if donor_organism.organism_age_unit.ontology_label is not "year", we could divide by the corresponding value to convert to decimal years
age_gestational: if donor_organism.gestational_age_unit.ontology_label is not "year", we could divide by the corresponding value to convert to decimal years
age_range: if "-" is in donor_organism.organism_age, fill here instead
cell_enrichment: if enrichment_protocol.method.ontology_label == "EFO:0009108" or enrichment_protocol.method.ontology_label == "EFO:0009109" then enrichment_protocol.markers + enrichment_protocol.method.ontology_label else enrichment_protocol.method.ontology_label
sample_cultured: if cell_line.biomaterial_core.biomaterial_id exists or cell_suspension.growth_conditions.culture_environment exists then yes
protocol_doi: for each protocol: *.protocol_core.protocols_io_doi
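
As a rough illustration of the conditional fields above, here is a minimal Python sketch for study_PI and age_years/age_range. The flat dict layout, the helper names `study_pi` and `age_fields`, and the unit divisors are assumptions made for the example, not the ingest data model. It also takes one possible reading of the study_PI rule: Principal Investigators first, corresponding contributors only as a fallback.

```python
# Hypothetical helpers; flat dicts stand in for DCP metadata entities.
DAYS_PER_YEAR = 365.25
UNITS_PER_YEAR = {  # approximate divisors to convert an age to decimal years
    "year": 1.0,
    "month": 12.0,
    "week": DAYS_PER_YEAR / 7,
    "day": DAYS_PER_YEAR,
}


def study_pi(contributors):
    """Names of Principal Investigators, else of corresponding contributors."""
    pis = [c["name"] for c in contributors
           if c.get("project_role.ontology_label") == "Principal Investigator"]
    if pis:
        return pis
    return [c["name"] for c in contributors
            if str(c.get("corresponding_contributor")) == "True"]


def age_fields(organism_age, unit_label):
    """Return (age_years, age_range); exactly one of the two is filled."""
    divisor = UNITS_PER_YEAR[unit_label]
    if "-" in organism_age:  # a range such as "20-25" goes to age_range
        low, high = (float(x) / divisor for x in organism_age.split("-"))
        return None, f"{low:g}-{high:g}"
    return float(organism_age) / divisor, None
```

With these assumptions, an age of "6" with unit "month" gives an age_years of 0.5, while "20-25" with unit "year" is routed to age_range.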

Questions about fields:

institute: project.contributors.institute would be an array with all the institutes of the authors of the publication. A more appropriate field would be process.process_core.location for the specific process of interest (collection / tissue dissociation & handling / library preparation / sequencing etc.), but it is not always mentioned and collected. Which part of the process would be of interest (tissue collection / tissue dissociation & handling / library prep + sequencing)?
library_ID: the last biomaterial entity that we have before sequencing is "Cell Suspension". This means for every project we have a separate cell_suspension.biomaterial_core.biomaterial_id. In some cases we might have some information about library_ID in fields such as cell_suspension.plate_based_sequencing.plate_label, sequence_file.library_prep_id, sequence_file.insdc_run_accessions but we might not always have such information. We could use cell_suspension.biomaterial_core.biomaterial_id for library_ID and publication library ID for library_ID_publication if it is available.
cell_type: we do not have a field for this information for now. In some cases authors might provide an extra analysis file that contains this information.

Other note

For all ontologised fields we have 3 separate fields: text, ontology and ontology_label. The first is free text, ontology contains the corresponding ontology accession, and ontology_label contains the label of that ontology term (in some cases more detailed text is added in the text field, while the other fields are constrained by the ontology).

idazucchi commented 1 year ago

institute : is the relevant information the sampling location, the tissue dissociation and handling location, or the sequencing location? The three can happen in three separate places
study_PI: if project.contributors.project_role.ontology_label == "Principal Investigator" then project.contributors.name, or if project.contributors.corresponding_contributor == "True" then project.contributors.name. The second option is less accurate (sometimes the first author is a corresponding author as well), but it is more frequently filled in
sample_collection_relative_timepoint : since collection dates are not available this can be mapped to specimen_from_organism.biomaterial_core.timecourse.* or donor_organism.biomaterial_core.timecourse.*
subject_developmental_state : donor_organism.development_stage.ontology_label is the correct mapping but there’s more detailed information than just prenatal/postnatal
protocol_tissue_dissociation : in principle it’s true that dissociation_protocol.method.ontology_label describes the dissociation protocol but two different enzymatic protocols (collagen V, 25˚, 25’ or Trypsin, 4˚, 1h) would be considered the same - the label is not enough to distinguish between different protocols in one dataset, and would be repeated across datasets. If the aim is to distinguish different dissociation protocols used in one dataset dissociation_protocol.protocol_core.protocol_id would be a better fit, although it might repeat across different datasets
cell_enrichment : this could be a conditional field to add selected markers: if enrichment_protocol.method.ontology_label == "EFO:0009108" or enrichment_protocol.method.ontology_label == "EFO:0009109" then enrichment_protocol.markers + enrichment_protocol.method.ontology_label else enrichment_protocol.method.ontology_label
sample_cultured : can also be true if cell_suspension.growth_conditions.culture_environment exists
library_preparation_batch : sequence_file.library_prep_id groups together files produced from the same library, not different libraries processed in the same machine/chip/plate. For plate based techniques we have cell_suspension.plate_based_sequencing.plate_label but we don’t have an equivalent field for droplet techniques or spatial ones
library_sequencing_batch : we don't have a field that captures this information. There is sequence_file.lane_index but it can only provide information about the same process.insdc_experiment.insdc_experiment_accession
reference_genome_ensembl_release : we don’t have information about the Ensembl release, and we can at best only narrow it down to a range
alignment_software : sequencing_protocol.10x.fastq_method is not a good match, that’s supposed to be filled with software to make the fastq files rather than aligning them
comments : *.biomaterial_core.biomaterial_description can be a catch-all field for comments or extra information that doesn’t fit into the schema
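
The cell_enrichment and sample_cultured conditions above could be sketched in Python as follows. The thread writes the comparison against ontology_label, but the quoted values are EFO accessions, so this sketch assumes the `ontology` field is what is meant. The dict keys and function names are hypothetical.

```python
# Accessions quoted in the thread for marker-based enrichment methods.
MARKER_BASED_METHODS = {"EFO:0009108", "EFO:0009109"}


def cell_enrichment(protocol):
    """Build the Tier 1 cell_enrichment text from one enrichment protocol.

    `protocol` is a flat dict standing in for enrichment_protocol (assumed layout).
    """
    label = protocol["method.ontology_label"]
    if protocol.get("method.ontology") in MARKER_BASED_METHODS and protocol.get("markers"):
        # Marker-based method: prepend the selected markers to the method label.
        return f'{protocol["markers"]} {label}'
    return label


def sample_cultured(cell_line_id=None, culture_environment=None):
    """'yes' if a cell line ID or a culture environment is recorded, else 'no'."""
    return "yes" if (cell_line_id or culture_environment) else "no"
```

Joining markers and label with a single space is an arbitrary choice here; the thread only says the two values are combined.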

arschat commented 12 months ago

suggestions have been sent, waiting for any feedback or questions.

arschat commented 12 months ago

Got a reply

There are now 11 gaps I still need to fill, and I wanted to ask whether you would be able to work with me to fill these gaps - while I'm enthusiastically learning about metadata I don't have the depth of expertise required to define ontology terms. I'd also be keen to discuss some of your comments.

The 11 missing gaps are:

dataset*
subject_ID_published
library_ID
library_ID_published
library_ID_repository
anatomical_region_level1*
anatomical_region_level2*
library_preparation_batch
library_sequencing_batch
author_batch_notes
alignment_software*

Notes on the missing fields:

Although we do not have exact mapping between the DCP metadata and those metadata, if we are obliged to fill these fields, here are some thoughts on that.

dataset

Given that the factor that separates datasets within the same study is the library that was used (or any other specified metadata field), we could also add a dataset ID such as "Theinpont_2018_10Xv1". For the dataset name, following the way CxG names the datasets of a collection, we could use the publication title followed by the separating factor in parentheses. CxG is highly dependent on the number of count matrices & cell embedding coordinates that the authors provide.

anatomical_region_level1 and anatomical_region_level2

Since organ_part is ontologised, we could potentially extract the parent ontology terms of organ_part. Levels 1, 2 and 3 could each be restricted to specific ontology classes.

alignment_software

Although it is not accurate, if we are obliged to fill this field, we could use sequencing_protocol.10x.fastq_method, since alignment is usually part of the same 10x pipeline as the fastq creation method.

arschat commented 12 months ago

After discussion with Tony, I will create a report to describe the current situation between the CellxGene, DCP and Integration teams and the mapping of those terms, and to propose some options.

arschat commented 11 months ago

On 21 September, we had a call with Lucia Robson and Ellen Todres, and we discussed the DCP mapping for all the Tier 1 metadata. There were some requests on specific fields for DCP metadata.

The requested changes that were discussed were the following:

library_ID field
- major update in a new file type/biomaterial/library.json
- discussion on this in the DCP history:
- #1 doc
- #2 pr on dcp-community
- #3 presentation
library_preparation_batch field
- sequence_file.library_prep_id is only available if we have seq files
- addition; in library_ID field or in type/file/analysis_file.json / type/file/sequence_file.json
library_sequencing_batch
- sequence_file.insdc_run_accessions works but if we do not have accession, there should be a field to record this
- addition optional; minor update in type/file/sequence_file.json
alignment_software field
- addition; major update in the type/file/analysis_file.json

Other comments that were made:

manner_of_death enum (0,1,2,3,4, "unknown")
- make hardy_scale a required field instead of free text cause_of_death
- find a way to model the unknown too
- major update in module/biomaterial/death.json
sample_cultured enum (yes; no)
- could be extracted from current schema if we run a script
  - if cell_line.biomaterial_core.biomaterial_id exists or organoid.biomaterial_core.biomaterial_id exists or cell_suspension.growth_conditions.culture_environment exists then yes
- otherwise, major update in type/biomaterial/cell_suspension.json
subject_developmental_status enum (prenatal; paediatric; adult)
- could be extracted from current schema if we run a script
  - map all developmental ontologies to 2 or 3 fields
- otherwise, major update in module/ontology/development_stage_ontology.json & type/biomaterial/donor_organism.json
post_conception_weeks integer, not range
- gestational age could be converted to PCW with PCW = gestational age - 2 (in weeks), if we run a script
age_range will be added only in range to avoid identifiable metadata (range of 5y or 10y)
- convert organism_age to age_range if we run a script
- difficulties might arise if ranges do not match
species enum (homo_sapiens; mus_musculus)
- donor_organism.genus_species.text map ontologies to enum values if we run a script
disease_status enum (healthy; disease)
- donor_organism.diseases.text simplified to enum (healthy; disease) for Tier 1
sample_site_condition enum (healthy; diseased; adjacent)
- specimen_from_organism.diseases.text simplified to enum (healthy; diseased; adjacent)
sample_collection_method enum (brush; scraping; biopsy; surgical_resection; blood, body fluid, other)
- collection_protocol.method.text simplified to enum (brush; scraping; biopsy; surgical_resection; blood, body fluid, other)
cell_enrichment free text
- No or type of enrichment (exclude blood cell lysis and dead cell removal)
- if enrichment_protocol.method.ontology_label == "EFO:0009108" or enrichment_protocol.method.ontology_label == "EFO:0009109" then enrichment_protocol.markers + enrichment_protocol.method.ontology_label else enrichment_protocol.method.ontology_label
Ellen verified that the sequencing run ID works as library_sequencing_batch
sample_source enum (postmortem donor; organ donor; surgical donor)
- could be extracted from current schema (after transplant PR that Ida is working on) if we run a script
  - if PR.FIELD == allograft then "organ donor" else if donor_organism.is_living == no then "postmortem donor" else "surgical donor"
- Comment from the spreadsheet:
  
  The study subgroup that the participant belongs to. This indicates whether the participant was a postmortem donor, an organ donor, or a surgical donor (includes blood samples / biopsies)
- otherwise, major update in type/biomaterial/specimen_from_organism.json
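
Two of the "if we run a script" conversions above, post_conception_weeks and age_range, are simple enough to sketch in Python. The function names are hypothetical, and the bin boundaries (bins starting at multiples of the width, written as "low-high") are an assumption, since the thread only specifies 5- or 10-year ranges.

```python
def post_conception_weeks(gestational_age_weeks):
    """PCW = gestational age - 2, with both values in whole weeks."""
    return int(gestational_age_weeks) - 2


def age_range(age_years, width=10):
    """Bin an exact age into a fixed-width range, for example 10-year bins."""
    low = int(age_years // width) * width
    return f"{low}-{low + width}"
```

For example, `age_range(34)` gives "30-40" and `age_range(34, width=5)` gives "30-35". This does not address the mismatch problem noted above: a source age that is already a range and straddles two bins (say 33-37 with 5-year bins) cannot be assigned to a single bin this way.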

arschat commented 11 months ago

Asked some more info about the sample_source field, in order to proceed accordingly, with the transplant PR. Malte Luecken replied:

I wonder if organ donor might also include tissue from organs that were rejected for donation. That would be a larger group of samples than only allografts/xenografts.

After a Miro board brainstorm I drafted a reply, in order to separate the questions this field asks and define which of those is the required information.

Hello Ellen and Malte,

Thank you for the replies!

It is my understanding that sample from a post-mortem donor will be of low quality compared to a living donor, while in case of a transplant, we might have genetic material from multiple organisms in the same sample. Are these all the different effects we would like to record in the sample_source field or is there something more?

Taken from the options in the enum and your point Malte, the sample_source information, could be broken down to 2+1 questions:

Q1. Was the donor deceased at the time of collection?

Q2. Is the sample part of a transplant organ (either allograft or xenograft)?

Q3. If Q2 is yes, the transplant might be healthy or rejected after the surgical procedure (either hyperacute, acute or chronic rejection).

Based on the questions above, I understand we have the following decision tree: If Q1 is yes, then sample_source should be post-mortem donor. If Q2 is yes, then sample_source should be organ donor (if this is the case, another name might be more descriptive for example transplant tissue). If answer to Q1 & Q2 is no, then sample_source should be surgical donor.

A small note here: if both Q1 & Q2 are yes, i.e. tissue of the deceased was a transplant, we have to decide which information we record, alive/dead or transplant/not transplant.

About your point, Malte, if an organ was rejected for donation before the organ transplantation surgical procedure, then the sample would be healthy (in order to be considered initially suitable for transplant) and would not have tissue (or severely interacted with tissue) from another organism. A. Would we still like to define the tissue rejected for donation as organ donor, or would surgical donor be more suitable?

Finally, I understand that for Tier 1 metadata we would like to have simple metadata, so I would like to ask whether the following metadata would be of interest to record here:

B. Whether the transplant is from the same organism, same species or different species (autograft; allograft; xenograft), described here

C. Whether the transplant was rejected and what type of rejection it was (hyperacute; acute; chronic), described here

Thank you both for your feedback, I am looking forward to your thoughts.

arschat commented 11 months ago

Malte replied

I don't think that "organ_donor" covers any organs that were at some point transplanted into another host and then removed for sampling. It was my understanding that these tissue samples are from individuals who donate their organs to science and maybe their organs were previously considered not fit for transplantation (this is what I meant with rejected). I'm checking this with some bionetwork coordinators now to check that my understanding is correct though. Maybe Chloe could weigh in here too.

Surgical donor would then be where the individual is still alive and part of their tissue is taken out. Overall I don't think xenograft/allografts play a role in this metadata field at all. But again, I'm not an expert in tissue sampling.

As for the reason for tissue transplant rejection, I don't think we would be able to get that information. It may also be restricted access/protected.

Our reply

Hi Malte, I hope you're doing well and thanks for your earlier input about the transplanted organ. I have a few follow-up questions to help me better understand how we differentiate between "organ donor" and the other options "post-mortem donor" and "surgical donor":

Does the category "organ donor" specifically pertain to the entire organ being collected, or can it also include cases where only a part of tissue is collected?

Is "organ donor" a category that can apply to both living and deceased individuals? Any insights on these points would be greatly appreciated as we work to tackle the updates on the DCP metadata schema. Thanks in advance, and I look forward to hearing from you!

Malte reply:

I will try to answer these questions as best I can, but I just want to highlight that I'm really not the expert here as I haven't collected tissue myself. Maybe it's worth also talking to someone with a more biological/clinical background.

Does the category "organ donor" specifically pertain to the entire organ being collected, or can it also include cases where only a part of tissue is collected? This I can't really answer, as I don't know the clinical practice for sample collection from tissue that is collected for scientific purposes.

Is "organ donor" a category that can apply to both living and deceased individuals? My understanding is that "organ donor" applies to samples from deceased individuals who donate their organs to science. The quality of these samples is usually not as good as from living individuals.

Where did the categories post-mortem donor, organ donor, and surgical donor come from? Maybe it's worth checking more in that resource. My main point earlier was that already transplanted tissue (including allografts/xenografts) are not likely to be samples that end up in reference atlases as these look very different from "normal tissue".

arschat commented 11 months ago

Current snapshot of Tier 1 mapping:

alignment_software pr has been made HumanCellAtlas/metadata-schema#1534
sample_source discussions about definitions and differentiations of 3 options stalled
library_ID fields internal (EBI/UCSC) discussions on how to model this information

arschat commented 11 months ago

Update on library_ID after EBI internal discussion today. We have the following options:

Option 1: library as a new biomaterial entity
- create a new entity, that has the following fields:
  1. library.biomaterial_core
  2. library.preparation_batch
  3. library.sequencing_batch
  4. cell_suspension.biomaterial_core.biomaterial_id
Option 2: cell_suspension as library
- instead of creating a new library biomaterial entity, we will use the cell_suspension entity to describe libraries
- we will need two new fields in cell_suspension, to describe the library_preparation_batch & library_sequencing_batch
Option 3: batch module in analysis_file
- create a batch module in analysis_file that contains library_ID, sequence_batch, cell_suspension_ID, analysis_file_ID
- module will be in a separate tab (the way projects.publication does) but will also include an analysis_file_ID to connect back to specific analysis_file

Option 1 will result in multiple redesigns of ingest, import and data browser, and will need a lot of time to design and apply those changes. It was voted down by both UCSC and EBI wranglers.

Between options 2 and 3, we (EBI wranglers) decided to proceed with option 2, since it will need fewer changes in the schema and in the ingestion of the data, and will be more intuitive for the wrangler to fill in compared with option 3.

arschat commented 10 months ago

Alignment software PR

After Hannes' comments, alignment_software & alignment_software_version were moved to analysis_protocol & converted to optional (alignment_software in dependentRequired for alignment_software_version)
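
To illustrate what the dependentRequired arrangement means in practice, here is a small self-contained Python sketch. The schema fragment is written for this example and is not copied from the metadata-schema PR. The `missing_dependencies` helper is hypothetical and only mimics the JSON Schema `dependentRequired` keyword rather than performing full validation.

```python
# Illustrative fragment only: both fields optional, but a version without
# the software name is rejected.
SCHEMA_FRAGMENT = {
    "properties": {
        "alignment_software": {"type": "string"},
        "alignment_software_version": {"type": "string"},
    },
    "dependentRequired": {
        "alignment_software_version": ["alignment_software"],
    },
}


def missing_dependencies(instance, schema=SCHEMA_FRAGMENT):
    """Return (present_field, missing_field) pairs that violate dependentRequired."""
    return [
        (field, needed)
        for field, required in schema.get("dependentRequired", {}).items()
        if field in instance
        for needed in required
        if needed not in instance
    ]
```

An empty analysis_protocol, or one with only alignment_software, passes; one with only alignment_software_version is reported as missing alignment_software.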

arschat commented 9 months ago

Gene Annotation

Field gene_annotation_version

Description: Ensembl or NCBI/RefSeq release version number
Examples: v110; GCF_000001405.40
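
A wrangler-side sanity check for this field might look like the sketch below. The Ensembl pattern is inferred only from the "v110" example, and the RefSeq pattern follows the GCF_ assembly accession format shown in the second example. The function name is hypothetical and real submissions may well use other forms.

```python
import re

# Patterns inferred from the two examples above; treat them as assumptions.
ENSEMBL_RELEASE = re.compile(r"^v\d+$")            # e.g. v110
REFSEQ_ASSEMBLY = re.compile(r"^GCF_\d{9}\.\d+$")  # e.g. GCF_000001405.40


def annotation_source(version):
    """Guess which resource a gene_annotation_version value refers to."""
    if ENSEMBL_RELEASE.match(version):
        return "Ensembl"
    if REFSEQ_ASSEMBLY.match(version):
        return "NCBI/RefSeq"
    return "unrecognised"
```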

idazucchi commented 7 months ago

Pending discussion on the library id and sample source - schedule discussion with Tony @arschat

arschat commented 6 months ago

Current options for library_ID:

Cell_suspension as a library_ID:
- This would not be accurate, since cell_suspension is not identical to the library. However, since Tier 1 has 3 biomaterials (donor, sample, library), it would be similar to our 3-level biomaterials (donor_organism, specimen_from_organism, cell_suspension) of a simple experimental design. For most experiments we've described, however, one cell_suspension provides one library.
- we could store the batch and run information in the sequence_file, in the fields we already have for this.
Library field in the sequence_file tab:
- we could create a new field in the sequence_file tab, where we could record the library_IDs associated with each file. However, this way we would not be able to associate analysis_files with the library_IDs if the libraries from 1 cell_suspension are pooled.
Library as a new biomaterial:
- This would be the most accurate, however, it would take much more time to implement
- The simple experimental design would be described by 4 biomaterials in contrast to the 3 biomaterials described in Tier 1, and this might confuse bionetwork contributors that are filling the spreadsheet.

arschat commented 6 months ago

About organ_donor in sample_source we got the following clarification from Lucia & Malte.

If I recall correctly, this is relevant as post-mortem tissue that is no longer properly perfused shows a distinct transcriptomic effect (you never get tissue immediately after death, usually 2-3 hours is minimum). Organ donor samples are from individuals that are e.g., brain dead but where the organ is still kept alive so that no cellular degeneration is evident due to lack of perfusion. This looks more like living tissue than post-mortem. On the other hand, surgical tissue is from an individual who is still very much alive and may have e.g., eaten and metabolized food recently.

I'm not sure I would say "organ donor from a living subject", as organs will only be removed if the donor is declared brain-dead... which I'm not sure counts as living. Any bionetwork coordinator for bionetwork that collected tissue blocks of some sort will know this better than me though.

There are two options.

donor_type enum field
- organ_donor: biomaterial was intended for organ transplantation
- surgical_donor: biomaterial was collected during a surgical, medical or non-invasive procedure (for example biopsy, blood draw, bronchoalveolar lavage, nasal brush, scraping etc.).
organ_donor boolean field
- Whether biomaterial was intended for organ transplantation.

The second option is clearer and more robust, and the information about surgical donors is recorded in the collection protocol.

About mapping from Tier 1 to DCP:

if sample_source == "organ donor":
    donor_organism.organ_donor = True
    donor_organism.is_living = False  # * see the note on ambiguity below
elif sample_source == "post-mortem donor":
    donor_organism.organ_donor = False
    donor_organism.is_living = False
elif sample_source == "surgical donor":
    donor_organism.organ_donor = False
    donor_organism.is_living = True

* With Tier 1 modeling there is ambiguity for the is_living option if the subject is an organ donor; for most organs/bionetworks the subject should be considered deceased (ideally we could specify the donor_organism.death.organ_donation_death_type).
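
For the opposite direction (DCP to Tier 1), the mapping above can be inverted. The sketch below is a hypothetical helper using the field names from this comment. It checks organ_donor first, precisely because is_living is ambiguous for organ donors.

```python
def sample_source(organ_donor, is_living):
    """Derive the Tier 1 sample_source from the two DCP values (assumed booleans)."""
    if organ_donor:
        # is_living is deliberately ignored here; see the ambiguity note above.
        return "organ donor"
    return "surgical donor" if is_living else "post-mortem donor"
```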

arschat commented 6 months ago

Last update on mapping here. Template with Tier 1 at row 4 here.

arschat commented 6 months ago

Conversion from anndata Tier 1 object to DCP spreadsheet with a Jupyter notebook and an interchangeable mapping dictionary here.

arschat commented 5 months ago

Conversion from Tier 1 to DCP moved here #1252

arschat commented 5 months ago

transplant_organ PR merged! Working on intron_inclusion PR.

idazucchi commented 5 months ago

intron_inclusion PR merged! last thing to discuss:

[ ] do we add a field for sequencing batch? edit the run accession field?

idazucchi commented 4 months ago

we are adding a field for sequencing batch - it will be important for projects deposited directly with HCA that don't have Run accessions

arschat commented 3 months ago

library_sequence_run: This could be either the ID of the sequence RUN or the sequence BATCH. Asked for clarification and got the following replies:

my assumption is that we're interested in the batch (as there shouldn't be any differences between runs if they're all being processed at the same time). However, I'm absolutely not an expert so need to check with someone and get back to you

Confirming that library_sequencing_run is a custom field. Also, library_sequencing_run is the higher order term, whereas library_preparation_batch I think is the lower order term, as it's referring to libraries sequenced on the same plate/chip

arschat commented 2 months ago

sequence_run_batch which is equivalent to library_sequence_run is now merged in prod. sample_collection_site can be recorded in the process.process_core.location.

Mapping of Tier 1 has been completed, referenced here.

Tier 1 fields that are not mapped:

batch_condition, default_embedding, comments, author_batch_notes, tissue_type, is_primary_data, author_cell_type, cell_type_ontology_term_id

All mapping here.

Tier 1	HCA metadata schema
title	project.project_core.project_title
study_pi	project.contributors.name
batch_condition	NA
default_embedding	NA
comments	NA
sample_id	specimen_from_organism.biomaterial_core.biomaterial_id
donor_id	donor_organism.biomaterial_core.biomaterial_id
protocol_url	library_preparation_protocol.protocol_core.protocols_io_doi
institute	project.contributors.institute
sample_collection_site	process.process_core.location
sample_collection_relative_time_point	specimen_from_organism.biomaterial_core.timecourse.value
library_id	cell_suspension.biomaterial_core.biomaterial_id
library_id_repository	cell_suspension.biomaterial_core.biomaterial_name
author_batch_notes	NA
organism_ontology_term_id	donor_organism.biomaterial_core.ncbi_taxon_id
manner_of_death	donor_organism.death.hardy_scale
sample_source	donor_organism.is_living & specimen_from_organism.transplant_organ
sex_ontology_term_id	donor_organism.sex
sample_collection_method	collection_protocol.method.text
tissue_type	NA
sampled_site_condition	donor_organism.diseases.text & specimen_from_organism.diseases.text
tissue_ontology_term_id	specimen_from_organism.organ.ontology
tissue_free_text	specimen_from_organism.organ.text
sample_preservation_method	specimen_from_organism.preservation_storage.storage_method
suspension_type	library_preparation_protocol.nucleic_acid_source
cell_enrichment	enrichment_protocol.markers
cell_viability_percentage	cell_suspension.cell_morphology.percent_cell_viability
cell_number_loaded	cell_suspension.estimated_cell_count
sample_collection_year	specimen_from_organism.collection_time
assay_ontology_term_id	library_preparation_protocol.library_construction_method.ontology
library_preparation_batch	sequence_file.library_prep_id
library_sequencing_run	sequence_run_batch
sequenced_fragment	library_preparation_protocol.end_bias
sequencing_platform	sequencing_protocol.instrument_manufacturer_model.text
is_primary_data	NA
reference_genome	analysis_file.genome_assembly_version
gene_annotation_version	analysis_protocol.gene_annotation_version
alignment_software	analysis_protocol.alignment_software_version
intron_inclusion	analysis_protocol.intron_inclusion
author_cell_type	NA
cell_type_ontology_term_id	NA
disease_ontology_term_id	donor_organism.diseases.ontology
self_reported_ethnicity_ontology_term_id	donor_organism.human_specific.ethnicity.ontology
development_stage_ontology_term_id	donor_organism.development_stage.ontology
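
As a rough idea of how a mapping dictionary built from the table above can drive a conversion, here is a minimal Python sketch using a handful of rows. It is not the notebook's code: `TIER1_TO_DCP` covers only a subset of the table, `convert_row` is a hypothetical helper, and fields that need combining or recoding (for example sample_source) are out of scope.

```python
# Subset of the table above; None marks Tier 1 fields listed as "NA".
TIER1_TO_DCP = {
    "title": "project.project_core.project_title",
    "sample_id": "specimen_from_organism.biomaterial_core.biomaterial_id",
    "donor_id": "donor_organism.biomaterial_core.biomaterial_id",
    "library_id": "cell_suspension.biomaterial_core.biomaterial_id",
    "tissue_free_text": "specimen_from_organism.organ.text",
    "author_batch_notes": None,
}


def convert_row(tier1_row, mapping=TIER1_TO_DCP):
    """Rename Tier 1 keys to DCP field names; return (converted, unmapped)."""
    converted, unmapped = {}, []
    for key, value in tier1_row.items():
        target = mapping.get(key)
        if target is None:
            # Either "NA" in the table or not present in this subset.
            unmapped.append(key)
        else:
            converted[target] = value
    return converted, unmapped
```

Returning the unmapped keys alongside the converted row makes it easy to see which Tier 1 values would be dropped.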

This ticket can now close.