Closed arschat closed 2 months ago
donor_organism.development_stage.ontology_label
contains much more detailed information than just prenatal/postnatal.
donor_organism.human_specific.ethnicity.ontology_label
can include more than 1 entry (type: array)
specimen_from_organism.diseases.ontology_label
and specimen_from_organism.adjacent_diseases.ontology_label
could be used
analysis_file.genome_patch_version
we collect the patch version of the Genome Reference Consortium. Could be converted to a range of Ensembl release versions based on http://www.ensembl.org/info/website/archives/assembly.html
cell_suspension.plate_based_sequencing.plate_label
else we do not collect this information
sequence_file.lane_index
but it can only provide information about the same process.
insdc_experiment.insdc_experiment_accession
Fields that are not identical but can be easily converted from HCA metadata schema standards to Tier 1
project.contributors.project_role.ontology_label
== "Principal Investigator" then project.contributors.name
else if project.contributors.corresponding_contributor
== "True" then project.contributors.name
specimen_from_organism.biomaterial_core.biomaterial_id
have the same donor_organism.biomaterial_core.biomaterial_id
then specimen_from_organism.collection_time
if not available specimen_from_organism.biomaterial_core.timecourse.*
or donor_organism.biomaterial_core.timecourse.*
-
in donor_organism.organism_age
age is a range
if donor_organism.organism_age_unit.ontology_label
is not "year" we could divide by the corresponding value to convert to decimal year
donor_organism.gestational_age_unit.ontology_label
is not "year" we could divide by the corresponding value to convert to decimal year
-
in donor_organism.organism_age
fill here instead
enrichment_protocol.method.ontology_label
== "EFO:0009108" or enrichment_protocol.method.ontology_label
== "EFO:0009109" then enrichment_protocol.markers
+ enrichment_protocol.method.ontology_label
else enrichment_protocol.method.ontology_label
cell_line.biomaterial_core.biomaterial_id
exists or cell_suspension.growth_conditions.culture_environment
exists then yes
*.protocol_core.protocols_io_doi
project.contributors.institute
would be an array with all the institutes of the authors of the publication. A more appropriate field would be process.process_core.location
for the specific process that we would like (collection/ tissue dissociation & handling/ library preparation/ sequencing etc.) but it is not always mentioned and collected. Which part of the process would be of interest to collect (tissue collection/ tissue dissociation & handling/ library prep + sequencing)?
cell_suspension.biomaterial_core.biomaterial_id
. In some cases we might have some information about library_ID in fields such as cell_suspension.plate_based_sequencing.plate_label
, sequence_file.library_prep_id
, sequence_file.insdc_run_accessions
but we might not always have such information.
We could use cell_suspension.biomaterial_core.biomaterial_id
for library_ID and publication library ID for library_ID_publication if it is available.
For all ontologised fields we have 3 separate fields, text
, ontology
, ontology_label
. The first stands for free text, ontology contains the corresponding ontology accession, and ontology_label contains the corresponding ontology label (in some cases more detailed text is added in the text
field while the other fields might be constrained by the ontology).
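The age rule mentioned earlier in this comment (donor_organism.organism_age may be a range written with "-", and units other than "year" are divided by the corresponding value) could be sketched roughly as below. This is only an illustration: the function name, the UNITS_PER_YEAR table and its approximate divisors are assumptions, not part of the HCA metadata schema or of Tier 1.

```python
# Hypothetical divisors (assumption): how many of each unit make up one year.
UNITS_PER_YEAR = {
    "year": 1.0,
    "month": 12.0,
    "week": 365.25 / 7,
    "day": 365.25,
}


def age_to_decimal_years(organism_age: str, unit_label: str) -> str:
    """Convert an age such as "6" or "20-30" to decimal years.

    A range stays a range; each bound is converted separately.
    """
    divisor = UNITS_PER_YEAR[unit_label]  # raises KeyError for unknown units
    bounds = [float(part) / divisor for part in organism_age.split("-")]
    return "-".join(f"{bound:g}" for bound in bounds)


print(age_to_decimal_years("6", "month"))     # single value given in months
print(age_to_decimal_years("20-30", "year"))  # range already given in years
```

Units that are not in the table fail loudly rather than being guessed, which seems safer for a wrangling step.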
project.contributors.project_role.ontology_label
== "Principal Investigator" then project.contributors.name
or if project.contributors.corresponding_contributor
== "True" then project.contributors.name
this is less accurate, sometimes the first author will be a corresponding author as well, but it's more frequently filled in
specimen_from_organism.biomaterial_core.timecourse.*
or donor_organism.biomaterial_core.timecourse.*
donor_organism.development_stage.ontology_label
is the correct mapping but there’s more detailed information than just prenatal/postnatal
dissociation_protocol.method.ontology_label
describes the dissociation protocol but two different enzymatic protocols (collagen V, 25˚, 25’ or Trypsin, 4˚, 1h) would be considered the same - the label is not enough to distinguish between different protocols in one dataset, and would be repeated across datasets. If the aim is to distinguish different dissociation protocols used in one dataset dissociation_protocol.protocol_core.protocol_id
would be a better fit, although it might repeat across different datasets
enrichment_protocol.method.ontology_label
== "EFO:0009108" or enrichment_protocol.method.ontology_label
== "EFO:0009109" then enrichment_protocol.markers
+ enrichment_protocol.method.ontology_label
else enrichment_protocol.method.ontology_label
cell_suspension.growth_conditions.culture_environment
exists
sequence_file.library_prep_id
groups together files produced from the same library, not different libraries processed in the same machine/chip/plate. For plate based techniques we have cell_suspension.plate_based_sequencing.plate_label
but we don’t have an equivalent field for droplet techniques or spatial ones
sequence_file.lane_index
but it can only provide information about the same process.
insdc_experiment.insdc_experiment_accession
sequencing_protocol.10x.fastq_method
is not a good match, that’s supposed to be filled with software to make the fastq files rather than aligning them
*.biomaterial_core.biomaterial_description
can be a catch-all field for comments or extra information that doesn’t fit into the schema
Suggestions have been sent, waiting for any feedback or questions.
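The study PI rule from the feedback above (take project.contributors.name where the project role is "Principal Investigator", otherwise fall back to corresponding contributors) could look roughly like the sketch below. It assumes contributors are available as plain dictionaries shaped like the schema fields; the function name and the dictionary layout are illustrative assumptions.

```python
def select_study_pi(contributors):
    """Return contributor names to use as the study PI.

    Preference order, as described in the thread:
    1. contributors whose project_role.ontology_label is "Principal Investigator"
    2. otherwise, contributors flagged as corresponding_contributor
    """
    pis = [
        c["name"]
        for c in contributors
        if c.get("project_role", {}).get("ontology_label") == "Principal Investigator"
    ]
    if pis:
        return pis
    # Fallback: less accurate, but the flag is filled in more frequently.
    # Both a boolean and the spreadsheet-style string "True" are accepted (assumption).
    return [
        c["name"]
        for c in contributors
        if c.get("corresponding_contributor") in (True, "True")
    ]


example = [
    {"name": "A,,Author", "corresponding_contributor": True},
    {"name": "B,,Lead", "project_role": {"ontology_label": "Principal Investigator"}},
]
print(select_study_pi(example))
```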
Got a reply
There are now 11 gaps I still need to fill, and I wanted to ask whether you would be able to work with me to fill these gaps - while I'm enthusiastically learning about metadata I don't have the depth of expertise required to define ontology terms. I'd also be keen to discuss some of your comments.
The 11 missing gaps are:
Although we do not have exact mapping between the DCP metadata and those metadata, if we are obliged to fill these fields, here are some thoughts on that.
Given that the factor that separates datasets within the same study is the library that was used, or any other specified metadata field, we could also add the dataset ID, for example "Theinpont_2018_10Xv1". For the dataset name, following the way CxG names the datasets of a collection, we could use the publication title followed by the separating factor in parentheses. CxG is highly dependent on the number of count matrices & cell embedding coordinates that the authors provide.
Since organ_part is ontologised we could potentially extract the parent ontology term of organ_part. The ontologies for level_1, 2 and 3 could be restricted to specific classes.
Although it is not accurate, if we are obliged to complete it, we could use the sequencing_protocol.10x.fastq_method
since alignment is usually part of the same 10x pipeline as the fastq creation method.
After discussion with Tony, I will create a report to describe the current situation between the CellxGene, DCP and Integration teams and the mapping of those terms, and propose some options.
On 21 September, we had a call with Lucia Robson and Ellen Todres, and we discussed the DCP mapping for all the Tier 1 metadata. There were some requests on specific fields for DCP metadata.
The requested changes that were discussed were the following:
Other comments that were made:
The study subgroup that the participant belongs to. This indicates whether the participant was a postmortem donor, an organ donor, or a surgical donor (includes blood samples / biopsies)
Asked some more info about the sample_source field, in order to proceed accordingly, with the transplant PR. Malte Luecken replied:
I wonder if organ donor might also include tissue from organs that were rejected for donation. That would be a larger group of samples than only allografts/xenografts.
After a Miro board brainstorm I drafted a reply, in order to separate the questions this field asks, and define which of those is the required information.
Hello Ellen and Malte,
Thank you for the replies!
It is my understanding that a sample from a post-mortem donor will be of lower quality compared to one from a living donor, while in the case of a transplant, we might have genetic material from multiple organisms in the same sample. Are these all the different effects we would like to record in the sample_source field or is there something more?
Taken from the options in the enum and your point Malte, the sample_source information, could be broken down to 2+1 questions:
- was the donor deceased at the time of collection?
- is the sample part of a transplant organ (either allograft or xenograft)?
- if Q2 is yes, transplant might be healthy or rejected after the surgical procedure (either hyperacute, acute or chronic rejection).
Based on the questions above, I understand we have the following decision tree: If Q1 is yes, then sample_source should be
post-mortem donor
. If Q2 is yes, then sample_source should be organ donor
(if this is the case, another name might be more descriptive, for example transplant tissue
). If the answer to Q1 & Q2 is no, then sample_source should be surgical donor
.
A small note here, if both Q1 & Q2 are yes, i.e. tissue of the deceased was a transplant, we have to decide which information we record, alive/dead or transplant/not transplant.
About your point, Malte, if an organ was rejected for donation before the organ transplantation surgical procedure, then the sample would be healthy (in order to be considered initially suitable for transplant) and would not have tissue (or severely interacted with tissue) from another organism.
A. Would we still like to define the tissue rejected for donation as
organ donor
or surgical donor
would be more suitable?
Finally, I understand that for Tier 1 metadata, we would like to have simple metadata, so I would like to ask whether the following metadata would be of interest in recording here:
B. Transplant is from the same organism, same species or different species (autograft; allograft; xenograft) described here
C. Transplant was rejected and what type of rejection it was (hyperacute; acute; chronic) described here
Thank you both for your feedback, I am looking forward to your thoughts.
Malte replied
I don't think that "organ_donor" covers any organs that were at some point transplanted into another host and then removed for sampling. It was my understanding that these tissue samples are from individuals who donate their organs to science and maybe their organs were previously considered not fit for transplantation (this is what I meant with rejected). I'm checking this with some bionetwork coordinators now to check that my understanding is correct though. Maybe Chloe could weigh in here too.
Surgical donor would then be where the individual is still alive and part of their tissue is taken out. Overall I don't think xenograft/allografts play a role in this metadata field at all. But again, I'm not an expert in tissue sampling.
As for the reason for tissue transplant rejection, I don't think we would be able to get that information. It may also be restricted access/protected.
Our reply
Hi Malte, I hope you're doing well and thanks for your earlier input about the transplanted organ. I have a few follow-up questions to help me better understand how we differentiate between "organ donor" and the other options "post-mortem donor" and "surgical donor":
- Does the category "organ donor" specifically pertain to the entire organ being collected, or can it also include cases where only a part of tissue is collected?
- Is "organ donor" a category that can apply to both living and deceased individuals?
Any insights on these points would be greatly appreciated as we work to tackle the updates on the DCP metadata schema. Thanks in advance, and I look forward to hearing from you!
Malte reply:
I will try to answer these questions as best I can, but I just want to highlight that I'm really not the expert here as I haven't collected tissue myself. Maybe it's worth also talking to someone with a more biological/clinical background.
- Does the category "organ donor" specifically pertain to the entire organ being collected, or can it also include cases where only a part of tissue is collected? This I can't really answer, as I don't know the clinical practice for sample collection from tissue that is collected for scientific purposes.
- Is "organ donor" a category that can apply to both living and deceased individuals? My understanding is that "organ donor" applies to samples from deceased individuals who donate their organs to science. The quality of these samples is usually not as good as from living individuals.
Where did the categories post-mortem donor, organ donor, and surgical donor come from? Maybe it's worth checking more in that resource. My main point earlier was that already transplanted tissue (including allografts/xenografts) are not likely to be samples that end up in reference atlases as these look very different from "normal tissue".
Current snapshot of Tier 1 mapping:
alignment_software
PR has been made HumanCellAtlas/metadata-schema#1534
sample_source
discussions about definitions and differentiations of 3 options stalled
library_ID
fields internal (EBI/UCSC) discussions on how to model this information
Update on library_ID
after EBI internal discussion today.
We have the following options:
library_preparation_batch
& library_sequencing_batch
library_ID
, sequence_batch
, cell_suspension_ID
, analysis_file_ID
analysis_file_ID
to connect back to specific analysis_file
Option 1 will result in multiple redesigns of ingest, import and data browser, and will need a lot of time to design and apply those changes. It was voted down by both UCSC and EBI wranglers.
Between options 2 and 3 we (EBI wranglers) decided to proceed with option 2 since it will need fewer changes in the schema and in the ingestion of the data, and will be more intuitive for the wrangler to fill in compared with option 3.
After Hannes' comments, alignment_software
& alignment_software_version
were moved to analysis_protocol & converted to optional (alignment_software in dependentRequired for alignment_software_version)
Field gene_annotation_version
Pending discussion on the library id and sample source - schedule discussion with Tony @arschat
Current options for library_ID:
donor
, sample
, library
) it would be similar to our (donor_organism
, specimen_from_organism
, cell_suspension
) 3 level biomaterials of a simple experimental design. For most experiments we've described, however, one cell_suspension provides one library.
library_IDs associated with each. However, this way, we would not be able to associate analysis_files with the library_IDs, if the libraries from 1 cell_suspension are pooled.
About organ_donor
in sample_source
we got the following clarification from Lucia & Malte.
If I recall correctly, this is relevant as post-mortem tissue that is no longer properly perfused shows a distinct transcriptomic effect (you never get tissue immediately after death, usually 2-3 hours is minimum). Organ donor samples are from individuals that are e.g., brain dead but where the organ is still kept alive so that no cellular degeneration is evident due to lack of perfusion. This looks more like living tissue than post-mortem. On the other hand, surgical tissue is from an individual who is still very much alive and may have e.g., eaten and metabolized food recently.
I'm not sure I would say "organ donor from a living subject", as organs will only be removed if the donor is declared brain-dead... which I'm not sure counts as living. Any bionetwork coordinator for bionetwork that collected tissue blocks of some sort will know this better than me though.
There are two options.
donor_type
enum field
organ_donor
boolean field
Second option is clearer and more robust, and information about surgical donor is recorded in the collection protocol.
About mapping from Tier 1 to DCP:
    if sample_source == "organ donor":
        donor_organism.organ_donor = True
        donor_organism.is_living = False *
    elif sample_source == "post-mortem donor":
        donor_organism.organ_donor = False
        donor_organism.is_living = False
    elif sample_source == "surgical donor":
        donor_organism.organ_donor = False
        donor_organism.is_living = True
Although with Tier 1 modeling, there is ambiguity for the is_living option if the subject is an organ donor, for most organs/ bionetworks the subject should be considered deceased (ideally we could specify the donor_organism.death.organ_donation_death_type
).
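As a runnable version of the mapping above, a minimal sketch could use a lookup table instead of chained conditions. The three sample_source values and the two donor_organism fields come from the mapping in this comment; the function name and the choice to raise on unknown values are assumptions for illustration.

```python
# Tier 1 sample_source value -> DCP donor_organism fields (values from the thread).
SAMPLE_SOURCE_TO_DCP = {
    "organ donor": {
        "donor_organism.organ_donor": True,
        # Ambiguous in Tier 1; treated as deceased for most organs/bionetworks.
        "donor_organism.is_living": False,
    },
    "post-mortem donor": {
        "donor_organism.organ_donor": False,
        "donor_organism.is_living": False,
    },
    "surgical donor": {
        "donor_organism.organ_donor": False,
        "donor_organism.is_living": True,
    },
}


def map_sample_source(sample_source: str) -> dict:
    """Return the DCP fields implied by a Tier 1 sample_source value."""
    try:
        # Copy so callers cannot mutate the shared table.
        return dict(SAMPLE_SOURCE_TO_DCP[sample_source])
    except KeyError:
        raise ValueError(f"Unknown sample_source: {sample_source!r}") from None


print(map_sample_source("surgical donor"))
```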
Conversion from an AnnData Tier 1 object to a DCP spreadsheet is done with a Jupyter notebook and an interchangeable mapping dictionary here.
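To illustrate the idea of an interchangeable mapping dictionary, here is a small sketch that renames Tier 1 keys to DCP field names. It is not the notebook referenced above: the function name and the record layout are assumptions, and the dictionary holds only a few entries copied from the mapping table in this thread.

```python
# A few Tier 1 -> DCP pairs taken from the mapping table; swap in another
# dictionary to change the mapping without touching the conversion code.
TIER1_TO_DCP = {
    "title": "project.project_core.project_title",
    "sample_id": "specimen_from_organism.biomaterial_core.biomaterial_id",
    "donor_id": "donor_organism.biomaterial_core.biomaterial_id",
    "library_id": "cell_suspension.biomaterial_core.biomaterial_id",
}


def convert_record(tier1_record, mapping=TIER1_TO_DCP):
    """Rename Tier 1 keys to DCP field names.

    Returns the converted record and the list of Tier 1 keys that have
    no entry in the mapping (for example fields marked NA in the table).
    """
    converted = {}
    unmapped = []
    for key, value in tier1_record.items():
        if key in mapping:
            converted[mapping[key]] = value
        else:
            unmapped.append(key)
    return converted, unmapped


record = {"donor_id": "donor_1", "sample_id": "sample_1", "comments": "free text"}
converted, unmapped = convert_record(record)
print(converted)
print(unmapped)
```

Returning the unmapped keys separately keeps fields such as comments visible instead of dropping them silently.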
Conversion from Tier 1 to DCP moved here #1252
transplant_organ
PR merged!
Working on intron_inclusion
PR.
intron_inclusion
PR merged!
last thing to discuss:
We are adding a field for sequencing batch. It will be important for projects deposited directly with HCA that don't have run accessions.
library_sequence_run
:
This could be either the ID of the sequence RUN or the sequence BATCH. Asked for clarification and got the following replies:
my assumption is that we're interested in the batch (as there shouldn't be any differences between runs if they're all being processed at the same time). However, I'm absolutely not an expert so need to check with someone and get back to you
Confirming that library_sequencing_run is a custom field. Also, library_sequencing_run is the higher order term, whereas library_preparation_batch I think is the lower order term as it's referring to libraries sequenced on the same plate/chip
sequence_run_batch
which is equivalent to library_sequence_run
is now merged in prod.
sample_collection_site
can be recorded in the process.process_core.location
.
Mapping of Tier 1 has been completed, referenced here.
Tier 1 fields that are not mapped:
batch_condition, default_embedding, comments, author_batch_notes, tissue_type, is_primary_data, author_cell_type, cell_type_ontology_term_id
All mapping here.
Tier 1 | HCA metadata schema |
---|---|
title | project.project_core.project_title |
study_pi | project.contributors.name |
batch_condition | NA |
default_embedding | NA |
comments | NA |
sample_id | specimen_from_organism.biomaterial_core.biomaterial_id |
donor_id | donor_organism.biomaterial_core.biomaterial_id |
protocol_url | library_preparation_protocol.protocol_core.protocols_io_doi |
institute | project.contributors.institute |
sample_collection_site | process.process_core.location |
sample_collection_relative_time_point | specimen_from_organism.biomaterial_core.timecourse.value |
library_id | cell_suspension.biomaterial_core.biomaterial_id |
library_id_repository | cell_suspension.biomaterial_core.biomaterial_name |
author_batch_notes | NA |
organism_ontology_term_id | donor_organism.biomaterial_core.ncbi_taxon_id |
manner_of_death | donor_organism.death.hardy_scale |
sample_source | donor_organism.is_living & specimen_from_organism.transplant_organ |
sex_ontology_term_id | donor_organism.sex |
sample_collection_method | collection_protocol.method.text |
tissue_type | NA |
sampled_site_condition | donor_organism.diseases.text & specimen_from_organism.diseases.text |
tissue_ontology_term_id | specimen_from_organism.organ.ontology |
tissue_free_text | specimen_from_organism.organ.text |
sample_preservation_method | specimen_from_organism.preservation_storage.storage_method |
suspension_type | library_preparation_protocol.nucleic_acid_source |
cell_enrichment | enrichment_protocol.markers |
cell_viability_percentage | cell_suspension.cell_morphology.percent_cell_viability |
cell_number_loaded | cell_suspension.estimated_cell_count |
sample_collection_year | specimen_from_organism.collection_time |
assay_ontology_term_id | library_preparation_protocol.library_construction_method.ontology |
library_preparation_batch | sequence_file.library_prep_id |
library_sequencing_run | sequence_run_batch |
sequenced_fragment | library_preparation_protocol.end_bias |
sequencing_platform | sequencing_protocol.instrument_manufacturer_model.text |
is_primary_data | NA |
reference_genome | analysis_file.genome_assembly_version |
gene_annotation_version | analysis_protocol.gene_annotation_version |
alignment_software | analysis_protocol.alignment_software_version |
intron_inclusion | analysis_protocol.intron_inclusion |
author_cell_type | NA |
cell_type_ontology_term_id | NA |
disease_ontology_term_id | donor_organism.diseases.ontology |
self_reported_ethnicity_ontology_term_id | donor_organism.human_specific.ethnicity.ontology |
development_stage_ontology_term_id | donor_organism.development_stage.ontology |
This ticket can now close.
Description of the task:
We are given a list of the Tier 1 metadata that are gonna be used in the integration. We are asked to map those metadata to fields in our metadata schema, and provide example values of each field.
Here is the drive folder with the spreadsheet https://drive.google.com/drive/folders/1fobiz332ylmPc738dSoLSQEM7TxjYJBF?usp=sharing
Acceptance criteria for the task:
Wranglers have given their feedback and @arschat has summarised all feedback and replied to the HCA Bionetworks committee, with a suggested mapping and some comments and questions.