ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/

HCA Tier 1 metadata mapping to DCP metadata fields #1178

Closed arschat closed 2 months ago

arschat commented 1 year ago

Description of the task:

We have been given a list of the Tier 1 metadata that are going to be used in the integration. We are asked to map those metadata to fields in our metadata schema and to provide example values for each field.

Here is the drive folder with the spreadsheet https://drive.google.com/drive/folders/1fobiz332ylmPc738dSoLSQEM7TxjYJBF?usp=sharing

Acceptance criteria for the task:

Wranglers have given their feedback, and @arschat has summarised all feedback and replied to the HCA Bionetworks committee with a suggested mapping and some comments and questions.

arschat commented 1 year ago

Notes on differences in fields:

Conditional fields

Fields that are not identical but can be easily converted from HCA metadata schema standards to Tier 1

Questions about fields:

Other note

For all ontologised fields we have three separate fields: text, ontology and ontology_label. text holds free text, ontology holds the corresponding ontology accession, and ontology_label holds the label of that ontology term. In some cases more detailed text is added in the text field, while the other two fields are constrained by the ontology.
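As an illustration of the three-part layout, here is a minimal Python sketch. The organ value is an example chosen here (UBERON:0002048 is the UBERON term for lung), and `has_all_parts` is a hypothetical helper, not part of the schema tooling.

```python
# Hypothetical example of an ontologised field with its three parts.
organ = {
    "text": "left lung, upper lobe",  # free text, may be more detailed
    "ontology": "UBERON:0002048",     # ontology accession
    "ontology_label": "lung",         # label of that ontology term
}


def has_all_parts(field):
    """Return True if the field carries all three non-empty parts."""
    return all(field.get(key) for key in ("text", "ontology", "ontology_label"))


print(has_all_parts(organ))
```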

idazucchi commented 1 year ago
arschat commented 12 months ago

suggestions have been sent, waiting for any feedback or questions.

arschat commented 12 months ago

Got a reply

There are now 11 gaps I still need to fill, and I wanted to ask whether you would be able to work with me to fill these gaps - while I'm enthusiastically learning about metadata I don't have the depth of expertise required to define ontology terms. I'd also be keen to discuss some of your comments.

The 11 missing gaps are:

Notes on the missing fields:

Although we do not have exact mapping between the DCP metadata and those metadata, if we are obliged to fill these fields, here are some thoughts on that.

dataset

Given that the factor separating datasets within the same study is the library that was used (or another specified metadata field), we can include it in the dataset ID as well, e.g. "Theinpont_2018_10Xv1". For the dataset name, following the way CxG names the datasets of a collection, we could use the publication title followed by the separating factor in parentheses. Note that how CxG splits a collection into datasets depends heavily on the number of count matrices and cell embedding coordinates that the authors provide.

anatomical_region_level1 and anatomical_region_level2

Since organ_part is ontologised, we could potentially extract the parent ontology terms of organ_part. The ontologies for levels 1, 2 and 3 could each be restricted to specific classes.
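The idea of deriving the coarser levels from organ_part could be sketched as a walk up the ontology. The parent table below is a hand-written toy (labels only, no real accessions) and `ancestors` is a hypothetical helper; a real implementation would query the ontology itself, where a term can have more than one parent. Which ancestor counts as level 1 or level 2 is also an assumption.

```python
# Toy child -> parent table, written by hand for illustration only.
PARENT = {
    "upper lobe of left lung": "left lung",
    "left lung": "lung",
    "lung": "respiratory system",
}


def ancestors(term):
    """Return the chain of parent terms, nearest parent first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


organ_part = "upper lobe of left lung"
chain = ancestors(organ_part)
# Assumption: level 2 is the immediate parent and level 1 the next one up.
anatomical_region_level2 = chain[0] if len(chain) > 0 else None
anatomical_region_level1 = chain[1] if len(chain) > 1 else None
print(anatomical_region_level1, anatomical_region_level2)
```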

alignment_software

Although it is not accurate, if we are obliged to complete this field we could use sequencing_protocol.10x.fastq_method, since alignment is usually part of the same 10x pipeline as the fastq creation method.

arschat commented 12 months ago

After discussion with Tony, I will create a report describing the current situation between the CellxGene, DCP and Integration teams and the mapping of those terms, and proposing some options.

arschat commented 11 months ago

On 21 September, we had a call with Lucia Robson and Ellen Todres, and we discussed the DCP mapping for all the Tier 1 metadata. There were some requests on specific fields for DCP metadata.

The requested changes that were discussed were the following:

  1. library_ID field
  2. library_preparation_batch field
    • sequence_file.library_prep_id is only available if we have seq files
    • addition; in library_ID field or in type/file/analysis_file.json / type/file/sequence_file.json
  3. library_sequencing_batch
    • sequence_file.insdc_run_accessions works but if we do not have accession, there should be a field to record this
    • addition optional; minor update in type/file/sequence_file.json
  4. alignment_software field
    • addition; major update in the type/file/analysis_file.json

Other comments that were made:

arschat commented 11 months ago

Asked for more information about the sample_source field so that we can proceed accordingly with the transplant PR. Malte Luecken replied:

I wonder if organ donor might also include tissue from organs that were rejected for donation. That would be a larger group of samples than only allografts/xenografts.

After a Miro board brainstorm I drafted a reply to separate the questions this field asks and to define which of them is the required information.

Hello Ellen and Malte,

Thank you for the replies!

It is my understanding that a sample from a post-mortem donor will be of lower quality than one from a living donor, while in the case of a transplant we might have genetic material from multiple organisms in the same sample. Are these all the different effects we would like to record in the sample_source field, or is there something more?

Taking the options in the enum and your point, Malte, the sample_source information could be broken down into 2+1 questions:

  1. was the donor deceased at the time of collection?
  2. is the sample part of a transplant organ (either allograft or xenograft)?
  3. if Q2 is yes, transplant might be healthy or rejected after the surgical procedure (either hyperacute, acute or chronic rejection).

Based on the questions above, I understand we have the following decision tree: If Q1 is yes, then sample_source should be post-mortem donor. If Q2 is yes, then sample_source should be organ donor (if this is the case, another name might be more descriptive for example transplant tissue). If answer to Q1 & Q2 is no, then sample_source should be surgical donor.

A small note here: if both Q1 and Q2 are yes, i.e. the tissue of the deceased was a transplant, we have to decide which information we record, alive/dead or transplant/not transplant.

About your point, Malte: if an organ was rejected for donation before the organ transplantation surgical procedure, then the sample would be healthy (in order to be considered initially suitable for transplant) and would not contain tissue from another organism (or have interacted severely with it). A. Would we still like to define the tissue rejected for donation as organ donor, or would surgical donor be more suitable?

Finally, I understand that for Tier 1 metadata we would like to have simple metadata, so I would like to ask whether the following metadata would be of interest to record here:
B. Transplant is from the same organism, same species or different species (autograft; allograft; xenograft) described here
C. Transplant was rejected and what type of rejection it was (hyperacute; acute; chronic) described here

Thank you both for your feedback, I am looking forward to your thoughts.
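The decision tree drafted in the message above can be sketched as a small Python function. The function and parameter names are hypothetical, and raising an error when both answers are yes is only a placeholder for the precedence question that the draft leaves open.

```python
def sample_source_from_answers(donor_deceased, from_transplant):
    """Apply the drafted decision tree (Q1 and Q2 given as booleans)."""
    if donor_deceased and from_transplant:
        # Left undecided in the draft: which information takes priority.
        raise ValueError("Q1 and Q2 are both yes; precedence not yet decided")
    if donor_deceased:
        return "post-mortem donor"
    if from_transplant:
        return "organ donor"
    return "surgical donor"


print(sample_source_from_answers(donor_deceased=False, from_transplant=False))
```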

arschat commented 11 months ago

Malte replied

I don't think that "organ_donor" covers any organs that were at some point transplanted into another host and then removed for sampling. It was my understanding that these tissue samples are from individuals who donate their organs to science and maybe their organs were previously considered not fit for transplantation (this is what I meant with rejected). I'm checking this with some bionetwork coordinators now to check that my understanding is correct though. Maybe Chloe could weigh in here too.

Surgical donor would then be where the individual is still alive and part of their tissue is taken out. Overall I don't think xenograft/allografts play a role in this metadata field at all. But again, I'm not an expert in tissue sampling.

As for the reason for tissue transplant rejection, I don't think we would be able to get that information. It may also be restricted access/protected.

Our reply

Hi Malte, I hope you're doing well and thanks for your earlier input about the transplanted organ. I have a few follow-up questions to help me better understand how we differentiate between "organ donor" and the other options "post-mortem donor" and "surgical donor":

  1. Does the category "organ donor" specifically pertain to the entire organ being collected, or can it also include cases where only a part of tissue is collected?
  2. Is "organ donor" a category that can apply to both living and deceased individuals?

Any insights on these points would be greatly appreciated as we work to tackle the updates on the DCP metadata schema. Thanks in advance, and I look forward to hearing from you!

Malte reply:

I will try to answer these questions as best I can, but I just want to highlight that I'm really not the expert here as I haven't collected tissue myself. Maybe it's worth also talking to someone with a more biological/clinical background.

  1. Does the category "organ donor" specifically pertain to the entire organ being collected, or can it also include cases where only a part of tissue is collected? This I can't really answer, as I don't know the clinical practice for sample collection from tissue that is collected for scientific purposes.
  2. Is "organ donor" a category that can apply to both living and deceased individuals? My understanding is that "organ donor" applies to samples from deceased individuals who donate their organs to science. The quality of these samples is usually not as good as from living individuals.

Where did the categories post-mortem donor, organ donor, and surgical donor come from? Maybe it's worth checking more in that resource. My main point earlier was that already transplanted tissue (including allografts/xenografts) are not likely to be samples that end up in reference atlases as these look very different from "normal tissue".

arschat commented 11 months ago

Current snapshot of Tier 1 mapping:

arschat commented 11 months ago

Update on library_ID after EBI internal discussion today. We have the following options:

  1. library as a new biomaterial entity:
    • create a new entity, that has the following fields:
      1. library.biomaterial_core
      2. library.preparation_batch
      3. library.sequencing_batch
      4. cell_suspension.biomaterial_core.biomaterial_id
  2. cell_suspension as library:
    • instead of creating a new library biomaterial entity, we will use the cell_suspension entity to describe libraries
    • we will need two new fields in cell_suspension, to describe the library_preparation_batch & library_sequencing_batch
  3. batch module in analysis_file
    • create a batch module in analysis_file that contains library_ID, sequence_batch, cell_suspension_ID, analysis_file_ID
    • module will be in a separate tab (the way projects.publication does) but will also include an analysis_file_ID to connect back to specific analysis_file

Option 1 will result in multiple redesigns of ingest, import and data browser, and will need a lot of time to design and apply those changes. It was voted down by both UCSC and EBI wranglers.

Between options 2 and 3, we (EBI wranglers) decided to proceed with option 2, since it needs fewer changes to the schema and to data ingestion, and will be more intuitive for the wrangler to fill in than option 3.
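As a rough illustration of option 2, the sketch below models a cell_suspension record carrying the two proposed batch fields. The class, the flat field layout and the example IDs are all hypothetical; the real fields would be defined in the cell_suspension JSON schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class CellSuspension:
    """Hypothetical stand-in for a cell_suspension used as the library."""
    biomaterial_id: str                              # doubles as library_ID
    library_preparation_batch: Optional[str] = None  # proposed new field
    library_sequencing_batch: Optional[str] = None   # proposed new field


suspensions = [
    CellSuspension("lib_1", "prep_A", "seq_1"),
    CellSuspension("lib_2", "prep_A", "seq_1"),
    CellSuspension("lib_3", "prep_B", "seq_1"),
]

# Group library IDs by preparation batch.
by_prep_batch = defaultdict(list)
for suspension in suspensions:
    by_prep_batch[suspension.library_preparation_batch].append(suspension.biomaterial_id)

print(dict(by_prep_batch))
```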

arschat commented 10 months ago

Alignment software PR

After Hannes' comments, alignment_software & alignment_software_version were moved to analysis_protocol and made optional (alignment_software is listed in dependentRequired for alignment_software_version).
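A minimal sketch of what that dependentRequired relationship means, assuming JSON Schema draft 2019-09 or later semantics. The fragment is illustrative and is not copied from the actual analysis_protocol schema; `missing_dependencies` is a hypothetical, hand-rolled check of this one keyword rather than a full validator, and the example values are made up.

```python
# Illustrative fragment only, not the real analysis_protocol schema.
schema_fragment = {
    "properties": {
        "alignment_software": {"type": "string"},
        "alignment_software_version": {"type": "string"},
    },
    # If the version is given, the software name must be given too.
    "dependentRequired": {
        "alignment_software_version": ["alignment_software"],
    },
}


def missing_dependencies(instance, schema):
    """Return the fields that dependentRequired demands but are absent."""
    missing = []
    for trigger, required in schema.get("dependentRequired", {}).items():
        if trigger in instance:
            missing.extend(name for name in required if name not in instance)
    return missing


print(missing_dependencies({"alignment_software_version": "7.0.0"}, schema_fragment))
```

Both fields stay optional on their own: an instance with neither field, or with only alignment_software, passes this check.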

arschat commented 9 months ago

Gene Annotation

Field gene_annotation_version

idazucchi commented 7 months ago

Pending discussion on the library id and sample source - schedule discussion with Tony @arschat

arschat commented 6 months ago

Current options for library_ID:

arschat commented 6 months ago

About organ_donor in sample_source we got the following clarification from Lucia & Malte.

If I recall correctly, this is relevant as post-mortem tissue that is no longer properly perfused shows a distinct transcriptomic effect (you never get tissue immediately after death, usually 2-3 hours is minimum). Organ donor samples are from individuals that are e.g., brain dead but where the organ is still kept alive so that no cellular degeneration is evident due to lack of perfusion. This looks more like living tissue than post-mortem. On the other hand, surgical tissue is from an individual who is still very much alive and may have e.g., eaten and metabolized food recently.

I'm not sure I would say "organ donor from a living subject", as organs will only be removed if the donor is declared brain-dead... which I'm not sure counts as living. Any bionetwork coordinator for bionetwork that collected tissue blocks of some sort will know this better than me though.

There are two options.

Second option is clearer and more robust, and information about surgical donor is recorded in the collection protocol.

About mapping from Tier 1 to DCP:

if sample_source == "organ donor":
    donor_organism.organ_donor = True
    donor_organism.is_living = False  # * ambiguous, see the note below
elif sample_source == "post-mortem donor":
    donor_organism.organ_donor = False
    donor_organism.is_living = False
elif sample_source == "surgical donor":
    donor_organism.organ_donor = False
    donor_organism.is_living = True

* With the Tier 1 modelling, the is_living value is ambiguous when the subject is an organ donor; for most organs/bionetworks the subject should be considered deceased (ideally we could specify the donor_organism.death.organ_donation_death_type).
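The branches above could also be written as a lookup table, which makes unexpected values fail loudly. This is only a sketch: `map_sample_source` and the flat dictionary keys are hypothetical, and the is_living value for organ donors carries the same ambiguity noted above.

```python
# Same three cases as the pseudocode above, as a lookup table.
SAMPLE_SOURCE_TO_DCP = {
    # is_living is ambiguous for organ donors; treated as deceased here.
    "organ donor": {"organ_donor": True, "is_living": False},
    "post-mortem donor": {"organ_donor": False, "is_living": False},
    "surgical donor": {"organ_donor": False, "is_living": True},
}


def map_sample_source(sample_source):
    """Return the donor_organism values for a Tier 1 sample_source."""
    try:
        return dict(SAMPLE_SOURCE_TO_DCP[sample_source])
    except KeyError:
        raise ValueError(f"unexpected sample_source: {sample_source!r}") from None


print(map_sample_source("surgical donor"))
```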

arschat commented 6 months ago

Last update on mapping here. Template with Tier 1 at row 4 here.

arschat commented 6 months ago

Conversion from an anndata Tier 1 object to a DCP spreadsheet, using a Jupyter notebook and an interchangeable mapping dictionary, here.

arschat commented 5 months ago

Conversion from Tier 1 to DCP moved here #1252

arschat commented 5 months ago

transplant_organ PR merged! Working on intron_inclusion PR.

idazucchi commented 5 months ago

intron_inclusion PR merged! last thing to discuss:

idazucchi commented 4 months ago

we are adding a field for sequencing batch - it will be important for projects deposited directly with HCA that don't have Run accessions

arschat commented 3 months ago

library_sequence_run: This could be either the ID of the sequence RUN or the sequence BATCH. Asked for clarification and got the following replies:

my assumption is that we're interested in the batch (as there shouldn't be any differences between runs if they're all being processed at the same time). However, I'm absolutely not an expert so need to check with someone and get back to you

Confirming that library_sequencing_run is a custom field. Also, library_sequencing_run is the higher order term, whereas library_preparation_batch I think is the lower order term, as it's referring to libraries sequenced on the same plate/chip

arschat commented 2 months ago

sequence_run_batch which is equivalent to library_sequence_run is now merged in prod. sample_collection_site can be recorded in the process.process_core.location.

Mapping of Tier 1 has been completed, referenced here.

Tier 1 fields that are not mapped:

batch_condition, default_embedding, comments, author_batch_notes, tissue_type, is_primary_data, author_cell_type, cell_type_ontology_term_id

All mapping here.

| Tier 1 | HCA metadata schema |
| --- | --- |
| title | project.project_core.project_title |
| study_pi | project.contributors.name |
| batch_condition | NA |
| default_embedding | NA |
| comments | NA |
| sample_id | specimen_from_organism.biomaterial_core.biomaterial_id |
| donor_id | donor_organism.biomaterial_core.biomaterial_id |
| protocol_url | library_preparation_protocol.protocol_core.protocols_io_doi |
| institute | project.contributors.institute |
| sample_collection_site | process.process_core.location |
| sample_collection_relative_time_point | specimen_from_organism.biomaterial_core.timecourse.value |
| library_id | cell_suspension.biomaterial_core.biomaterial_id |
| library_id_repository | cell_suspension.biomaterial_core.biomaterial_name |
| author_batch_notes | NA |
| organism_ontology_term_id | donor_organism.biomaterial_core.ncbi_taxon_id |
| manner_of_death | donor_organism.death.hardy_scale |
| sample_source | donor_organism.is_living & specimen_from_organism.transplant_organ |
| sex_ontology_term_id | donor_organism.sex |
| sample_collection_method | collection_protocol.method.text |
| tissue_type | NA |
| sampled_site_condition | donor_organism.diseases.text & specimen_from_organism.diseases.text |
| tissue_ontology_term_id | specimen_from_organism.organ.ontology |
| tissue_free_text | specimen_from_organism.organ.text |
| sample_preservation_method | specimen_from_organism.preservation_storage.storage_method |
| suspension_type | library_preparation_protocol.nucleic_acid_source |
| cell_enrichment | enrichment_protocol.markers |
| cell_viability_percentage | cell_suspension.cell_morphology.percent_cell_viability |
| cell_number_loaded | cell_suspension.estimated_cell_count |
| sample_collection_year | specimen_from_organism.collection_time |
| assay_ontology_term_id | library_preparation_protocol.library_construction_method.ontology |
| library_preparation_batch | sequence_file.library_prep_id |
| library_sequencing_run | sequence_run_batch |
| sequenced_fragment | library_preparation_protocol.end_bias |
| sequencing_platform | sequencing_protocol.instrument_manufacturer_model.text |
| is_primary_data | NA |
| reference_genome | analysis_file.genome_assembly_version |
| gene_annotation_version | analysis_protocol.gene_annotation_version |
| alignment_software | analysis_protocol.alignment_software_version |
| intron_inclusion | analysis_protocol.intron_inclusion |
| author_cell_type | NA |
| cell_type_ontology_term_id | NA |
| disease_ontology_term_id | donor_organism.diseases.ontology |
| self_reported_ethnicity_ontology_term_id | donor_organism.human_specific.ethnicity.ontology |
| development_stage_ontology_term_id | donor_organism.development_stage.ontology |
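A subset of the mapping above could be held in a dictionary and applied to a Tier 1 record roughly as follows. This is a sketch only: `convert_record` and the example record values are hypothetical, NA entries are represented as None, and this is not the code from the notebook mentioned earlier in the thread.

```python
# A few rows of the mapping table; None stands for "NA" (not mapped).
TIER1_TO_DCP = {
    "title": "project.project_core.project_title",
    "sample_id": "specimen_from_organism.biomaterial_core.biomaterial_id",
    "donor_id": "donor_organism.biomaterial_core.biomaterial_id",
    "library_id": "cell_suspension.biomaterial_core.biomaterial_id",
    "batch_condition": None,
}


def convert_record(tier1_record, mapping=TIER1_TO_DCP):
    """Rename Tier 1 keys to DCP field paths and collect unmapped fields."""
    converted, unmapped = {}, []
    for field, value in tier1_record.items():
        target = mapping.get(field)
        if target is None:
            unmapped.append(field)
        else:
            converted[target] = value
    return converted, unmapped


# Made-up example record.
record = {"sample_id": "S1", "donor_id": "D1", "batch_condition": "donor_id"}
converted, unmapped = convert_record(record)
print(converted)
print(unmapped)
```

Keeping the mapping as plain data means it can be swapped out without touching the conversion logic.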

This ticket can now close.