ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

CxG-Tier 1 to DCP notebook #1252

Open arschat opened 3 months ago

arschat commented 3 months ago

When Bionetworks provide Tier 1 metadata for CxG, we should be able to create DCP spreadsheets with their Tier 1 fields populated. We will need that for:

The action items are:

  1. create Jupyter notebook(s) that pull data from CxG
  2. export a DCP spreadsheet
  3. automate conditional field conversions

For future tasks we might want to add functionality such as:

arschat commented 3 months ago

Notebooks here. Will create a git repo to showcase changes.

arschat commented 3 months ago

Repo created at arschat/tier1_to_dcp

arschat commented 2 months ago

Exact map fields

doi: project.publications[0].doi,
title: project.project_core.project_title,
study_pi: project.contributors.name,
sample_id: specimen_from_organism.biomaterial_core.biomaterial_id,
donor_id: donor_organism.biomaterial_core.biomaterial_id,
protocol_url: library_preparation_protocol.protocol_core.protocols_io_doi,
library_id: cell_suspension.biomaterial_core.biomaterial_id,
library_id_repository: cell_suspension.biomaterial_core.biomaterial_name,
sample_collection_method: collection_protocol.method.text,
tissue_ontology_term_id: specimen_from_organism.organ_parts.ontology,
tissue_free_text: specimen_from_organism.organ_parts.text,
sample_preservation_method: specimen_from_organism.preservation_storage.storage_method,
suspension_type: library_preparation_protocol.nucleic_acid_source,
cell_viability_percentage: cell_suspension.cell_morphology.percent_cell_viability,
cell_number_loaded: cell_suspension.estimated_cell_count,
sample_collection_year: specimen_from_organism.collection_time,
assay_ontology_term_id: library_preparation_protocol.library_construction_method.ontology,
library_preparation_batch: sequence_file.library_prep_id,
sequenced_fragment: library_preparation_protocol.end_bias,
sequencing_platform: sequencing_protocol.instrument_manufacturer_model.ontology,
reference_genome: analysis_file.genome_assembly_version,
gene_annotation_version: analysis_protocol.gene_annotation_version,
intron_inclusion: analysis_protocol.intron_inclusion,
disease_ontology_term_id: donor_organism.diseases.ontology,
self_reported_ethnicity_ontology_term_id: donor_organism.human_specific.ethnicity.ontology,
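The exact mappings above amount to a key rename, followed by grouping on the entity prefix. The following is a minimal sketch of that idea, not the code in arschat/tier1_to_dcp: the helper names `rename_exact_fields` and `group_by_entity` are hypothetical, only a subset of the map is included, and treating the first dotted component as the spreadsheet tab is an assumption.

```python
# Subset of the exact Tier 1 -> DCP field map listed in this comment.
EXACT_MAP = {
    "title": "project.project_core.project_title",
    "sample_id": "specimen_from_organism.biomaterial_core.biomaterial_id",
    "donor_id": "donor_organism.biomaterial_core.biomaterial_id",
    "library_id": "cell_suspension.biomaterial_core.biomaterial_id",
}


def rename_exact_fields(row, field_map=EXACT_MAP):
    """Return a new dict with Tier 1 keys renamed to DCP field paths.

    Keys that have no exact mapping are dropped (hypothetical behaviour).
    """
    return {field_map[key]: value for key, value in row.items() if key in field_map}


def group_by_entity(dcp_row):
    """Group DCP fields by their first dotted component.

    Assumption: that component corresponds to one spreadsheet tab.
    """
    tabs = {}
    for key, value in dcp_row.items():
        entity = key.split(".", 1)[0]
        tabs.setdefault(entity, {})[key] = value
    return tabs


if __name__ == "__main__":
    tier1_row = {"title": "Example project", "sample_id": "S1", "donor_id": "D1"}
    print(group_by_entity(rename_exact_fields(tier1_row)))
```

Keeping the map as plain data means a new exact mapping is a one-line change, while the conditional conversions can live in separate functions.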

Not implemented yet

institute @ sample level
sample_collection_site

Conversion implemented

sample_collection_relative_time_point: specimen_from_organism.biomaterial_core.timecourse.value,
organism_ontology_term_id: donor_organism.biomaterial_core.ncbi_taxon_id,
sex_ontology_term_id: donor_organism.sex,
manner_of_death: donor_organism.death.hardy_scale & donor_organism.is_living,
sample_source: donor_organism.is_living & specimen_of_organism.transplant_organ,
sampled_site_condition: specimen_from_organism.diseases.text, # if is healthy PATO, if adjacent PATO & adjacent disease_ontology_term_id, else disease_ontology_term_id
alignment_software: analysis_protocol.alignment_software & analysis_protocol.alignment_software_version,
library_sequencing_run: library_sequencing_run, # if library_sequencing_run is an INSDC accession
cell_enrichment: enrichment_protocol.markers, # if CL ontology add CL label
development_stage_ontology_term_id: donor_organism.organism_age

Automatic assignment of protocol_ids to biomaterials & files
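Two of the conversions above can be sketched as small functions. This is an illustrative sketch, not the notebook's implementation: the function names are hypothetical, the rule that the version is a trailing token beginning with a digit (optionally prefixed with "v") is an assumption, and the accession check only covers the common INSDC run pattern (SRR, ERR or DRR followed by digits).

```python
import re

# Assumption: a Tier 1 alignment_software value looks like "<name> <version>",
# where the version is the last token and starts with a digit (optional "v").
_SOFTWARE_RE = re.compile(r"^(?P<name>.+?)[\s_-]+v?(?P<version>\d[\w.]*)$")

# INSDC run accessions: SRR (SRA), ERR (ENA) or DRR (DDBJ) plus at least 6 digits.
_INSDC_RUN_RE = re.compile(r"[SED]RR\d{6,}")


def split_alignment_software(value):
    """Split one Tier 1 value into the two DCP analysis_protocol fields."""
    text = value.strip()
    match = _SOFTWARE_RE.match(text)
    if match is None:
        # No recognisable version: keep the whole string as the software name.
        return {
            "analysis_protocol.alignment_software": text,
            "analysis_protocol.alignment_software_version": None,
        }
    return {
        "analysis_protocol.alignment_software": match["name"],
        "analysis_protocol.alignment_software_version": match["version"],
    }


def is_insdc_run_accession(value):
    """Return True if the whole value looks like an INSDC run accession."""
    return _INSDC_RUN_RE.fullmatch(value.strip()) is not None


if __name__ == "__main__":
    print(split_alignment_software("cellranger 3.0.2"))
    print(is_insdc_run_accession("SRR1234567"))
```

Values that fail the accession check would need a different handling path; what that path should be is not specified in this issue.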

Not implemented, with no planned implementation for now

tissue_type

batch_condition
default_embedding
comments
author_batch_notes
is_primary_data
author_cell_type
cell_type_ontology_term_id

arschat commented 2 weeks ago

Stalled until sequencing_run_id is pushed