brain-bican / models

BICAN data models
https://brain-bican.github.io/models/

New csv converter #23

Closed djarecka closed 2 months ago

djarecka commented 5 months ago

Replacement for #19. This is a new converter that creates a LinkML YAML file based on:

When run on the CSV exported from the gdoc format, it gives the following YAML model:

yaml file content

```
id: https://identifiers.org/brain-bican/kb-model
name: kb-model
prefixes:
  linkml: https://w3id.org/linkml/
  BIOLINK: https://raw.githubusercontent.com/biolink/biolink-model/latest/
  bican: https://identifiers.org/brain-bican/vocab/
  spdx: http://spdx.org/rdf/terms#
  schema: http://schema.org/
  ncbi: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi
imports:
- linkml:types
- BIOLINK:biolink-model
default_range: string
default_prefix: bican
subsets:
  bican:
    description: A subset of classes that are associated with BICAN.
  gars:
    description: A subset of classes that are associated with GARS.
  tissue_specimen:
    description: A subset of classes that are associated with tissue specimens.
  library_generation:
    description: A subset of classes that are associated with library generation.
  sequencing_elements:
    description: A subset of classes that are associated with sequencing.
  processing_elements:
    description: A subset of classes that are associated with processing.
  analysis:
    description: A subset of slots/attributes that are required for analysis.
  tracking:
    description: A subset of slots/attributes that are required for tracking.
  alignment:
    description: A subset of slots/attributes that are required for alignment.
classes:
  prov activity:
    mixin: true
    description: Based off prov:Activity; an activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities.
    in_subset:
    - bican
    slots:
    - used
    - wasAssociatedWith
  prov entity:
    mixin: true
    description: Based off prov:Entity; an entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.
    in_subset:
    - bican
    slots:
    - wasDerivedFrom
    - wasGeneratedBy
    - wasAttributedTo
  aligned data:
    is_a: entity
    mixins:
    - prov entity
    in_subset:
    - bican
    - processing_elements
  alignment:
    description: A process that takes sequenced data and produces aligned data (cell x gene matrix).
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - sequencing_elements
  amplified cdna:
    is_a: material sample
    mixins:
    - prov entity
    in_subset:
    - bican
    - library_generation
    attributes:
      name:
        description: Name of a collection of cDNA molecules derived and amplified from an input barcoded_cell_sample. These cDNA molecules represent the gene expression of each cell, with all cDNA molecules from a given cell retaining that cell's unique barcode from the cell barcoding step. This is a necessary step for GEX methods but is not used for ATAC methods.
        aliases:
        - amplified cDNA label
        - amplified cdna name
      amplified quantity ng:
        description: Amount of cDNA produced after cDNA amplification measured in nanograms
        range: float
        aliases:
        - amplified cDNA amplified quantity ng
      PCR cycles:
        description: Number of PCR cycles used during cDNA amplification for this cDNA.
        range: integer
        aliases:
        - amplified cDNA PCR cycles
        - cDNA amplification cycles
      process date:
        description: Date of cDNA amplification.
        range: date
        aliases:
        - cDNA amplification process date
        - cDNA amplification date
      pass fail result:
        description: Pass or Fail result based on qualitative assessment of cDNA yield and size.
        range: Pass-FailResult
        aliases:
        - amplified cDNA RNA amplification pass-fail
        - cDNA amplification pass-fail result
      percent cDNA longer than 400bp:
        description: QC metric to measure mRNA degradation of cDNA. Higher % is higher quality starting material. Over 400bp is used as a universal cutoff for intact (full length) vs degraded cDNA and is a common output from Bioanalyzer and Fragment Analyzer elecropheragrams.
        range: float
        aliases:
        - amplified cDNA percent cDNA longer than 400bp
        - cDNA amplification percent cDNA greater than 400bp
      set:
        description: cDNA amplification set, containing multiple amplified_cDNA_names that were processed at the same time.
        aliases:
        - cDNA amplification set
    slot_usage:
      wasGeneratedBy:
        range: cdna amplification
  barcoded cell sample:
    is_a: material sample
    mixins:
    - prov entity
    in_subset:
    - bican
    - library_generation
    attributes:
      port well:
        description: Specific position of the loaded port of the 10x chip. An Enriched or Dissociated Cell Sample is loaded into a port on a chip (creating a Barcoded Cell Sample). Can be left null for non-10x methods.
        aliases:
        - barcoded cell sample port well
        - 10x chip port well
      input quantity:
        description: Number of enriched or dissociated cells/nuclei going into the barcoding process.
        range: integer
        aliases:
        - barcoded cell input quantity count
      name:
        description: Name of a collection of barcoded cells. Input will be either dissociated_cell_sample or enriched_cell_sample. Cell barcodes are only guaranteed to be unique within this one collection. One dissociated_cell_sample or enriched_cell_sample can lead to multiple barcoded_cell_samples.
        aliases:
        - barcoded cell sample label
        - barcoded cell sample name
        required: true
      expected cell capture:
        description: Expected number of cells/nuclei of a barcoded_cell_sample that will be barcoded and available for sequencing. This is a derived number from 'Barcoded cell input quantity count' that is dependent on the "capture rate" of the barcoding method. It is usually a calculated fraction of the 'Barcoded cell input quantity count' going into the barcoding method.
        range: integer
      study set:
        description: Intended cohort or dataset that the Barcoded Cell Sample initially belongs to. This Study helps to group together samples that are meant to be analyzed together. Multiple Studies can be assigned to a Barcoded Cell Sample. These studies are more granular than the grant or PI and can be used to group together samples from related ROIs.
        aliases:
        - study sets
    slot_usage:
      wasGeneratedBy:
        range: cell barcoding
  brain extraction:
    description: A process that takes a brain sample from a donor and produces a brain segment.
    is_a: material sample
    mixins:
    - prov entity
    in_subset:
    - bican
    - tissue_specimen
  brain section:
    is_a: material sample
    mixins:
    - prov entity
    in_subset:
    - bican
    - library_generation
  brain segment:
    is_a: material sample
    mixins:
    - prov entity
    in_subset:
    - bican
    - tissue_specimen
  brain segment sectioning:
    description: A process that takes a brain segment and produces a brain section.
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - tissue_specimen
  cdna amplification:
    description: A process that takes a barcoded cell sample and produces an amplified cDNA sample.
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - library_generation
  cell barcoding:
    description: A process that takes an enriched cell sample and produces a barcoded cell sample.
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - library_generation
  cell dissociation:
    description: A process that takes a tissue sample and produces a dissociated cell sample.
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - library_generation
  cell enrichment:
    description: A process that takes a dissociated cell sample and produces an enriched cell sample.
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - library_generation
  dissociated cell sample:
    is_a: material sample
    mixins:
    - prov entity
    in_subset:
    - bican
    - library_generation
    attributes:
      cell prep type:
        description: 'The type of cell preparation. For example: Cells, Nuclei. This is a property of dissociated_cell_sample.'
        range: CellPrepType
        aliases:
        - dissociated cell sample cell prep type
        required: true
      name:
        description: Name of a collection of dissociated cells or nuclei derived from dissociation of a tissue sample.
        aliases:
        - dissociated cell sample label
        - dissociated cell sample name
      oligo tag name:
        description: Name of oligo used in cell plexing. The oligo will tag allow separate dissociated cell samples to be combined downstream in the barcoded cell sample. The oligo name is associated with a sequence in a lookup table. This sequence will be needed to during analysis, after alignment, to associate reads with parent dissociated cell sample.
        aliases:
        - dissociated cell oligo tag name
    slot_usage:
      wasGeneratedBy:
        range: cell dissociation
  donor:
    description: A person or organism that is the source of a biological sample.
    is_a: agent
    mixins:
    - thing with taxon
  enriched cell sample:
    is_a: material sample
    mixins:
    - prov entity
    in_subset:
    - bican
    - library_generation
    attributes:
      container name:
        description: Name of container (strip or tube or plate) of the enriched_cell_prep. This container could contain 1 or more enriched_cell_samples.
        aliases:
        - enriched cell sample container name
      name:
        description: Name of collection of enriched cells or nuclei after enrichment process (usually via FACS using the Enrichment Plan) applied to dissociated_cell_sample.
        aliases:
        - enriched cell sample name
      population:
        description: Actual percentage of cells as a result of using set of fluorescent marker label(s) to enrich dissociated_cell_sample with desired mix of cell populations. This plan can also be used to describe 'No FACS' where no enrichment was performed. This is a property of enriched_cell_prep_container.
        aliases:
        - enrichment population
    slot_usage:
      wasGeneratedBy:
        range: split
  library:
    is_a: material sample
    mixins:
    - prov entity
    in_subset:
    - bican
    - library_generation
    attributes:
      avg size bp:
        description: Average size of the library in terms of base pairs. This is used to calculate the molarity before pooling and sequencing.
        range: integer
        aliases:
        - library avg size bp
      method:
        description: Standardized nomenclature to describe the library method used. This specifies the alignment method required for the library. For example, 10xV3.1 (for RNASeq single assay), 10xMult-GEX (for RNASeq multiome assay), and 10xMult-ATAC (for ATACSeq multiome assay)
        range: LibraryMethod
        aliases:
        - library method
        - library chemistry method
        required: true
      concentration nm:
        description: Concentration of library in terms of nM (nMol/L). Number of molecules is needed for accurate pooling of the libraries and for generating the number of target reads/cell in sequencing.
        range: float
        aliases:
        - library concentration nm
      creation date:
        description: Date of library construction
        range: date
        aliases:
        - library creation date
        - library construction date
      input quantity ng:
        description: Amount of cDNA going into library construction in nanograms.
        range: integer
        aliases:
        - library input ng
      name:
        description: Name of a library, which is a collection of fragmented and barcode-indexed DNA molecules for sequencing. An index or barcode is typically introduced to enable identification of library origin to allow libraries to be pooled together for sequencing.
        aliases:
        - library label
        - library name
      pass fail result:
        description: Pass or Fail result based on qualitative assessment of library yield and size.
        range: Pass-FailResult
        aliases:
        - library prep pass-fail
        - library prep pass-fail result
      prep set:
        description: Library set, containing multiple library_names that were processed at the same time.
        aliases:
        - library prep set
      quantity fmol:
        description: Amount of library generated in terms of femtomoles
        range: integer
        aliases:
        - library quantification fmol
      quantity ng:
        description: Amount of library generated in terms of nanograms
        range: integer
        aliases:
        - library quantification ng
      r1 r2 index:
        description: Name of the pair of library indexes used for sequencing. Indexes allow libraries to be pooled together for sequencing. Sequencing output (fastq) are demultiplexed by using the indexes for each library. The name will be associated with the sequences of i7, i5, and i5as, which are needed by SeqCores for demultiplexing. The required direction of the sequence (sense or antisense) of the index can differ depending on sequencing instruments.
        aliases:
        - R1/R2 index name
        required: true
    slot_usage:
      wasGeneratedBy:
        range: library construction
  library aliquot:
    is_a: material sample
    mixins:
    - prov entity
    in_subset:
    - bican
    - library_generation
    attributes:
      name:
        description: One library in the library pool. Each Library_aliquot_name in a library pool will have a unique R1/R2 index to allow for sequencing together then separating the sequencing output by originating library aliquot through the process of demultiplexing. The resulting demultiplexed fastq files will include the library_aliquot_name.
        aliases:
        - library aliquot label
        - library aliquot name
        required: true
    slot_usage:
      wasGeneratedBy:
        range: library aliquoting
  library aliquoting:
    description: A process that takes a library and produces an library aliquot.
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - library_generation
  library construction:
    description: A process that takes an amplified cDNA sample and produces a library.
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - library_generation
  library pool:
    is_a: material sample
    mixins:
    - prov entity
    in_subset:
    - bican
    - library_generation
    attributes:
      internal name:
        description: Library Pool Tube local name. Label of the tube containing the library pool, which is made up of multiple library_aliquots. This is a Library Lab local tube name, before the pool is aliquoted to the Seq Core provided tube 'Library Pool Tube Name'.
        aliases:
        - library pool tube internal label
        - libray pool tube local name
        required: true
      embargo date:
        description: date until which data much be embargoed
      barcode:
        description: Library Pool tube name as provided by the SeqCore (often a barcode). This tube is provided from the SeqCore and is part of the SeqCore tracking system.
        aliases:
        - SeqCore library pool tube barcode
        required: true
      name:
        description: Library lab's library pool name. For some labs this may be the same as "Libray pool tube local name". Other labs distinguish between the local tube label of the library pool and the library pool name provided to SeqCore for tracking. Local Pool Name is used to communicate sequencing status between SeqCore and Library Labs.
        aliases:
        - library pool label
        - local pool name
        required: true
      avg size bp:
        description: Average insert size of library pool, measured in base pairs.
        range: integer
        aliases:
        - library pool tube avg size bp
        - library pool avg size bp
      quantity fmol:
        description: Amount of library pool in the tube as measured in femtamoles (fmol)
        range: float
        aliases:
        - library pool fmol
      loading concentration pM:
        description: Sequencer Loading Concentration as measured in pM (pmol/L). This is a value used by the SeqCore.
        range: float
      read2 length:
        description: Separate field to replace the combined field "Sequencing cycle". Sequencing Cycle is needed for sequencing the library pool. The sequencing cycle needed is specific to the Library Chemistry Method and is required instruction to the SeqCores.
        range: integer
        aliases:
        - length of Read 2 (for Paired End Runs)
        required: true
      index1 length:
        description: Separate field to replace the combined field "Sequencing cycle". Sequencing Cycle is needed for sequencing the library pool. The sequencing cycle needed is specific to the Library Chemistry Method and is required instruction to the SeqCores.
        range: integer
        aliases:
        - length of Index 1 (i7 Primer)
        required: true
      index2 length:
        description: Separate field to replace the combined field "Sequencing cycle". Sequencing Cycle is needed for sequencing the library pool. The sequencing cycle needed is specific to the Library Chemistry Method and is required instruction to the SeqCores.
        range: integer
        aliases:
        - length of Index 2 (i5 Primer)
        required: true
      read1 length:
        description: Separate field to replace the combined field "Sequencing cycle". Sequencing Cycleis needed for sequencing the library pool. The sequencing cycle needed is specific to the Library Chemistry Method and is required instruction to the SeqCores.
        range: integer
        aliases:
        - length of Read 1
        required: true
      concentration nM:
        description: Library pool concentration as measured in nanomolarity (nMol/L)
        range: float
        aliases:
        - library pool tube contents nM
        - library pool concentration (nM)
      volume ul:
        description: Library pool volume as measured in ul
        range: integer
        aliases:
        - library pool tube volume ul
      PhiX spike in percent:
        description: PhiX spike-in percent desired to be added to the library pool for sequencing. PhiX is used to increase complexity of the sample being sequenced, to reduce sequencing artifacts maintain sequencing quality on the instruments. This is an optional instruction to the SeqCore.
        range: float
      custom primers:
        description: Custom sequencing primers if needed, indicate with reads require them (R1/R2/i7/i5)
        range: boolean
    slot_usage:
      wasGeneratedBy:
        range: library pooling
  library pooling:
    description: A process that takes a library aliquot and produces a library pool.
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - library_generation
  roi delineation:
    description: A process that takes a brain section and produces a region of interest polygon.
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - tissue_specimen
  roi polygon:
    is_a: entity
    mixins:
    - prov entity
    in_subset:
    - bican
    - tissue_specimen
  sequenced data:
    is_a: entity
    mixins:
    - prov entity
    in_subset:
    - bican
    - sequencing_elements
  sequencing:
    description: A process that takes a library pool and produces sequenced data (FASTQ files). A set of FASTQ files (R1, R2, R3, l1, l2) for each aliquot in a library pool "demultiplexed fastqs".
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - sequencing_elements
  split:
    description: A process that takes an enriched cell sample and produces a split enriched cell sample.
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - library_generation
  tissue dissection:
    description: A process that takes a brain section and produces a tissue sample.
    is_a: procedure
    mixins:
    - prov activity
    in_subset:
    - bican
    - tissue_specimen
  tissue sample:
    is_a: material sample
    mixins:
    - prov entity
    in_subset:
    - bican
    - library_generation
    attributes:
      name:
        description: Identifier name for final intact piece of tissue before cell or nuclei prep. This piece of tissue will be used in dissociation and has an ROI associated with it.
        aliases:
        - tissue sample label
        - tissue name|tissue sample label
        required: true
    slot_usage:
      wasGeneratedBy:
        range: tissue dissection
enums:
  CellPrepType:
    permissible_values:
      cell: null
      nuclei: null
  LibraryMethod:
    permissible_values:
      10xMult-GEX:
        description: RNASeq multiome assay
      10xMult-ATAC:
        description: ATACSeq multiome assay
  Pass-FailResult:
    permissible_values:
      pass:
        description: passed result
      fail:
        description: failed result
```

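For readers who want a feel for the mapping, here is a minimal sketch of how CSV rows could be grouped into the `classes` / `attributes` layout of the model above. It is not the converter in this PR: the column headers (`class`, `attribute`, `description`, `range`, `required`, `aliases`), the pipe-separated alias convention, and the function name `rows_to_classes` are all assumptions made for illustration, and the descriptions are shortened.

```python
import csv
import io

# Hypothetical column layout; the real gdoc export may use different headers.
CSV_TEXT = """class,attribute,description,range,required,aliases
library,avg size bp,Average size of the library in terms of base pairs.,integer,,library avg size bp
library,method,Standardized nomenclature to describe the library method used.,LibraryMethod,true,library method|library chemistry method
"""


def rows_to_classes(csv_text):
    """Group CSV rows into a LinkML-style ``classes`` mapping (plain dicts)."""
    classes = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # One entry per class, each holding an ``attributes`` mapping.
        attrs = classes.setdefault(row["class"], {"attributes": {}})["attributes"]
        attr = {"description": row["description"]}
        if row["range"]:
            attr["range"] = row["range"]
        if row["aliases"]:
            # Assumption: several aliases share one cell, separated by "|".
            attr["aliases"] = row["aliases"].split("|")
        if row["required"].strip().lower() == "true":
            attr["required"] = True
        attrs[row["attribute"]] = attr
    return classes


model = {"classes": rows_to_classes(CSV_TEXT)}
```

The resulting `model` dict could then be dumped with any YAML writer; optional keys (`range`, `aliases`, `required`) are omitted when the cell is empty, which matches how they appear in the model above.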
djarecka commented 5 months ago
  • > Would be good to have a round-trip test in here (csv2yaml + yaml2csv should return the same csv, and yaml2csv + csv2yaml should return the same yaml).

    I forgot that yaml2csv is part of this branch; it hasn't been updated to the new tables.

  • > There is no validation of the csv or yaml in the code. How would you detect issues before conversion, and similarly yaml issues before converting back?

    Do you mean validation of specific fields, or checking that this is a proper CSV file?

  • > This would be a really good project to discuss/hack on during the linkml hackashop. Seems this could use the linkml library in various places.

  • > Perhaps rename the scripts with something bican specific and put it in bkbit. Also a single entrypoint that converts in either direction.

    I could move this to bkbit if you want, but since this is only used to create a model, I thought it belongs in this repo, in the utils section.
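
To make the round-trip and validation points concrete, here is a rough sketch of what such checks could look like. It works on plain dicts rather than real files, and every name in it (`validate_rows`, `rows_to_model`, `model_to_rows`, the four-column layout) is hypothetical; it does not reflect the actual csv2yaml / yaml2csv scripts or the linkml library.

```python
FIELDS = ("class", "attribute", "description", "range")


def validate_rows(rows):
    """Return a list of problems found in the rows; an empty list means usable."""
    problems = []
    seen = set()
    for i, row in enumerate(rows, start=1):
        missing = [f for f in FIELDS if f not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
            continue
        if not row["class"] or not row["attribute"]:
            problems.append(f"row {i}: empty class or attribute name")
        key = (row["class"], row["attribute"])
        if key in seen:
            problems.append(f"row {i}: duplicate attribute {key}")
        seen.add(key)
    return problems


def rows_to_model(rows):
    """csv -> model direction (stand-in for csv2yaml)."""
    model = {"classes": {}}
    for row in rows:
        attrs = model["classes"].setdefault(row["class"], {"attributes": {}})["attributes"]
        attrs[row["attribute"]] = {"description": row["description"], "range": row["range"]}
    return model


def model_to_rows(model):
    """model -> csv direction (stand-in for yaml2csv)."""
    return [
        {"class": cls, "attribute": name, "description": a["description"], "range": a["range"]}
        for cls, body in model["classes"].items()
        for name, a in body["attributes"].items()
    ]


rows = [
    {"class": "library", "attribute": "creation date",
     "description": "Date of library construction", "range": "date"},
    {"class": "library pool", "attribute": "volume ul",
     "description": "Library pool volume as measured in ul", "range": "integer"},
]

problems = validate_rows(rows)           # field-level checks before converting
model = rows_to_model(rows)
rows_again = model_to_rows(model)        # csv -> yaml -> csv
model_again = rows_to_model(rows_again)  # yaml -> csv -> yaml
```

A round-trip test would then assert `rows_again == rows` and `model_again == model`. Note that in this sketch the row order only survives the round trip when rows of the same class are contiguous in the input.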