databio / bedboss

Python pipeline for processing BED files for BEDbase
https://docs.bedbase.org
BSD 2-Clause "Simplified" License

Bedbase PEP schema in pephub #29

Closed khoroshevskyi closed 5 months ago

khoroshevskyi commented 7 months ago

Steps:

  1. Investigate what metadata we need, and what standards to use
  2. Implement third-party standards, which are currently in development

Useful information: https://github.com/fairtracks/fairtracks_standard#overview-of-structure-of-the-fairtracks-standard

Here's the full blueprint metadata: https://github.com/fairtracks/fairtracks_standard/tree/master/json/blueprint

Alongside ENCODE, we should also add all blueprint files to BEDbase.

donaldcampbelljr commented 7 months ago

Here's the actual JSON Schema used (it is referenced in one of the above blueprints): https://raw.githubusercontent.com/fairtracks/fairtracks_standard/v1/current/json/schema/fairtracks.schema.json

nsheff commented 6 months ago

Can the two of you solidify an initial first schema? Ideally just use the fairtracks standard, if it fits.

donaldcampbelljr commented 6 months ago

We have a minimal input schema already: https://schema.databio.org/?namespace=pipelines&schema=bedboss

We are currently deciding on input schemas for ENCODE and FAIRtracks, and on a minimal output schema.

donaldcampbelljr commented 6 months ago

For the minimal ENCODE input schema, we have narrowed the 59 columns found in the raw ENCODE data down to 20 columns:

- File accession - experiment global ID
- File type
- File format type
- Output type - make the header more specific
- File assembly - genome / ref genome
- Experiment accession - experiment sample ID
- Assay - experiment protocol
- Biosample term id
- Biosample term name
- Biosample type
- Biosample organism
- Biosample treatments
- Biosample genetic modifications methods
- Biosample genetic modifications categories
- Biosample genetic modifications targets (MAYBE: will have to dump JSON and it will add more columns)
- Experiment target
- Library made from - sample molecule (GEO), library source
- Experiment date released
- Project
- File Download URL
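The narrowing step described above can be sketched as a simple column filter. This is only an illustration: the column names are copied from the list in this comment, the exact header spelling in a real ENCODE metadata export may differ, and `select_columns` is a hypothetical helper, not part of bedboss.

```python
import csv
import io

# Column names copied from the list above. The exact header spelling in a
# real ENCODE export may differ, so treat these as assumptions.
KEPT_COLUMNS = [
    "File accession", "File type", "File format type", "Output type",
    "File assembly", "Experiment accession", "Assay", "Biosample term id",
    "Biosample term name", "Biosample type", "Biosample organism",
    "Biosample treatments", "Biosample genetic modifications methods",
    "Biosample genetic modifications categories",
    "Biosample genetic modifications targets", "Experiment target",
    "Library made from", "Experiment date released", "Project",
    "File Download URL",
]


def select_columns(tsv_text, kept=KEPT_COLUMNS):
    """Parse tab-separated text and keep only the listed columns that exist."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    present = [c for c in kept if c in (reader.fieldnames or [])]
    return [{c: row[c] for c in present} for row in reader]
```

Columns missing from the input are skipped silently here; a stricter version could raise an error instead.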

khoroshevskyi commented 6 months ago

After discussion, we have a minimal output schema:

Minimal output schema: 
- sample_name [Required] 
- genome  [Required] (e.g. hg38)
- bed_type [Required] (e.g. bed3)
- format_type [Required] (e.g. narrowPeak)
- organism (e.g. Homo sapiens)
- species_id (e.g. 9606 )
- cell_type (e.g. K562)
- cell_line (e.g. C4-2B)
- exp_protocol (e.g. DNase-seq, TF ChIP-seq, histone ChIP-seq, ATAC-seq)
- library_source (e.g. genomic DNA, DNA-encode)
- target (e.g. H3K36me3)
- antibody (e.g. anti-H3K36me3)
- tissue (e.g. blood / liver / brain)
- global_sample_id (e.g. "encode:ENCBS192PUU", "geo:GSM1234")
- global_experiment_id (e.g. "encode:ENC00000", "geo:GSE1234")
- description
- file_url [Required]
- file_name [Required]

nsheff commented 6 months ago

Ok, can you use JSON-schema format? Add type, description, and required annotations
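As a rough illustration of what that could look like, here is the minimal output schema from the previous comment written as a JSON Schema object (held in a Python dict so it can be checked and dumped). The `type` and `description` values are assumptions for discussion, only some of the optional fields are shown, and `missing_required` is a hypothetical helper rather than a full JSON Schema validator.

```python
import json

# Draft sketch only: types and descriptions are assumptions, not an agreed schema.
MINIMAL_OUTPUT_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "BEDbase minimal output schema (draft sketch)",
    "type": "object",
    "properties": {
        "sample_name": {"type": "string", "description": "Name of the sample"},
        "genome": {"type": "string", "description": "Genome assembly, e.g. hg38"},
        "bed_type": {"type": "string", "description": "BED type, e.g. bed3"},
        "format_type": {"type": "string", "description": "File format, e.g. narrowPeak"},
        "organism": {"type": "string", "description": "Organism, e.g. Homo sapiens"},
        "cell_type": {"type": "string", "description": "Cell type, e.g. K562"},
        "description": {"type": "string", "description": "Free-text description"},
        "file_url": {"type": "string", "description": "URL of the file"},
        "file_name": {"type": "string", "description": "Name of the file"},
    },
    "required": [
        "sample_name", "genome", "bed_type", "format_type",
        "file_url", "file_name",
    ],
}


def missing_required(record, schema=MINIMAL_OUTPUT_SCHEMA):
    """List required keys absent from a record (not a full schema validation)."""
    return [key for key in schema["required"] if key not in record]


# The dict serializes directly to a JSON Schema document.
schema_json = json.dumps(MINIMAL_OUTPUT_SCHEMA, indent=2)
```

For example, `missing_required({"sample_name": "s1", "genome": "hg38"})` reports the four required fields that are still missing.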

nsheff commented 6 months ago

Some of these are specific to a particular protocol; e.g. 'antibody' makes sense for ChIP-seq but not for ATAC-seq.

So that wouldn't really be 'minimal', it would be an extension of the schema specific to ChIP-seq
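One way to picture this split is a small base schema plus a protocol-specific extension that is merged in only for ChIP-seq samples. The sketch below is an assumption about how that could be organized: `BASE_SCHEMA` is a cut-down stand-in, `extend_schema` is a hypothetical helper, and making `antibody` required for ChIP-seq is a choice made here for illustration.

```python
import copy

# Cut-down stand-in for the minimal schema (assumption, for illustration only).
BASE_SCHEMA = {
    "type": "object",
    "properties": {
        "sample_name": {"type": "string"},
        "exp_protocol": {"type": "string"},
    },
    "required": ["sample_name"],
}

# Fields that only make sense for ChIP-seq experiments.
CHIP_SEQ_EXTENSION = {
    "properties": {
        "target": {"type": "string", "description": "e.g. H3K36me3"},
        "antibody": {"type": "string", "description": "e.g. anti-H3K36me3"},
    },
    "required": ["antibody"],
}


def extend_schema(base, extension):
    """Return a new schema with the extension's properties and required keys added."""
    merged = copy.deepcopy(base)
    merged.setdefault("properties", {}).update(
        copy.deepcopy(extension.get("properties", {}))
    )
    required = merged.setdefault("required", [])
    for name in extension.get("required", []):
        if name not in required:
            required.append(name)
    return merged


CHIP_SEQ_SCHEMA = extend_schema(BASE_SCHEMA, CHIP_SEQ_EXTENSION)
```

The base schema is left untouched, so an ATAC-seq sample can still be checked against it without any ChIP-seq fields.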

khoroshevskyi commented 6 months ago

This is the FAIRtracks output schema for discussion:

FAIRtracks schema:
- global_id: (e.g. encode:ENCBS13333)
- local_id: (e.g. fghjkjhgfdfghjhgfdfghj) [Required]
- species_id: (e.g. "taxonomy:9606") [Required] - identifiers.org (NCBI format)
- species_name: (e.g. Homo sapiens)
- biospecimen_class: ????? (json_obj) [Required]
- Sample_type: [Required] - This is an object
  - cell_type: (e.g. K562) 
  - abnormal_cell_type: (e.g.)
  - cell_line: (e.g. C4-2B)
  - organism_part: (e.g. liver)
- phenotype [Required]: Main phenotype (e.g. leukemia) 

P.S. It seems impossible to have all these items for each sample, as this info doesn't exist for every GEO sample.

Additional columns from experiment:
- technique [Required]: "Main technique used in experiment (e.g., laboratory, computational or statistical technique)" (We probably don't need that)
- target [Required]: "Main target of the experiment" (e.g. H3K4_trimethylation)
- study_reference [Required]: 
- gene_id: HGNC identifier for gene targeted by the experiment (e.g., specific transcription factor used as ChIP-seq antibody).
- gene_product_type: Gene product type targeted by the experiment (e.g., miRNA)
- macromolecular_structure: Macromolecular structure targeted by the experiment (e.g., chromatin structure)
- lab_protocol_description: "Free-text description of lab protocol, or URL to such description"
- compute_protocol_description: Free-text description of computational protocol, or URL to such description

Additional columns from track:
- assembly_id [Required]: Genome assembly identifier: "insdc.gca:GCF_000001405.26"
- assembly_name [Required]: Genome assembly name: "GRCh38"
- experiment_ref [Required]: Reference to the experiment of the track (e.g., "encode:ENCSR000DQP")
- file_url [Required]: URL to the file
- file_name: Name of the file (e.g., "ENCFF000VZC.bed.gz")
- label_short [Required]: A short label of the track file. Suggested maximum length is 25 characters
- label_long [Required]: A long label of the track file. Suggested maximum length is 80 characters
- file_format [Required]: File format (e.g., "bed", "narrowPeak", "broadPeak")
- type_of_condensed_data [Required]: (e.g. "Narrow peaks")
- geometric_track_type [Required]: (e.g. Segments)
- checksum [Required]: Method of checksum generation (e.g. MD5)
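
The suggested label lengths above are easy to check mechanically. A minimal sketch, assuming a track is represented as a plain dict with the field names from the list; `label_warnings` is a hypothetical helper, and because the limits are only suggestions it returns warnings instead of raising errors.

```python
def label_warnings(track, short_max=25, long_max=80):
    """Return warning strings for labels longer than the suggested maximums."""
    warnings = []
    for field, limit in (("label_short", short_max), ("label_long", long_max)):
        value = track.get(field, "")
        if len(value) > limit:
            warnings.append(
                f"{field} is {len(value)} characters (suggested maximum {limit})"
            )
    return warnings
```

Missing labels are treated as empty strings here; whether a missing required label should be reported is a separate check.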

khoroshevskyi commented 6 months ago

> Some of these are specific to a particular protocol; e.g. 'antibody' makes sense for ChIP-seq but not for ATAC-seq.
>
> So that wouldn't really be 'minimal', it would be an extension of the schema specific to ChIP-seq

The fields marked [Required] are the minimal output schema.

khoroshevskyi commented 6 months ago

We should store the standardized metadata that has a schema (the metadata that we generated) in the BEDbase database, and keep all other info in PEPhub.
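A minimal sketch of that split, assuming a sample's metadata is a flat dict and the schema is known as a set of field names. `split_metadata` is a hypothetical helper; no BEDbase or PEPhub API calls are shown, since how each part gets written is not decided in this thread.

```python
def split_metadata(record, schema_fields):
    """Split a record into (standardized, other) by schema membership."""
    standardized = {k: v for k, v in record.items() if k in schema_fields}
    other = {k: v for k, v in record.items() if k not in schema_fields}
    return standardized, other


# Example: fields named in the schema go to BEDbase, the rest to PEPhub.
schema_fields = {"sample_name", "genome", "file_url"}
record = {"sample_name": "s1", "genome": "hg38", "antibody": "anti-H3K36me3"}
for_bedbase, for_pephub = split_metadata(record, schema_fields)
```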