YeoLab / flotilla

Reproducible machine learning analysis of gene expression and alternative splicing data
http://yeolab.github.io/flotilla/docs
BSD 3-Clause "New" or "Revised" License

Allow for any number of expression or splicing datasets on a study #319

Open olgabot opened 8 years ago

olgabot commented 8 years ago

Some studies may include several gene expression datasets (RNA-Seq, RT-PCR) or several splicing datasets (for example, percent spliced-in computed separately for the 5' side and the 3' side of the intron). Datasets of the same kind share assumptions about the data (a roughly log-normal distribution for expression, values between 0 and 1 for splicing), so each can use the same underlying ExpressionData or SplicingData methods while acting on its own separate data. This could similarly be extended to location-style data types like ChIP-Seq, CLIP-Seq, Methyl-seq, RNA editing, etc.

I envision implementing this in the datapackage as:

{
  "name": "million_dollar_dataset", 
  "title": null, 
  "datapackage_version": "0.1.0", 
  "sources": null, 
  "licenses": null, 
  "species": "hg19", 
  "resources": [
    {
      "path": "psi5.csv.gz", 
      "format": "csv", 
      "data_type": "splicing",
      "name": "psi5", 
      "compression": "gzip"
    }, 
    {
      "name": "psi5_feature", 
      "format": "csv", 
      "rename_col": "gene_name", 
      "ignore_subset_cols": [
        "ensembl_gene", 
        "gencode_gene", 
        "gencode_transcript", 
        "ensembl_transcript", 
        "gene_name", 
        "transcript_id", 
        "havana_gene", 
        "gencode_id"
      ], 
      "path": "psi5_feature.csv.gz", 
      "expression_id_col": "one_ensembl_id", 
      "compression": "gzip"
    }, 
    {
      "path": "psi3.csv.gz", 
      "format": "csv", 
      "data_type": "splicing",
      "name": "psi3", 
      "compression": "gzip"
    }, 
    {
      "name": "psi3_feature", 
      "format": "csv", 
      "rename_col": "gene_name", 
      "ignore_subset_cols": [
        "ensembl_gene", 
        "gencode_gene", 
        "gencode_transcript", 
        "ensembl_transcript", 
        "gene_name", 
        "transcript_id", 
        "havana_gene", 
        "gencode_id"
      ], 
      "path": "psi3_feature.csv.gz", 
      "expression_id_col": "one_ensembl_id", 
      "compression": "gzip"
    }, 
    {
      "path": "rtpcr.csv.gz", 
      "format": "csv", 
      "data_type": "expression",
      "name": "rtpcr", 
      "compression": "gzip"
    }, 
    {
      "name": "rtpcr_feature", 
      "format": "csv", 
      "rename_col": "gene_name", 
      "path": "rtpcr_feature.csv.gz", 
      "compression": "gzip"
    }, 
    {
      "path": "rnaseq.csv.gz", 
      "format": "csv", 
      "data_type": "expression",
      "name": "rnaseq", 
      "compression": "gzip"
    }, 
    {
      "name": "rnaseq_feature", 
      "format": "csv", 
      "rename_col": "gene_name", 
      "ignore_subset_cols": [
        "ensembl_gene", 
        "gencode_gene", 
        "gencode_transcript", 
        "ensembl_transcript", 
        "gene_name", 
        "transcript_id", 
        "havana_gene", 
        "gencode_id"
      ], 
      "path": "rnaseq_feature.csv.gz", 
      "compression": "gzip"
    }, 
    {
      "name": "mapping_stats", 
      "format": "csv", 
      "min_reads": 1000000.0, 
      "path": "mapping_stats.csv.gz", 
      "number_mapped_col": "Uniquely mapped reads number", 
      "compression": "gzip"
    }, 
    {
      "path": "gene_ontology.csv.gz", 
      "format": "csv", 
      "name": "gene_ontology", 
      "compression": "gzip"
    }, 
    {
      "name": "metadata", 
      "format": "csv", 
      "minimum_samples": 20, 
      "path": "metadata.csv.gz", 
      "phenotype_col": "phenotype", 
      "compression": "gzip"
    }
  ]
}

Notice that the {rnaseq,rtpcr,psi3,psi5}_feature resources hold the feature metadata for the corresponding datasets. They may later be moved so they are stored within the {rnaseq,rtpcr,psi3,psi5} entries themselves.
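To make the layout concrete, here is a minimal sketch of how a loader might group the resources by "data_type" and pair each dataset with its "<name>_feature" resource. This is not flotilla's actual loading code: the `group_resources` helper and the abbreviated datapackage are hypothetical and only illustrate the naming convention proposed above.

```python
import json
from collections import defaultdict

# Abbreviated, hypothetical datapackage using the same layout as the proposal
DATAPACKAGE = json.loads("""
{
  "name": "million_dollar_dataset",
  "resources": [
    {"name": "psi5", "data_type": "splicing", "path": "psi5.csv.gz"},
    {"name": "psi5_feature", "path": "psi5_feature.csv.gz"},
    {"name": "rtpcr", "data_type": "expression", "path": "rtpcr.csv.gz"},
    {"name": "rtpcr_feature", "path": "rtpcr_feature.csv.gz"},
    {"name": "metadata", "path": "metadata.csv.gz"}
  ]
}
""")


def group_resources(datapackage):
    """Group dataset resources by data_type, attaching feature metadata.

    Hypothetical helper: resources without a "data_type" key (metadata,
    mapping_stats, the *_feature tables) are not treated as datasets.
    """
    by_name = {r["name"]: r for r in datapackage["resources"]}
    grouped = defaultdict(dict)
    for resource in datapackage["resources"]:
        data_type = resource.get("data_type")
        if data_type is None:
            continue
        # Look up the matching "<name>_feature" resource, if one exists
        feature = by_name.get(resource["name"] + "_feature")
        grouped[data_type][resource["name"]] = {
            "data": resource,
            "feature": feature,
        }
    return dict(grouped)


grouped = group_resources(DATAPACKAGE)
for data_type, datasets in sorted(grouped.items()):
    print(data_type, sorted(datasets))
```

Keying on "data_type" rather than on fixed resource names is what would let a study carry any number of expression or splicing datasets.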

This datapackage would then produce a Study object where you can do:

study.plot_rnaseq('RBFOX2')
study.plot_rtpcr('ACTB')
study.plot_psi3("NRXN1")
study.plot_psi5("PKM")
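One possible way to get per-dataset methods like these without hard-coding them is to resolve `plot_<name>` dynamically. The sketch below is an assumption, not the current flotilla `Study`: the `DatasetStub` class, its `plot_feature` method, and this simplified `Study` constructor are all hypothetical stand-ins.

```python
class DatasetStub:
    """Hypothetical stand-in for an ExpressionData or SplicingData object."""

    def __init__(self, name, data_type):
        self.name = name
        self.data_type = data_type

    def plot_feature(self, feature_id):
        # A real implementation would draw a figure; this stub only
        # returns a description of what would be plotted.
        return "{} ({}): {}".format(self.name, self.data_type, feature_id)


class Study:
    """Simplified, hypothetical Study holding any number of named datasets."""

    def __init__(self, datasets):
        self.datasets = dict(datasets)

    def __getattr__(self, attr):
        # __getattr__ is only called when normal attribute lookup fails,
        # so regular attributes and methods are unaffected.
        if attr.startswith("plot_"):
            name = attr[len("plot_"):]
            # Read from __dict__ directly to avoid recursing into __getattr__
            datasets = self.__dict__.get("datasets", {})
            if name in datasets:
                return datasets[name].plot_feature
        raise AttributeError(attr)


study = Study({
    "rnaseq": DatasetStub("rnaseq", "expression"),
    "psi5": DatasetStub("psi5", "splicing"),
})
print(study.plot_rnaseq("RBFOX2"))
print(study.plot_psi5("PKM"))
```

The trade-off is that dynamically resolved methods do not show up in tab completion unless `__dir__` is also overridden; generating the methods explicitly with `setattr` at initialization would be an alternative.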

It's going to be hard to implement this without making Study initialization too complicated, because even the current implementation is really all over the place.

This feature would make flotilla a true one-stop shop for everything you need to do with a particular dataset.