MEP-LINCS / knowledgeportal

Tracking issues related to https://www.synapse.org/mep_lincs

decide on leveling scheme #47

Open kdaily opened 5 years ago

kdaily commented 5 years ago

I've started adding levels. We made up levels for MEP data; the GDC now uses different levels (e.g., 1 is raw data and it goes up from there). Since much of this data is in the genomics realm now, do we want to use that scheme, or stay with the MEP-based scheme?

danielderrick commented 5 years ago

The lower MEP levels are cell-level, but only the IF/cycIF data is cell-level, so I think it makes sense to go with something like the GDC levels, which are:

| Level Number | Definition |
| --- | --- |
| 1 | Raw data |
| 2 | Normalized data |
| 3 | Aggregated data |
| 4 | Regions of Interest data |
| 0 | No Level |

I don't think that these descriptions exactly describe the data levels we want to make available, though. I am thinking that we want to have something more like this:

| Data level | Definition | RNAseq | ATACseq | RPPA | cycIF | GCP | L1000 | IF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | raw data | bam | bam | ? | images | ? | ? | images |
| 2 | quantified data | integer counts | integer counts | ? | cell data, no batch correction | ? | Level 2/ GEX data | cell data, no batch correction |
| 3 | normalized data | log2(fpkm + 1) | log2(fpkm + 1) | data sent by MD Anderson (excel sheet) | cell data, batch corrected | unnormalized data from Jake Jaffe | Level 3/ QNORM data | cell data, batch corrected |
| 4 | aggregated normalized data | NA | NA | NA | well-level data, batch corrected | NA | NA | well-level data, batch corrected |
| 5 | EGF normalized | | | | | | | |
| 6 | EGF normalized, median summarized | ... | | | | | | |
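For reference, the `log2(fpkm + 1)` entry for normalized RNAseq/ATACseq data can be sketched as below. This is only a minimal illustration of the transform named in the table; the function name and the example values are made up here and are not taken from the MEP-LINCS pipeline.

```python
import math

def log2_fpkm_plus_one(fpkm_values):
    """Apply log2(fpkm + 1) to each FPKM value (hypothetical helper)."""
    # The +1 pseudocount keeps zero FPKM values defined (they map to 0.0).
    return [math.log2(value + 1) for value in fpkm_values]

# Illustrative values only, not real data.
print(log2_fpkm_plus_one([0, 1, 3]))
```

The pseudocount is what makes zero-expression features usable after the log transform.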

@markdane what do you think?

danielderrick commented 5 years ago

@kdaily, Mark and I discussed the leveling scheme this morning. We settled on this, which follows the GDC levels a bit more closely:

| Data level | Definition | RNAseq | ATACseq | RPPA | cycIF | GCP | L1000 | IF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | No Level | bam | bam | MD Anderson excel file | raw images | GCP_MCF10a_raw.gct file | LISS-normalized gctx file | raw images |
| 1 | Raw data | integer counts | integer counts | RawLog2 sheet from excel (tsv) | quantified features, no batch correction | GCP_MCF10a_raw.gct file as tsv | LISS-normalized gctx file as tsv | quantified features, no norm |
| 2 | Normalized data | log2(fpkm + 1) | log2(fpkm + 1) | normlog2 sheet from excel (tsv) | batch-corrected cell-level features | MS1-normalized GCP | quantile-normalized data | batch-corrected cell-level features |
| 3A | aggregated to sample level | log2(fpkm + 1) | log2(fpkm + 1) | normlog2 sheet from excel (tsv) | well-level batch-corrected features | MS1-normalized GCP | quantile-normalized data | well-level features |
| 3B | aggregated to condition | condition median/mean | | | | | | |
| 4A | EGF normalized | EGF-normalized level 3A | | | | | | |
| 4B | EGF normalized, median summarized | condition median/mean of 4A | | | | | | |
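One way this scheme might be applied when annotating files is as a small lookup, sketched below. The dictionary and function names are hypothetical; only the level labels, definitions, and RNAseq entries come from the table above, and nothing here reflects how the annotations are actually stored on Synapse.

```python
# Hypothetical encoding of the proposed levels (labels and definitions from the table).
LEVEL_DEFINITIONS = {
    "0": "No Level",
    "1": "Raw data",
    "2": "Normalized data",
    "3A": "aggregated to sample level",
    "3B": "aggregated to condition",
    "4A": "EGF normalized",
    "4B": "EGF normalized, median summarized",
}

# RNAseq column of the table for levels 0 through 3A.
RNASEQ_CONTENT = {
    "0": "bam",
    "1": "integer counts",
    "2": "log2(fpkm + 1)",
    "3A": "log2(fpkm + 1)",
}

def describe(level, assay_content):
    """Return a one-line description of a level for one assay."""
    definition = LEVEL_DEFINITIONS[level]  # raises KeyError for unknown levels
    content = assay_content.get(level, "not specified")
    return f"Level {level} ({definition}): {content}"

print(describe("1", RNASEQ_CONTENT))
```

Keeping the level labels as strings allows sub-levels such as 3A and 3B to sit alongside the plain integer levels.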

Some notes on that..

kdaily commented 5 years ago

Thanks! I think I can apply this.

kdaily commented 5 years ago

I don't see any integer count files for ATACseq. Can you point me to those? The folder called 'no normalization' doesn't have integers in it. https://www.synapse.org/#!Synapse:syn18485477
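A quick check along these lines can tell whether a candidate file holds integer counts. This is a rough sketch that assumes a tab-delimited matrix with one header row and feature IDs in the first column; that layout and the function name are assumptions, not something confirmed for the files in that folder.

```python
import csv
import io

def has_only_integer_counts(handle, delimiter="\t"):
    """Return True if every value after the first column is a whole number."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the assumed header row
    for row in reader:
        for value in row[1:]:  # first column is assumed to be a feature ID
            try:
                number = float(value)
            except ValueError:
                return False  # non-numeric or empty cell
            if not number.is_integer():
                return False
    return True

# Illustrative in-memory example, not real data.
example = io.StringIO("peak\ts1\ts2\np1\t3\t0\np2\t12\t7\n")
print(has_only_integer_counts(example))
```

For a file on disk, the same function can be called with an open file handle instead of the `io.StringIO` example.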

kdaily commented 5 years ago

I need help identifying which files go with which level for RNAseq (beyond level 0/1).

kdaily commented 5 years ago

Going through this, I find levels with sub-levels very confusing. I suggest collapsing them to the same integer level.
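As a sketch of what that collapsing could look like, assuming it simply means dropping the letter suffix so that 3A and 3B both become 3 (one possible reading of the suggestion; the function name is hypothetical):

```python
import re

def collapse_level(level):
    """Map a level label such as '3A' to its integer level (3)."""
    match = re.fullmatch(r"(\d+)[A-Za-z]?", str(level).strip())
    if match is None:
        raise ValueError(f"unrecognized level label: {level!r}")
    return int(match.group(1))

for label in ["0", "1", "2", "3A", "3B", "4A", "4B"]:
    print(label, "->", collapse_level(label))
```

Note that this loses the distinction between sample-level and condition-level aggregation, so that information would need to be carried in a separate annotation if it still matters.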