Open kdaily opened 5 years ago
The lower MEP levels are cell-level, but only the IF/cycIF data is cell-level, so I think it makes sense to go with something like the GDC levels, which are:
Level Number | Definition |
---|---|
1 | Raw data |
2 | Normalized data |
3 | Aggregated data |
4 | Regions of Interest data |
0 | No Level |
I don't think that these descriptions exactly describe the data levels we want to make available, though. I am thinking that we want to have something more like this:
Data level | Definition | RNAseq | ATACseq | RPPA | cycIF | GCP | L1000 | IF |
---|---|---|---|---|---|---|---|---|
1 | raw data | bam | bam | ? | images | ? | ? | images |
2 | quantified data | integer counts | integer counts | ? | cell data, no batch correction | ? | Level 2/ GEX data | cell data, no batch correction |
3 | normalized data | log2(fpkm + 1) | log2(fpkm + 1) | data sent by MD Anderson (excel sheet) | cell data, batch corrected | unnormalized data from Jake Jaffe | Level 3/ QNORM data | cell data, batch corrected |
4 | aggregated normalized data | NA | NA | NA | well-level data, batch corrected | NA | NA | well-level data, batch corrected |
5 | EGF normalized | … | ||||||
6 | EGF normalized, median summarized | ... |
@markdane what do you think?
@kdaily, Mark and I discussed the leveling scheme this morning. We settled on this, which follows the GDC levels a bit better:
Data level | Definition | RNAseq | ATACseq | RPPA | cycIF | GCP | L1000 | IF |
---|---|---|---|---|---|---|---|---|
0 | No Level | bam | bam | MD Anderson excel file | raw images | GCP_MCF10a_raw.gct file | LISS-normalized gctx file | raw images |
1 | Raw data | integer counts | integer counts | RawLog2 sheet from excel (tsv) | quantified features, no batch correction | GCP_MCF10a_raw.gct file as tsv | LISS-normalized gctx file as tsv | quantified features, no norm |
2 | Normalized data | log2(fpkm + 1) | log2(fpkm + 1) | normlog2 sheet from excel (tsv) | batch-corrected cell-level features | MS1-normalized GCP | quantile-normalized data | batch-corrected cell-level features |
3A | aggregated to sample level | log2(fpkm + 1) | log2(fpkm + 1) | normlog2 sheet from excel (tsv) | well-level batch-corrected features | MS1-normalized GCP | quantile-normalized data | well-level features |
3B | aggregated to condition | condition median/mean | … | |||||
4A | EGF normalized | EGF-normalized level 3A | … | |||||
4B | EGF normalized, median summarized | condition median/mean of 4A | … |
Some notes on that:
level 0
: The rawest form of the data that we have. This would be the images, bam files, and any raw data we received from other groups (e.g., the Excel files from MD Anderson).
level 1
: The rawest data that has been quantified. Everything at this level (and beyond) should be a matrix. Some data at this level will be identical to what is in level 0, but reformatted (no Excel files).
level 2
: Level 1 data with normalization: sequencing data as log2(fpkm + 1) matrices, and IF/cycIF data batch-corrected at the cell level.
level 3A
: Level 2 data with IF/cycIF aggregated from the cell-level to the well/sample-level. For non-imaging data, this level is identical to level 2.
level 3B
: Level 3A data mean/median summarized to the treatment level.
level 4A
: Level 3A data normalized to replicate-specific EGF conditions.
level 4B
: Level 4A data mean/median summarized to the treatment level.
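To make the chain from level 2 to level 4B concrete, here is a minimal sketch in plain Python. It is not the pipeline actually used for this data. Several things are assumptions for illustration: the record fields (`replicate`, `treatment`, `well`, `value`), the placeholder treatment name `ligand_X`, the use of the median for the cell-to-well aggregation, and especially the form of the EGF normalization, which is written here as subtracting the median EGF value from the same replicate. The notes above do not say whether the real normalization is a difference, a ratio, or something else.

```python
import math
from collections import defaultdict
from statistics import median


def log2_fpkm(fpkm):
    """Level 2 transform for sequencing data: log2(fpkm + 1)."""
    return math.log2(fpkm + 1)


def aggregate_to_wells(cells):
    """Level 2 -> 3A: collapse cell-level values to one value per well.

    Assumption: the median is used as the aggregation function.
    """
    groups = defaultdict(list)
    for c in cells:
        groups[(c["replicate"], c["treatment"], c["well"])].append(c["value"])
    return [
        {"replicate": r, "treatment": t, "well": w, "value": median(v)}
        for (r, t, w), v in groups.items()
    ]


def egf_normalize(wells, control="EGF"):
    """Level 3A -> 4A: normalize each well to the EGF wells of its own replicate.

    Assumption: normalization is a subtraction of the replicate's median EGF value.
    """
    ctrl = defaultdict(list)
    for w in wells:
        if w["treatment"] == control:
            ctrl[w["replicate"]].append(w["value"])
    baseline = {r: median(v) for r, v in ctrl.items()}
    return [{**w, "value": w["value"] - baseline[w["replicate"]]} for w in wells]


def summarize_by_treatment(rows):
    """Level 3A -> 3B, or 4A -> 4B: median per treatment across wells/replicates."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["treatment"]].append(r["value"])
    return {t: median(v) for t, v in groups.items()}


# Hypothetical batch-corrected cell-level (level 2) records.
cells = [
    {"replicate": "A", "treatment": "EGF", "well": "A01", "value": 1.0},
    {"replicate": "A", "treatment": "EGF", "well": "A01", "value": 3.0},
    {"replicate": "A", "treatment": "ligand_X", "well": "A02", "value": 5.0},
    {"replicate": "A", "treatment": "ligand_X", "well": "A02", "value": 7.0},
    {"replicate": "B", "treatment": "EGF", "well": "A01", "value": 2.0},
    {"replicate": "B", "treatment": "EGF", "well": "A01", "value": 4.0},
    {"replicate": "B", "treatment": "ligand_X", "well": "A02", "value": 8.0},
    {"replicate": "B", "treatment": "ligand_X", "well": "A02", "value": 10.0},
]

level_3a = aggregate_to_wells(cells)
level_3b = summarize_by_treatment(level_3a)
level_4a = egf_normalize(level_3a)
level_4b = summarize_by_treatment(level_4a)
```

For the non-imaging assays, `aggregate_to_wells` would be a no-op, which matches the note that level 3A is identical to level 2 for that data.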
Thanks! I think I can apply this.
I don't see any integer count files for ATACseq. Can you point me to those? The folder called 'no normalization' doesn't have integers in it. https://www.synapse.org/#!Synapse:syn18485477
I need help identifying which files go with which level for RNAseq (beyond level 0/1).
In going through this, I found that having levels with sub-levels is very confusing. I suggest collapsing them to the same integer level.
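One possible reading of that suggestion, sketched in Python: drop the letter suffix so that 3A and 3B both become 3, and 4A and 4B both become 4. The function name `collapse_level` and the label format it accepts are assumptions for illustration, not an agreed annotation scheme.

```python
import re


def collapse_level(label: str) -> int:
    """Map a level label such as '3A' or '4B' to its integer level (3 or 4)."""
    match = re.fullmatch(r"(\d+)[A-Za-z]?", label.strip())
    if match is None:
        raise ValueError(f"unrecognized data level: {label!r}")
    return int(match.group(1))


labels = ["0", "1", "2", "3A", "3B", "4A", "4B"]
collapsed = {label: collapse_level(label) for label in labels}
```

With this mapping the level alone no longer distinguishes sample-level from condition-level files, so that distinction would have to be recorded somewhere else.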
I've started adding levels. We made up levels for MEP data; the GDC now uses different levels (e.g., `1` is raw data and goes from there). Since much of this data is in the genomics realm now, do we want to use that scheme? Or stay with the MEP-based scheme?