decide on leveling scheme

kdaily commented 5 years ago

I've started adding levels. We made up our own levels for the MEP data; the GDC now uses a different scheme (e.g., level 1 is raw data, and it goes up from there). Since much of this data is now in the genomics realm, do we want to use the GDC scheme, or stay with the MEP-based one?

danielderrick commented 5 years ago

The lower MEP levels are cell-level, but only the IF/cycIF data is cell-level, so I think it makes sense to go with something like the GDC levels, which are:

| Level Number | Definition |
| --- | --- |
| 1 | Raw data |
| 2 | Normalized data |
| 3 | Aggregated data |
| 4 | Regions of Interest data |
| 0 | No Level |

I don't think that these descriptions exactly describe the data levels we want to make available, though. I am thinking that we want to have something more like this:

| Data level | Definition | RNAseq | ATACseq | RPPA | cycIF | GCP | L1000 | IF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | raw data | bam | bam | ? | images | ? | ? | images |
| 2 | quantified data | integer counts | integer counts | ? | cell data, no batch correction | ? | Level 2/ GEX data | cell data, no batch correction |
| 3 | normalized data | log2(fpkm + 1) | log2(fpkm + 1) | data sent by MD Anderson (excel sheet) | cell data, batch corrected | unnormalized data from Jake Jaffe | Level 3/ QNORM data | cell data, batch corrected |
| 4 | aggregated normalized data | NA | NA | NA | well-level data, batch corrected | NA | NA | well-level data, batch corrected |
| 5 | EGF normalized | … |
| 6 | EGF normalized, median summarized | ... |

@markdane what do you think?

danielderrick commented 5 years ago

@kdaily, Mark and I discussed the leveling scheme this morning. We settled on this, which follows the GDC levels a bit better:

| Data level | Definition | RNAseq | ATACseq | RPPA | cycIF | GCP | L1000 | IF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | No Level | bam | bam | MD Anderson excel file | raw images | GCP_MCF10a_raw.gct file | LISS-normalized gctx file | raw images |
| 1 | Raw data | integer counts | integer counts | RawLog2 sheet from excel (tsv) | quantified features, no batch correction | GCP_MCF10a_raw.gct file as tsv | LISS-normalized gctx file as tsv | quantified features, no norm |
| 2 | Normalized data | log2(fpkm + 1) | log2(fpkm + 1) | normlog2 sheet from excel (tsv) | batch-corrected cell-level features | MS1-normalized GCP | quantile-normalized data | batch-corrected cell-level features |
| 3A | aggregated to sample level | log2(fpkm + 1) | log2(fpkm + 1) | normlog2 sheet from excel (tsv) | well-level batch-corrected features | MS1-normalized GCP | quantile-normalized data | well-level features |
| 3B | aggregated to condition | condition median/mean | … |
| 4A | EGF normalized | EGF-normalized level 3A | … |
| 4B | EGF normalized, median summarized | condition median/mean of 4A | … |

Some notes on that:

level 0: The rawest form of the data that we have. This would be the images, bam files, and any raw data we received from other groups (e.g., the Excel files from MD Anderson).
level 1: The rawest data that has been quantified. Everything at this level (and beyond) should be a matrix. Some data at this level will be identical to what is in level 0, but reformatted (no Excel files).
level 2: Level 1 data with normalization: sequencing data as log2(fpkm + 1) matrices, and IF/cycIF data batch-corrected at the cell level.
level 3A: Level 2 data with IF/cycIF aggregated from the cell level to the well/sample level. For non-imaging data, this level is identical to level 2.
level 3B: Level 3A data, mean/median summarized to the treatment level.
level 4A: Level 3A data normalized to replicate-specific EGF conditions.
level 4B: Level 4A data, mean/median summarized to the treatment level.
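
To make the relationship between levels 2 through 4B concrete, here is a minimal sketch in plain Python. It is only an illustration, not the pipeline actually used for this data. The field names (`well`, `replicate`, `treatment`, `value`), the `ligand_X` treatment, and the toy numbers are all made up. The thread does not say how EGF normalization is computed, so the sketch assumes it means subtracting the same replicate's EGF value on an already log-scaled feature, with exactly one EGF well per replicate.

```python
import math
from collections import defaultdict
from statistics import median


def log2_fpkm(fpkm):
    """Level 2 transform for sequencing data: log2(fpkm + 1)."""
    return math.log2(fpkm + 1)


def aggregate(rows, keys):
    """Median-summarize row['value'] within groups defined by `keys`."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["value"])
    return [dict(zip(keys, group), value=median(values))
            for group, values in groups.items()]


def egf_normalize(rows, control="EGF"):
    """Level 4A (ASSUMPTION): subtract the same replicate's EGF value.

    Assumes one EGF row per replicate; the real method may differ.
    """
    baseline = {r["replicate"]: r["value"]
                for r in rows if r["treatment"] == control}
    return [dict(r, value=r["value"] - baseline[r["replicate"]]) for r in rows]


# Hypothetical level 2 data: batch-corrected, cell-level features.
cells = [
    {"well": "A1", "replicate": 1, "treatment": "EGF", "value": 1.0},
    {"well": "A1", "replicate": 1, "treatment": "EGF", "value": 3.0},
    {"well": "A2", "replicate": 1, "treatment": "ligand_X", "value": 4.0},
    {"well": "A2", "replicate": 1, "treatment": "ligand_X", "value": 6.0},
    {"well": "B1", "replicate": 2, "treatment": "EGF", "value": 3.0},
    {"well": "B1", "replicate": 2, "treatment": "EGF", "value": 5.0},
    {"well": "B2", "replicate": 2, "treatment": "ligand_X", "value": 8.0},
    {"well": "B2", "replicate": 2, "treatment": "ligand_X", "value": 10.0},
]

level_3a = aggregate(cells, ["well", "replicate", "treatment"])  # cell -> well
level_3b = aggregate(level_3a, ["treatment"])                    # well -> treatment
level_4a = egf_normalize(level_3a)                               # per-replicate EGF baseline
level_4b = aggregate(level_4a, ["treatment"])                    # 4A -> treatment
```

For sequencing assays, level 3A would simply reuse the level 2 matrix, since there is no cell-level data to aggregate.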

kdaily commented 5 years ago

Thanks! I think I can apply this.

kdaily commented 5 years ago

I don't see any integer count files for ATACseq. Can you point me to those? The folder called 'no normalization' doesn't have integers in it. https://www.synapse.org/#!Synapse:syn18485477

kdaily commented 5 years ago

I need help identifying which files go with which level for RNAseq (beyond level 0/1).

kdaily commented 5 years ago

In going through this, I found that having levels with sub-levels is very confusing. I suggest collapsing the sub-levels to the same integer level.
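
One possible reading of this suggestion is that labels such as 3A and 3B would both be annotated as level 3. A minimal sketch of that mapping follows; `collapse_level` is a hypothetical helper, not something that exists in the portal code. Another reading would be to renumber the sub-levels consecutively (3, 4, 5, 6), as in the first proposal above.

```python
import re


def collapse_level(label):
    """Map a level label such as '3A' or '4B' to its integer level.

    Hypothetical helper: drops a single trailing letter, so '3A' and '3B'
    both become 3. Labels without a letter ('0', '2') pass through unchanged.
    """
    match = re.fullmatch(r"(\d+)[A-Za-z]?", label.strip())
    if match is None:
        raise ValueError(f"unrecognized level label: {label!r}")
    return int(match.group(1))


collapsed = {label: collapse_level(label)
             for label in ["0", "1", "2", "3A", "3B", "4A", "4B"]}
```

Under this reading, files at 3A and 3B can no longer be told apart by level alone, so the distinction would have to live in a separate annotation.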

MEP-LINCS / knowledgeportal

decide on leveling scheme #47