Open MattiasSehlstedt opened 3 years ago
The datasets are https://github.com/broadinstitute/neural-profiling/blob/main/pre-trained/efficient_net/aggregated/aggregated_efficientnet_median.csv and https://github.com/broadinstitute/lincs-cell-painting/blob/master/consensus/2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_modz_feature_select_dmso.csv.gz
Yeah, this is confusing because those are two different stages of data: level 5 above and level 3 below.
See: https://github.com/broadinstitute/lincs-cell-painting/tree/master/profiles https://github.com/broadinstitute/neural-profiling/wiki/01_Baseline
Top: Technical replicates are already aggregated, so the 6 different doses are visible.
Bottom: Technical replicates are visible here. The other doses have been deleted. See my subselection notebook.
@MattiasSehlstedt Hope that makes it clear.
Also, if you use @ symbols, I will respond faster next time :)
@michaelbornholdt So I guess that would mean that I would either have to modz your efficientnet data or work with https://github.com/broadinstitute/lincs-cell-painting/blob/master/profiles/2016_04_01_a549_48hr_batch1.dvc if I want a one-to-one row relation between some CellProfiler data and your Efficientnet data?
Yes, correct! It depends on what kind of analysis you want to do. If you are running something like Enrichment, which compares compounds (level 5), then just aggregate the Efficientnet profiles.
Also, you should be using the spherized CP data instead of the non-spherized.
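For reference, here is a minimal sketch of what "just aggregate the Efficientnet profiles" could look like in pandas. It is only an illustration on toy data: the feature names `efficientnet_0`/`efficientnet_1` and the dose column `Metadata_dose` are assumed names, not taken from the real file, and a plain median is used as a simple stand-in for the MODZ consensus that the CellProfiler level 5 file uses.

```python
import pandas as pd

# Toy stand-in for well-level (level 3) profiles.
# "Metadata_dose" and the "efficientnet_*" names are hypothetical.
df = pd.DataFrame({
    "Metadata_broad_sample": ["BRD-A", "BRD-A", "BRD-A", "BRD-B", "BRD-B"],
    "Metadata_Plate": ["P1", "P2", "P3", "P1", "P2"],
    "Metadata_dose": [1, 1, 1, 1, 1],
    "efficientnet_0": [0.1, 0.3, 0.2, 1.0, 2.0],
    "efficientnet_1": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One consensus row per compound and dose.
group_cols = ["Metadata_broad_sample", "Metadata_dose"]
# Everything that is not metadata is treated as a feature.
feature_cols = [c for c in df.columns if not c.startswith("Metadata_")]

# Median across replicate wells (simple stand-in for MODZ).
consensus = df.groupby(group_cols, as_index=False)[feature_cols].median()
print(consensus)
```

After this step each compound/dose pair has a single row, which is the shape needed to line up with a consensus CellProfiler table.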
Why does the Efficientnet data contain several replicates of the same compound across different "Metadata_Plate" values, while the CellProfiler data does not?
If one loads each of the two datasets and runs the query
display(df[df.Metadata_broad_sample == 'BRD-K05804044-001-06-0'])
then the CellProfiler data will return 6 rows, which differ in their dose concentration. The Efficientnet data will return 5 rows, which differ in their "Metadata_Plate" and "Metadata_Treatment_Replicate" values.
How come there seem to be replicates within the Efficientnet data when the data is aggregated based on wells? And if the Efficientnet values are aggregations themselves, how do they tie into the CellProfiler data and its lone row?
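A quick way to see which metadata columns actually differ between the rows returned for one compound is sketched below. It runs on toy data rather than the real files. `varying_columns` is a hypothetical helper and `Metadata_dose` is an assumed column name. The other column names are the ones mentioned above.

```python
import pandas as pd

def varying_columns(df, sample_id):
    """Return the Metadata_ columns that take more than one value for a compound."""
    rows = df[df["Metadata_broad_sample"] == sample_id]
    meta_cols = [c for c in rows.columns if c.startswith("Metadata_")]
    counts = rows[meta_cols].nunique(dropna=False)
    return sorted(counts[counts > 1].index)

# Toy rows shaped like well-level data: same compound and dose on different plates.
toy = pd.DataFrame({
    "Metadata_broad_sample": ["BRD-K05804044-001-06-0"] * 3 + ["DMSO"],
    "Metadata_Plate": ["P1", "P2", "P3", "P1"],
    "Metadata_Treatment_Replicate": [1, 2, 3, 1],
    "Metadata_dose": [1, 1, 1, 0],  # hypothetical dose column
})

print(varying_columns(toy, "BRD-K05804044-001-06-0"))
```

If only plate and replicate columns show up, the rows are replicate wells of one dose. If a dose column shows up instead, the rows are per-dose consensus profiles.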