broadinstitute / neural-profiling


Difference in the data between the CellProfiler data and the Efficientnet data #9

Open MattiasSehlstedt opened 3 years ago

MattiasSehlstedt commented 3 years ago

Why does the Efficientnet data contain several replicates spread across different "Metadata_Plate" values, while the CellProfiler data does not?

If one loads each of the two datasets and runs the query display(df[df.Metadata_broad_sample == 'BRD-K05804044-001-06-0']), the CellProfiler data returns 6 rows, which differ only in dose concentration.

image

The Efficientnet data returns 5 rows, which differ in their "Metadata_Plate" and "Metadata_Treatment_Replicate" values.

image

Why do there seem to be replicates in the Efficientnet data when the data is aggregated by well? And if the Efficientnet values are themselves aggregations, how do they relate to the CellProfiler data, which has a single row per dose?
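The comparison described above can be sketched as a small helper that reports which metadata columns vary across the rows returned for one compound. This is only an illustration: the two DataFrames are toy stand-ins rather than the real files, `varying_columns` is a hypothetical helper, and the column name `Metadata_dose_recode` is an assumption about how dose is stored.

```python
import pandas as pd

SAMPLE = "BRD-K05804044-001-06-0"


def varying_columns(df, sample_id, id_col="Metadata_broad_sample"):
    """Return the Metadata_* columns whose values differ across one compound's rows."""
    rows = df[df[id_col] == sample_id]
    meta_cols = [c for c in rows.columns if c.startswith("Metadata_")]
    return [c for c in meta_cols if rows[c].nunique(dropna=False) > 1]


# Toy stand-in for the CellProfiler consensus file: one row per dose.
# "Metadata_dose_recode" is an assumed column name.
cp_toy = pd.DataFrame({
    "Metadata_broad_sample": [SAMPLE] * 3,
    "Metadata_dose_recode": [1, 2, 3],
    "feature_a": [0.1, 0.2, 0.3],
})

# Toy stand-in for the Efficientnet file: one dose, several plate replicates.
eff_toy = pd.DataFrame({
    "Metadata_broad_sample": [SAMPLE] * 2,
    "Metadata_Plate": ["plate_1", "plate_2"],
    "Metadata_Treatment_Replicate": [1, 2],
    "Metadata_dose_recode": [6, 6],
    "feature_a": [0.4, 0.5],
})

cp_varying = varying_columns(cp_toy, SAMPLE)
eff_varying = varying_columns(eff_toy, SAMPLE)
print(cp_varying)
print(eff_varying)
```

Running the same helper on the real files should make it clear whether rows differ by dose or by plate and replicate.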

MattiasSehlstedt commented 3 years ago

The datasets are https://github.com/broadinstitute/neural-profiling/blob/main/pre-trained/efficient_net/aggregated/aggregated_efficientnet_median.csv and https://github.com/broadinstitute/lincs-cell-painting/blob/master/consensus/2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_modz_feature_select_dmso.csv.gz

michaelbornholdt commented 3 years ago

Yeah, this is confusing because those are two different stages of data: level 5 above and level 3 below.

See: https://github.com/broadinstitute/lincs-cell-painting/tree/master/profiles https://github.com/broadinstitute/neural-profiling/wiki/01_Baseline

Top: Technical replicates are already aggregated, so the 6 different doses are visible.

Bottom: Technical replicates are visible here. The other doses have been removed. See my subselection notebook.
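One quick way to tell which stage a table is at is to count rows per compound and dose: more than one row per group suggests replicate-level data, exactly one suggests a consensus. A minimal sketch on toy data, where `rows_per_treatment` is a hypothetical helper and `Metadata_dose_recode` is an assumed column name:

```python
import pandas as pd


def rows_per_treatment(df, group_cols):
    """Count how many rows exist for each treatment (compound and dose)."""
    return df.groupby(group_cols).size().rename("n_rows").reset_index()


group_cols = ["Metadata_broad_sample", "Metadata_dose_recode"]

# Toy replicate-level table: the same treatment appears on three plates.
level3_toy = pd.DataFrame({
    "Metadata_broad_sample": ["cpd_A"] * 3,
    "Metadata_dose_recode": [6, 6, 6],
    "Metadata_Plate": ["plate_1", "plate_2", "plate_3"],
})

# Toy consensus table: each treatment appears exactly once.
level5_toy = pd.DataFrame({
    "Metadata_broad_sample": ["cpd_A", "cpd_A"],
    "Metadata_dose_recode": [5, 6],
})

level3_counts = rows_per_treatment(level3_toy, group_cols)
level5_counts = rows_per_treatment(level5_toy, group_cols)
print(level3_counts)
print(level5_counts)
```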

michaelbornholdt commented 3 years ago

@MattiasSehlstedt Hope that makes it clear.

Also, if you use @ symbols, I will respond faster next time :)

MattiasSehlstedt commented 3 years ago

@michaelbornholdt So I guess that would mean that I would either have to modz your efficientnet data or work with https://github.com/broadinstitute/lincs-cell-painting/blob/master/profiles/2016_04_01_a549_48hr_batch1.dvc if I want a one-to-one row relation between some CellProfiler data and your Efficientnet data?

michaelbornholdt commented 3 years ago

Yes, correct! It depends on what kind of analysis you want to do. If you are running something like Enrichment, which compares compounds (level 5), then just aggregate the Efficientnet profiles.

Also, you should be using the spherized CP data instead of the non-spherized data.
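Aggregating the replicate-level profiles could look roughly like the sketch below. It uses a plain median across replicates, which is a simpler stand-in and not the same as MODZ (MODZ weights replicates by how well they correlate with each other). The data is a toy example, `median_consensus` is a hypothetical helper, and the feature names and `Metadata_dose_recode` are assumptions.

```python
import pandas as pd


def median_consensus(df, replicate_cols):
    """Collapse replicate rows into one row per treatment using the median.

    Every column that does not start with "Metadata_" is treated as a feature.
    """
    features = [c for c in df.columns if not c.startswith("Metadata_")]
    return df.groupby(replicate_cols, as_index=False)[features].median()


# Toy replicate-level profiles: three replicates of cpd_A, two of cpd_B.
profiles = pd.DataFrame({
    "Metadata_broad_sample": ["cpd_A", "cpd_A", "cpd_A", "cpd_B", "cpd_B"],
    "Metadata_dose_recode": [6, 6, 6, 6, 6],
    "Metadata_Plate": ["plate_1", "plate_2", "plate_3", "plate_1", "plate_2"],
    "efficientnet_0": [1.0, 3.0, 5.0, 2.0, 4.0],
    "efficientnet_1": [0.0, 0.0, 3.0, 1.0, 1.0],
})

consensus = median_consensus(
    profiles, ["Metadata_broad_sample", "Metadata_dose_recode"]
)
print(consensus)
```

The result has one row per compound and dose, which is the shape needed for a one-to-one join against a consensus table built from CellProfiler features.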

michaelbornholdt commented 3 years ago

https://github.com/broadinstitute/lincs-cell-painting/tree/master/spherized_profiles/consensus