broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

Reprocess 10 plates because 36 features are missing #88

Open shntnu opened 2 years ago

shntnu commented 2 years ago

@EchteRobert reported this:

Q about the LINCS dataset: I have run into plates which contain slightly fewer measured features than the bulk of the plates, i.e., 1745 instead of 1781. Is this a known issue? All of the plates that have 1745 features use platemap “C-7161-01-LM6-001”. See the thread for missing features. Note that these numbers are after preprocessing, so some features (especially Image features) may not be included in these 1781/1745.

``` {‘Cells_RadialDistribution_FracAtD_DNA_1of4’, ‘Cells_RadialDistribution_FracAtD_DNA_2of4’, ‘Cells_RadialDistribution_FracAtD_DNA_3of4’, ‘Cells_RadialDistribution_FracAtD_DNA_4of4’, ‘Cells_RadialDistribution_MeanFrac_DNA_1of4’, ‘Cells_RadialDistribution_MeanFrac_DNA_2of4’, ‘Cells_RadialDistribution_MeanFrac_DNA_3of4’, ‘Cells_RadialDistribution_MeanFrac_DNA_4of4’, ‘Cells_RadialDistribution_RadialCV_DNA_1of4’, ‘Cells_RadialDistribution_RadialCV_DNA_2of4’, ‘Cells_RadialDistribution_RadialCV_DNA_3of4’, ‘Cells_RadialDistribution_RadialCV_DNA_4of4’} {‘Cytoplasm_RadialDistribution_FracAtD_DNA_1of4’, ‘Cytoplasm_RadialDistribution_FracAtD_DNA_2of4’, ‘Cytoplasm_RadialDistribution_FracAtD_DNA_3of4’, ‘Cytoplasm_RadialDistribution_FracAtD_DNA_4of4’, ‘Cytoplasm_RadialDistribution_MeanFrac_DNA_1of4’, ‘Cytoplasm_RadialDistribution_MeanFrac_DNA_2of4’, ‘Cytoplasm_RadialDistribution_MeanFrac_DNA_3of4’, ‘Cytoplasm_RadialDistribution_MeanFrac_DNA_4of4’, ‘Cytoplasm_RadialDistribution_RadialCV_DNA_1of4’, ‘Cytoplasm_RadialDistribution_RadialCV_DNA_2of4’, ‘Cytoplasm_RadialDistribution_RadialCV_DNA_3of4’, ‘Cytoplasm_RadialDistribution_RadialCV_DNA_4of4’} {‘Nuclei_RadialDistribution_FracAtD_DNA_1of4’, ‘Nuclei_RadialDistribution_FracAtD_DNA_2of4’, ‘Nuclei_RadialDistribution_FracAtD_DNA_3of4’, ‘Nuclei_RadialDistribution_FracAtD_DNA_4of4’, ‘Nuclei_RadialDistribution_MeanFrac_DNA_1of4’, ‘Nuclei_RadialDistribution_MeanFrac_DNA_2of4’, ‘Nuclei_RadialDistribution_MeanFrac_DNA_3of4’, ‘Nuclei_RadialDistribution_MeanFrac_DNA_4of4’, ‘Nuclei_RadialDistribution_RadialCV_DNA_1of4’, ‘Nuclei_RadialDistribution_RadialCV_DNA_2of4’, ‘Nuclei_RadialDistribution_RadialCV_DNA_3of4’, ‘Nuclei_RadialDistribution_RadialCV_DNA_4of4’} {‘Image_ExecutionTime_03MeasureImageQuality’, ‘Image_ExecutionTime_04MeasureImageQuality’, ‘Image_ExecutionTime_11MeasureObjectIntensityDistribution’, ‘Image_ExecutionTime_12MeasureObjectIntensity’, ‘Image_ExecutionTime_15MeasureObjectNeighbors’, 
‘Image_ImageQuality_Correlation_IllumAGP_10’, ‘Image_ImageQuality_Correlation_IllumAGP_20’, ‘Image_ImageQuality_Correlation_IllumAGP_5’, ‘Image_ImageQuality_Correlation_IllumAGP_50’, ‘Image_ImageQuality_Correlation_IllumDNA_10’, ‘Image_ImageQuality_Correlation_IllumDNA_20’, ‘Image_ImageQuality_Correlation_IllumDNA_5’, ‘Image_ImageQuality_Correlation_IllumDNA_50’, ‘Image_ImageQuality_Correlation_IllumER_10’, ‘Image_ImageQuality_Correlation_IllumER_20’, ‘Image_ImageQuality_Correlation_IllumER_5’, ‘Image_ImageQuality_Correlation_IllumER_50’, ‘Image_ImageQuality_Correlation_IllumMito_10’, ‘Image_ImageQuality_Correlation_IllumMito_20’, ‘Image_ImageQuality_Correlation_IllumMito_5’, ‘Image_ImageQuality_Correlation_IllumMito_50’, ‘Image_ImageQuality_Correlation_IllumRNA_10’, ‘Image_ImageQuality_Correlation_IllumRNA_20’, ‘Image_ImageQuality_Correlation_IllumRNA_5’, ‘Image_ImageQuality_Correlation_IllumRNA_50’, ‘Image_ImageQuality_Correlation_OrigAGP_10’, ‘Image_ImageQuality_Correlation_OrigAGP_20’, ‘Image_ImageQuality_Correlation_OrigAGP_5’, ‘Image_ImageQuality_Correlation_OrigAGP_50’, ‘Image_ImageQuality_Correlation_OrigDNA_10’, ‘Image_ImageQuality_Correlation_OrigDNA_20’, ‘Image_ImageQuality_Correlation_OrigDNA_5’, ‘Image_ImageQuality_Correlation_OrigDNA_50’, ‘Image_ImageQuality_Correlation_OrigER_10’, ‘Image_ImageQuality_Correlation_OrigER_20’, ‘Image_ImageQuality_Correlation_OrigER_5’, ‘Image_ImageQuality_Correlation_OrigER_50’, ‘Image_ImageQuality_Correlation_OrigMito_10’, ‘Image_ImageQuality_Correlation_OrigMito_20’, ‘Image_ImageQuality_Correlation_OrigMito_5’, ‘Image_ImageQuality_Correlation_OrigMito_50’, ‘Image_ImageQuality_Correlation_OrigRNA_10’, ‘Image_ImageQuality_Correlation_OrigRNA_20’, ‘Image_ImageQuality_Correlation_OrigRNA_5’, ‘Image_ImageQuality_Correlation_OrigRNA_50’, ‘Image_ImageQuality_FocusScore_IllumAGP’, ‘Image_ImageQuality_FocusScore_IllumDNA’, ‘Image_ImageQuality_FocusScore_IllumER’, ‘Image_ImageQuality_FocusScore_IllumMito’, 
‘Image_ImageQuality_FocusScore_IllumRNA’, ‘Image_ImageQuality_FocusScore_OrigAGP’, ‘Image_ImageQuality_FocusScore_OrigDNA’, ‘Image_ImageQuality_FocusScore_OrigER’, ‘Image_ImageQuality_FocusScore_OrigMito’, ‘Image_ImageQuality_FocusScore_OrigRNA’, ‘Image_ImageQuality_LocalFocusScore_IllumAGP_10’, ‘Image_ImageQuality_LocalFocusScore_IllumAGP_20’, ‘Image_ImageQuality_LocalFocusScore_IllumAGP_5’, ‘Image_ImageQuality_LocalFocusScore_IllumAGP_50’, ‘Image_ImageQuality_LocalFocusScore_IllumDNA_10’, ‘Image_ImageQuality_LocalFocusScore_IllumDNA_20’, ‘Image_ImageQuality_LocalFocusScore_IllumDNA_5’, ‘Image_ImageQuality_LocalFocusScore_IllumDNA_50’, ‘Image_ImageQuality_LocalFocusScore_IllumER_10’, ‘Image_ImageQuality_LocalFocusScore_IllumER_20’, ‘Image_ImageQuality_LocalFocusScore_IllumER_5’, ‘Image_ImageQuality_LocalFocusScore_IllumER_50’, ‘Image_ImageQuality_LocalFocusScore_IllumMito_10’, ‘Image_ImageQuality_LocalFocusScore_IllumMito_20’, ‘Image_ImageQuality_LocalFocusScore_IllumMito_5’, ‘Image_ImageQuality_LocalFocusScore_IllumMito_50’, ‘Image_ImageQuality_LocalFocusScore_IllumRNA_10’, ‘Image_ImageQuality_LocalFocusScore_IllumRNA_20’, ‘Image_ImageQuality_LocalFocusScore_IllumRNA_5’, ‘Image_ImageQuality_LocalFocusScore_IllumRNA_50’, ‘Image_ImageQuality_LocalFocusScore_OrigAGP_10’, ‘Image_ImageQuality_LocalFocusScore_OrigAGP_20’, ‘Image_ImageQuality_LocalFocusScore_OrigAGP_5’, ‘Image_ImageQuality_LocalFocusScore_OrigAGP_50’, ‘Image_ImageQuality_LocalFocusScore_OrigDNA_10’, ‘Image_ImageQuality_LocalFocusScore_OrigDNA_20’, ‘Image_ImageQuality_LocalFocusScore_OrigDNA_5’, ‘Image_ImageQuality_LocalFocusScore_OrigDNA_50’, ‘Image_ImageQuality_LocalFocusScore_OrigER_10’, ‘Image_ImageQuality_LocalFocusScore_OrigER_20’, ‘Image_ImageQuality_LocalFocusScore_OrigER_5’, ‘Image_ImageQuality_LocalFocusScore_OrigER_50’, ‘Image_ImageQuality_LocalFocusScore_OrigMito_10’, ‘Image_ImageQuality_LocalFocusScore_OrigMito_20’, ‘Image_ImageQuality_LocalFocusScore_OrigMito_5’, 
‘Image_ImageQuality_LocalFocusScore_OrigMito_50’, ‘Image_ImageQuality_LocalFocusScore_OrigRNA_10’, ‘Image_ImageQuality_LocalFocusScore_OrigRNA_20’, ‘Image_ImageQuality_LocalFocusScore_OrigRNA_5’, ‘Image_ImageQuality_LocalFocusScore_OrigRNA_50’, ‘Image_ImageQuality_MADIntensity_IllumAGP’, ‘Image_ImageQuality_MADIntensity_IllumDNA’, ‘Image_ImageQuality_MADIntensity_IllumER’, ‘Image_ImageQuality_MADIntensity_IllumMito’, ‘Image_ImageQuality_MADIntensity_IllumRNA’, ‘Image_ImageQuality_MADIntensity_OrigAGP’, ‘Image_ImageQuality_MADIntensity_OrigDNA’, ‘Image_ImageQuality_MADIntensity_OrigER’, ‘Image_ImageQuality_MADIntensity_OrigMito’, ‘Image_ImageQuality_MADIntensity_OrigRNA’, ‘Image_ImageQuality_MaxIntensity_IllumAGP’, ‘Image_ImageQuality_MaxIntensity_IllumDNA’, ‘Image_ImageQuality_MaxIntensity_IllumER’, ‘Image_ImageQuality_MaxIntensity_IllumMito’, ‘Image_ImageQuality_MaxIntensity_IllumRNA’, ‘Image_ImageQuality_MaxIntensity_OrigAGP’, ‘Image_ImageQuality_MaxIntensity_OrigDNA’, ‘Image_ImageQuality_MaxIntensity_OrigER’, ‘Image_ImageQuality_MaxIntensity_OrigMito’, ‘Image_ImageQuality_MaxIntensity_OrigRNA’, ‘Image_ImageQuality_MeanIntensity_IllumAGP’, ‘Image_ImageQuality_MeanIntensity_IllumDNA’, ‘Image_ImageQuality_MeanIntensity_IllumER’, ‘Image_ImageQuality_MeanIntensity_IllumMito’, ‘Image_ImageQuality_MeanIntensity_IllumRNA’, ‘Image_ImageQuality_MeanIntensity_OrigAGP’, ‘Image_ImageQuality_MeanIntensity_OrigDNA’, ‘Image_ImageQuality_MeanIntensity_OrigER’, ‘Image_ImageQuality_MeanIntensity_OrigMito’, ‘Image_ImageQuality_MeanIntensity_OrigRNA’, ‘Image_ImageQuality_MedianIntensity_IllumAGP’, ‘Image_ImageQuality_MedianIntensity_IllumDNA’, ‘Image_ImageQuality_MedianIntensity_IllumER’, ‘Image_ImageQuality_MedianIntensity_IllumMito’, ‘Image_ImageQuality_MedianIntensity_IllumRNA’, ‘Image_ImageQuality_MedianIntensity_OrigAGP’, ‘Image_ImageQuality_MedianIntensity_OrigDNA’, ‘Image_ImageQuality_MedianIntensity_OrigER’, ‘Image_ImageQuality_MedianIntensity_OrigMito’, 
‘Image_ImageQuality_MedianIntensity_OrigRNA’, ‘Image_ImageQuality_MinIntensity_IllumAGP’, ‘Image_ImageQuality_MinIntensity_IllumDNA’, ‘Image_ImageQuality_MinIntensity_IllumER’, ‘Image_ImageQuality_MinIntensity_IllumMito’, ‘Image_ImageQuality_MinIntensity_IllumRNA’, ‘Image_ImageQuality_MinIntensity_OrigAGP’, ‘Image_ImageQuality_MinIntensity_OrigDNA’, ‘Image_ImageQuality_MinIntensity_OrigER’, ‘Image_ImageQuality_MinIntensity_OrigMito’, ‘Image_ImageQuality_MinIntensity_OrigRNA’, ‘Image_ImageQuality_PercentMaximal_IllumAGP’, ‘Image_ImageQuality_PercentMaximal_IllumDNA’, ‘Image_ImageQuality_PercentMaximal_IllumER’, ‘Image_ImageQuality_PercentMaximal_IllumMito’, ‘Image_ImageQuality_PercentMaximal_IllumRNA’, ‘Image_ImageQuality_PercentMaximal_OrigAGP’, ‘Image_ImageQuality_PercentMaximal_OrigDNA’, ‘Image_ImageQuality_PercentMaximal_OrigER’, ‘Image_ImageQuality_PercentMaximal_OrigMito’, ‘Image_ImageQuality_PercentMaximal_OrigRNA’, ‘Image_ImageQuality_PercentMinimal_IllumAGP’, ‘Image_ImageQuality_PercentMinimal_IllumDNA’, ‘Image_ImageQuality_PercentMinimal_IllumER’, ‘Image_ImageQuality_PercentMinimal_IllumMito’, ‘Image_ImageQuality_PercentMinimal_IllumRNA’, ‘Image_ImageQuality_PercentMinimal_OrigAGP’, ‘Image_ImageQuality_PercentMinimal_OrigDNA’, ‘Image_ImageQuality_PercentMinimal_OrigER’, ‘Image_ImageQuality_PercentMinimal_OrigMito’, ‘Image_ImageQuality_PercentMinimal_OrigRNA’, ‘Image_ImageQuality_PowerLogLogSlope_IllumAGP’, ‘Image_ImageQuality_PowerLogLogSlope_IllumDNA’, ‘Image_ImageQuality_PowerLogLogSlope_IllumER’, ‘Image_ImageQuality_PowerLogLogSlope_IllumMito’, ‘Image_ImageQuality_PowerLogLogSlope_IllumRNA’, ‘Image_ImageQuality_PowerLogLogSlope_OrigAGP’, ‘Image_ImageQuality_PowerLogLogSlope_OrigDNA’, ‘Image_ImageQuality_PowerLogLogSlope_OrigER’, ‘Image_ImageQuality_PowerLogLogSlope_OrigMito’, ‘Image_ImageQuality_PowerLogLogSlope_OrigRNA’, ‘Image_ImageQuality_Scaling_IllumAGP’, ‘Image_ImageQuality_Scaling_IllumDNA’, ‘Image_ImageQuality_Scaling_IllumER’, 
‘Image_ImageQuality_Scaling_IllumMito’, ‘Image_ImageQuality_Scaling_IllumRNA’, ‘Image_ImageQuality_Scaling_OrigAGP’, ‘Image_ImageQuality_Scaling_OrigDNA’, ‘Image_ImageQuality_Scaling_OrigER’, ‘Image_ImageQuality_Scaling_OrigMito’, ‘Image_ImageQuality_Scaling_OrigRNA’, ‘Image_ImageQuality_StdIntensity_IllumAGP’, ‘Image_ImageQuality_StdIntensity_IllumDNA’, ‘Image_ImageQuality_StdIntensity_IllumER’, ‘Image_ImageQuality_StdIntensity_IllumMito’, ‘Image_ImageQuality_StdIntensity_IllumRNA’, ‘Image_ImageQuality_StdIntensity_OrigAGP’, ‘Image_ImageQuality_StdIntensity_OrigDNA’, ‘Image_ImageQuality_StdIntensity_OrigER’, ‘Image_ImageQuality_StdIntensity_OrigMito’, ‘Image_ImageQuality_StdIntensity_OrigRNA’, ‘Image_ImageQuality_ThresholdOtsu_OrigDNA_2W’, ‘Image_ImageQuality_ThresholdOtsu_OrigRNA_3FW’, ‘Image_ImageQuality_TotalArea_IllumAGP’, ‘Image_ImageQuality_TotalArea_IllumDNA’, ‘Image_ImageQuality_TotalArea_IllumER’, ‘Image_ImageQuality_TotalArea_IllumMito’, ‘Image_ImageQuality_TotalArea_IllumRNA’, ‘Image_ImageQuality_TotalArea_OrigAGP’, ‘Image_ImageQuality_TotalArea_OrigDNA’, ‘Image_ImageQuality_TotalArea_OrigER’, ‘Image_ImageQuality_TotalArea_OrigMito’, ‘Image_ImageQuality_TotalArea_OrigRNA’, ‘Image_ImageQuality_TotalIntensity_IllumAGP’, ‘Image_ImageQuality_TotalIntensity_IllumDNA’, ‘Image_ImageQuality_TotalIntensity_IllumER’, ‘Image_ImageQuality_TotalIntensity_IllumMito’, ‘Image_ImageQuality_TotalIntensity_IllumRNA’, ‘Image_ImageQuality_TotalIntensity_OrigAGP’, ‘Image_ImageQuality_TotalIntensity_OrigDNA’, ‘Image_ImageQuality_TotalIntensity_OrigER’, ‘Image_ImageQuality_TotalIntensity_OrigMito’, ‘Image_ImageQuality_TotalIntensity_OrigRNA’, ‘Image_ModuleError_03MeasureImageQuality’, ‘Image_ModuleError_04MeasureImageQuality’, ‘Image_ModuleError_11MeasureObjectIntensityDistribution’, ‘Image_ModuleError_12MeasureObjectIntensity’, ‘Image_ModuleError_15MeasureObjectNeighbors’,} ```

@bethac07 said:

I don't think it's a known issue, but based on the error messages I see how it happened. My guess is that it was a pilot batch and/or a batch that was rerun later. If you search Slack for the barcode names there might be a message about them (not certain, but non-zero).

shntnu commented 2 years ago

We are having this discussion in Slack https://broadinstitute.slack.com/archives/C3QFQ3WQM/p1663329404078119?thread_ts=1663317811.106069&cid=C3QFQ3WQM

@EchteRobert will report back here with our conclusion once we are set.

EchteRobert commented 2 years ago

You can find the main conclusion of this issue below.

Beth (she/her)

I don't think it's a known issue, but based on the error messages I see how it happened. My guess is that it was a pilot batch and/or a batch that was rerun later, and at the time 1) the MeasureImageQuality modules were turned off (but not deleted), 2) a couple of the measurement modules were rearranged, and 3) DNA was un-checked in the MeasureObjectIntensityDistribution module.

Supporting discussion (for the full discussion see the Slack thread)

shantanu

The difference is 1781-1745=36 features, all of which are RadialDistribution features in the DNA channel across all the three compartments. I assume this is in the SQLite files, and not in the aggregated CSV files? Can you paste the list of plates where these features are missing?
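
The count of 36 can be sanity-checked from the naming pattern in the list at the top of this issue: 3 compartments x 3 measurements x 4 radial bins. A minimal Python sketch, assuming that naming pattern holds for every missing feature (the helper name is hypothetical and not part of any pipeline):

```python
from itertools import product

# Values read off the missing-feature list in this issue.
COMPARTMENTS = ("Cells", "Cytoplasm", "Nuclei")
MEASUREMENTS = ("FracAtD", "MeanFrac", "RadialCV")
N_BINS = 4


def radial_distribution_dna_features():
    """Build the DNA-channel RadialDistribution feature names,
    one per compartment x measurement x radial bin."""
    return [
        f"{compartment}_RadialDistribution_{measurement}_DNA_{bin_index}of{N_BINS}"
        for compartment, measurement, bin_index in product(
            COMPARTMENTS, MEASUREMENTS, range(1, N_BINS + 1)
        )
    ]


missing = radial_distribution_dna_features()
print(len(missing))  # 3 * 3 * 4 = 36
```

The generated list can then be compared against the columns of any plate to confirm which of the 36 are absent.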

Robert

Yes, this is in the SQLite files. I have not checked the aggregated CSV files. They are missing in these plates: SQ00015116, SQ00015117, SQ00015118, SQ00015119 (and I expect also SQ00015125 due to it having the same platemap but I haven’t downloaded this one yet)

Beth

https://broadinstitute.slack.com/archives/C3QFDHXC4/p1466877949000013 Looks like those were part of the very first batch of plates. So my guess is that the pipeline got slightly changed after.

shantanu

I confirm that the aggregated files have the same issue

x <- read_csv("https://cellpainting-gallery.s3.amazonaws.com/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/SQ00015116/SQ00015116.csv", n_max = 2)
Rows: 2 Columns: 1749
y <- read_csv("https://cellpainting-gallery.s3.amazonaws.com/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/SQ00015100/SQ00015100.csv", n_max = 2)
Rows: 2 Columns: 1785
> 1749-1785
[1] -36
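
Beyond counting columns, the two headers can be compared directly to name what is missing. A minimal Python sketch using only the standard library; the two in-memory CSVs are toy stand-ins with made-up contents (including the `Metadata_*` column names), not the real plate files:

```python
import csv
import io


def read_header(handle):
    """Return the first row of a CSV file-like object as a list of column names."""
    return next(csv.reader(handle))


def missing_columns(reference_header, other_header):
    """Columns present in the reference plate but absent from the other plate,
    kept in the reference plate's column order."""
    other = set(other_header)
    return [name for name in reference_header if name not in other]


# Toy stand-ins for a complete plate and a plate with fewer columns (hypothetical data).
complete_plate = io.StringIO(
    "Metadata_Plate,Metadata_Well,Cells_AreaShape_Area,"
    "Cells_RadialDistribution_FracAtD_DNA_1of4\n"
    "SQ00015100,A01,1.0,0.5\n"
)
short_plate = io.StringIO(
    "Metadata_Plate,Metadata_Well,Cells_AreaShape_Area\n"
    "SQ00015116,A01,1.0\n"
)

diff = missing_columns(read_header(complete_plate), read_header(short_plate))
print(diff)
```

Only the header row is read, so the same approach stays cheap on the full aggregated files.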

Beth

Yeah, the modules in question were likely not run, which is why they don't have module execution times or error message reports in the image features

shantanu

Ah, that's right. For the record, these are the modules: {'Image_ExecutionTime_03MeasureImageQuality', 'Image_ExecutionTime_04MeasureImageQuality', 'Image_ExecutionTime_11MeasureObjectIntensityDistribution', 'Image_ExecutionTime_12MeasureObjectIntensity', 'Image_ExecutionTime_15MeasureObjectNeighbors'}

shantanu

But out of the last 3, it looks like only Image_ExecutionTime_11MeasureObjectIntensityDistribution was the one that wasn't run, right Beth (she/her)? Because all the missing features are RadialDistribution features in the DNA channel across all three compartments, which is what we measure in MeasureObjectIntensityDistribution.

shantanu

But there are no missing features wrt MeasureObjectIntensity and MeasureObjectNeighbors

Beth (she/her)

Right, but if the order of those was changed, you wouldn't have an exact match column. But you might have a similarly named one with just a slightly different number. Number = position in the pipeline. Probably intensity distribution WAS run, just with DNA not checked. My guess, Robert, is that the other plates mentioned in the thing I linked will have the same issue.
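
To illustrate the point about the number being the position in the pipeline: stripping the numeric prefix lets columns be matched by module name rather than by exact column name. A minimal Python sketch, assuming the `Image_<ExecutionTime|ModuleError>_<NN><ModuleName>` pattern seen in this thread. The reference columns are the ones listed above; the positions in `other` are invented for illustration and are not taken from the actual plates:

```python
import re
from collections import defaultdict

# Assumed pattern: Image_<ExecutionTime|ModuleError>_<two-digit position><ModuleName>
MODULE_COLUMN = re.compile(r"^Image_(ExecutionTime|ModuleError)_(\d+)(\D.*)$")


def positions_by_module(columns):
    """Map module name -> sorted pipeline positions found in the column names."""
    positions = defaultdict(set)
    for name in columns:
        match = MODULE_COLUMN.match(name)
        if match:
            _, position, module = match.groups()
            positions[module].add(int(position))
    return {module: sorted(found) for module, found in positions.items()}


def compare_positions(reference_columns, other_columns):
    """Modules whose positions differ between the two column sets.
    An empty list on the right means the module has no column at all."""
    ref = positions_by_module(reference_columns)
    other = positions_by_module(other_columns)
    return {
        module: (ref[module], other.get(module, []))
        for module in ref
        if ref[module] != other.get(module, [])
    }


reference = [
    "Image_ExecutionTime_03MeasureImageQuality",
    "Image_ExecutionTime_04MeasureImageQuality",
    "Image_ExecutionTime_11MeasureObjectIntensityDistribution",
    "Image_ExecutionTime_12MeasureObjectIntensity",
    "Image_ExecutionTime_15MeasureObjectNeighbors",
]
# Hypothetical rearranged pipeline: positions below are made up.
other = [
    "Image_ExecutionTime_09MeasureObjectIntensityDistribution",
    "Image_ExecutionTime_10MeasureObjectIntensity",
    "Image_ExecutionTime_13MeasureObjectNeighbors",
]

changes = compare_positions(reference, other)
for module, (ref_pos, other_pos) in changes.items():
    print(module, ref_pos, other_pos)
```

A module that shows up with a different position was probably moved, while one with no position at all was probably disabled or removed.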

shantanu

aws s3 ls s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/|tr -s " "|cut -c6-15|sort > /tmp/platelist
parallel -a /tmp/platelist "echo -n {1}; aws s3 cp s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/{1}/{1}.csv -|csvcut -n|wc -l"|grep -v "download failed"|tr -s " "|tr " " ","|csvcut -c 2,1|sort -n > /tmp/counts
cat /tmp/counts |cut -d"," -f1|uniq -c
  10 1749
 126 1785
cat /tmp/counts |grep 1749
1749,SQ00015116
1749,SQ00015117
1749,SQ00015118
1749,SQ00015119
1749,SQ00015120
1749,SQ00015121
1749,SQ00015122
1749,SQ00015123
1749,SQ00015125
1749,SQ00015126
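
The grouping step at the end of that pipeline can also be sketched in Python. This is a minimal illustration, assuming input lines in the same `count,plate` form as `/tmp/counts`. The sample lines are the ten plates listed above plus SQ00015100, the 1785-column plate from the earlier `read_csv` check; the function name is hypothetical:

```python
from collections import defaultdict


def group_plates_by_column_count(lines):
    """Group plate barcodes by column count from 'count,plate' lines."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        count, plate = line.split(",", 1)
        groups[int(count)].append(plate)
    return dict(groups)


sample = """\
1749,SQ00015116
1749,SQ00015117
1749,SQ00015118
1749,SQ00015119
1749,SQ00015120
1749,SQ00015121
1749,SQ00015122
1749,SQ00015123
1749,SQ00015125
1749,SQ00015126
1785,SQ00015100
""".splitlines()

groups = group_plates_by_column_count(sample)
for count in sorted(groups):
    print(count, len(groups[count]), groups[count])
```
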
shntnu commented 2 years ago

In the future, we might want to reprocess these 10 plates and include the missing features

bethac07 commented 2 years ago

This should make it possible if we need to - https://hub.docker.com/layers/cellprofiler/cellprofiler/2.3.1/images/sha256-d790b21623654e351390e283e3243860a7595120f7ba1e5f1df36b1277ea0cf1?context=explore

gwaybio commented 2 years ago

Pointing here https://github.com/broadinstitute/lincs-cell-painting/issues/3#issuecomment-591994451 as these plates seemed to have posed difficulties in the past.