broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

Dropping outlier features #65

Closed · shntnu closed this 3 years ago

shntnu commented 3 years ago

MB said:

> I have found an “error” in the LINCS dataset and I was wondering if you knew of this and whether the pycytominer pipeline needs some fixing. I am analyzing the Level 5 consensus data from here. When running the cytominer-eval functions on these data, I noticed some very high correlations. They come from one feature (Nuclei_AreaShape_MedianRadius) that is 10^13 times larger than the others. The image shows a scatter plot of two samples that have a 1.000 similarity but are different compounds.

[image: scatter plot of the two compound profiles showing a 1.000 similarity]

This is almost certainly because the MAD (median absolute deviation) of these features is zero in DMSO (at least for the plates those compounds come from):

https://github.com/cytomining/pycytominer/blob/a04397d9cd7e25828d2f24f986a3386a79e6193d/pycytominer/operations/transform.py#L142
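
For intuition, a minimal sketch (hypothetical numbers) of how a near-zero MAD in the DMSO reference blows up MAD-robustized values:

    import numpy as np

    # hypothetical DMSO values for one feature: almost constant, so MAD is tiny
    dmso = np.array([5.0, 5.0 + 1e-13, 5.0 - 1e-13, 5.0 + 2e-13, 5.0 - 2e-13])
    median = np.median(dmso)
    mad = np.median(np.abs(dmso - median))  # ~1e-13

    # MAD-robustize divides by the MAD, so a near-zero denominator
    # inflates the normalized value by many orders of magnitude
    treated = 5.3
    print((treated - median) / mad)  # ~3e12, the kind of blow-up described above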

  1. Add drop_outliers to https://github.com/broadinstitute/lincs-cell-painting/blob/master/profiles/profile_cells.py (see the sketch below)
  2. Reprocess
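
A sketch of step 1, assuming pycytominer's feature_select supports a drop_outliers operation with an outlier_cutoff argument (as in the pycytominer version linked above); names here are illustrative, not the exact profile_cells.py code:

    from pycytominer import feature_select

    # hypothetical call; `normalized_df` stands in for the normalized profiles
    selected_df = feature_select(
        profiles=normalized_df,
        operation=["variance_threshold", "correlation_threshold", "drop_outliers"],
        outlier_cutoff=100,
    )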
gwaybio commented 3 years ago

Nice - I don't think it's worth doing before the first data freeze (see #63).

But it is definitely worth noting which features this impacts - @michaelbornholdt do you have this info? Is it only those three features?

I can add a prominent note in a README in #63 to make sure these are dropped in all downstream analyses.

michaelbornholdt commented 3 years ago

@gwaygenomics Here are the features with values above 200:

[image: list of features exceeding the 200 cutoff]

So just to be sure: I won't change the pipeline, but will just locally delete these features so I can carry on with my analysis. Correct? @shntnu

shntnu commented 3 years ago

> I won't change the pipeline, but will just locally delete these features so I can carry on with my analysis. Correct?

yes

shntnu commented 3 years ago

> I don't think it's worth doing before the first data freeze

yes

gwaybio commented 3 years ago

@shntnu - it turns out that I can very easily update the #63 consensus and spherized profiles to add drop_outliers without having to rerun everything.

@michaelbornholdt do you recommend using 200 as a cutoff? I currently use 60 in the spherized profiles, but I'd be happy to update to 200 if you have a data-driven rationale.

michaelbornholdt commented 3 years ago

I can try several cutoffs and look at the precision/recall - or do you have a better idea for deciding the threshold?

gwaybio commented 3 years ago

That sounds good to me. What specifically will you try? Altering `outlier_num` in `(np.abs(df[feature]) > outlier_num)`?
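
For example, a minimal sweep skeleton over candidate cutoffs (hypothetical; `df` here holds only the feature columns), before scoring each cutoff with precision at k:

    import numpy as np

    for outlier_num in [60, 100, 200, 500, 1000, 10000]:
        # count features whose absolute value exceeds the cutoff in any sample
        n_dropped = (np.abs(df) > outlier_num).any().sum()
        print("cutoff {}: would drop {} features".format(outlier_num, n_dropped))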

michaelbornholdt commented 3 years ago

The following is the precision at k = 5 for different threshold values. It looks like anything from 100 to 500 is a sensible value to use:

| threshold | precision at k = 5 |
|----------:|-------------------:|
| 60 | 0.776667 |
| 100 | 0.786667 |
| 200 | 0.780000 |
| 500 | 0.783333 |
| 1000 | 0.783333 |
| 10000 | 0.733333 |

michaelbornholdt commented 3 years ago

I haven't worked with the outlier functionality and will need to get my head around that part of the pipeline first. For now I just wrote my own function to drop the columns with high values.

gwaybio commented 3 years ago

Awesome, thanks Michael!

gwaybio commented 3 years ago

The pycytominer drop outlier strategy is simple:

https://github.com/cytomining/pycytominer/blob/a04397d9cd7e25828d2f24f986a3386a79e6193d/pycytominer/cyto_utils/features.py#L141-L143

Based on your code screenshot in https://github.com/broadinstitute/lincs-cell-painting/issues/65#issuecomment-823613107, I think you're doing something very similar, if not exactly the same.
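
A paraphrase of what the linked lines do (not the verbatim pycytominer source; `df` holding only feature columns and `outlier_cutoff` as the threshold are assumed):

    # flag every feature whose absolute value exceeds the cutoff in any sample
    outlier_features = df.columns[(df.abs() > outlier_cutoff).any()].tolist()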

michaelbornholdt commented 3 years ago

Can you update the consensus files then, so that people don't run into the same problem?

gwaybio commented 3 years ago

Yep, that is the plan in #63. I'll use 100 for the threshold.

gwaybio commented 3 years ago

Alright, I tried 100 (and then bumped it up to 200). I remember now why I didn't originally do this!

Setting the threshold to 200 keeps only 15 features in one of the normalization schemes 😬

How about we use your approach instead (somehow it must be different)? Can you create a .txt file with the column header outlier_features and each of the features in that screenshot as independent rows? I can easily remove them that way.

michaelbornholdt commented 3 years ago

This is what I am using.

For a threshold of 100, this drops 32 features. Do you want me to send those to you, then?

import numpy as np


def drop_bad_feats(df_old, features_old, threshold):
    # collect every feature whose absolute value exceeds the threshold
    # in at least one sample
    drop_features = []
    for feat in features_old:
        if (np.abs(df_old[feat]) > threshold).any():
            drop_features.append(feat)
    df_out = df_old.drop(drop_features, axis="columns")
    print("dropped {} features".format(len(drop_features)))
    return df_out
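
For example (hypothetical file name; Cell Painting metadata columns conventionally carry a Metadata_ prefix):

    import pandas as pd

    profiles = pd.read_csv("consensus_profiles.csv")  # hypothetical input
    feature_cols = [c for c in profiles.columns if not c.startswith("Metadata_")]
    profiles = drop_bad_feats(profiles, feature_cols, threshold=100)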
gwaybio commented 3 years ago

Yes, that would be great.

> Can you create a .txt file with the column header outlier_features and each of the features in that screenshot as independent rows?

There is a pycytominer function to drop custom columns; I'll just need to be careful with the documentation. A sketch of the removal step is below.
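
A minimal sketch of that removal step in plain pandas (not the pycytominer helper), assuming listfile.txt has the single column header outlier_features:

    import pandas as pd

    outlier_features = pd.read_csv("listfile.txt")["outlier_features"].tolist()
    profiles = profiles.drop(columns=outlier_features, errors="ignore")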

michaelbornholdt commented 3 years ago

listfile.txt

Voila
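
(For reference, one way such a file could be written from the `drop_features` list in the function above - hypothetical, since the attachment's exact contents aren't reproduced here:)

    import pandas as pd

    pd.DataFrame({"outlier_features": drop_features}).to_csv("listfile.txt", index=False)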

gwaybio commented 3 years ago

Addressed in #63 - thanks everyone!