Closed shntnu closed 3 years ago
Nice - I don't think it's worth doing before the first data freeze (see #63)
But it is definitely worth noting which features this impacts - @michaelbornholdt do you have this info? Are they only the three features?
I can add a prominent note to a README in #63 to make sure these are dropped in all downstream analyses
@gwaygenomics Here are the features that have values higher than 200:
So just to be sure, I will not do anything to the pipeline but just locally delete these features so I can carry on with my analysis. Correct? @shntnu
> will not do anything to the pipeline but just locally delete these features so I can carry on with my analysis. Correct?
yes
> I don't think it's worth doing before the first data freeze
yes
@shntnu - it turns out that I can very easily update the #63 consensus and spherized profiles to add `drop_outliers` without having to rerun everything.
@michaelbornholdt do you recommend using 200 as a cutoff? I use 60 currently in spherized profiles, but I'd be happy to update to 200 if you have any data-driven rationale
I can try several cutoff values and look at the precision/recall - or do you guys have a better idea for deciding the threshold?
That sounds good to me. What specifically will you try? Altering `outlier_num` in `(np.abs(df[feature]) > outlier_num)`?
The following is the precision at k = 5 for different threshold values. It looks like anything from 100-500 is a sensible value to use:

| threshold | precision |
|----------:|----------:|
| 60 | 0.776667 |
| 100 | 0.786667 |
| 200 | 0.780000 |
| 500 | 0.783333 |
| 1000 | 0.783333 |
| 10000 | 0.733333 |
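The scan described above can be sketched roughly as follows. This is a toy illustration under stated assumptions - `features_dropped_per_threshold` and the data frame are made up here, and the real evaluation additionally computes precision at k = 5 on the profiles:

```python
import pandas as pd

def features_dropped_per_threshold(df, features, thresholds):
    """For each candidate cutoff, count features with any |value| above it."""
    counts = {}
    for t in thresholds:
        # A feature is dropped if ANY sample's absolute value exceeds t
        dropped = [f for f in features if (df[f].abs() > t).any()]
        counts[t] = len(dropped)
    return counts

# Toy profiles: feat_a contains one extreme value
df = pd.DataFrame({"feat_a": [1.0, -3.0, 250.0], "feat_b": [0.5, 2.0, -1.5]})
print(features_dropped_per_threshold(df, ["feat_a", "feat_b"], [100, 500]))
# {100: 1, 500: 0}
```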
I haven't worked with the outlier functionality - I'll need to get my head around that part of the pipeline first. I just wrote my own function to drop the columns with the high values.
awesome, thanks Michael!
The pycytominer drop outlier strategy is simple:
Based on your code screenshot in https://github.com/broadinstitute/lincs-cell-painting/issues/65#issuecomment-823613107 I think you're doing something very similar, if not exactly the same.
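For reference, that threshold-based drop can be written in a couple of vectorized pandas lines. This is a sketch of the idea only - not pycytominer's actual implementation - and all names here are illustrative:

```python
import pandas as pd

def drop_outlier_features(df, feature_cols, outlier_cutoff):
    """Drop every feature column whose max absolute value exceeds the cutoff."""
    flagged = df[feature_cols].abs().max() > outlier_cutoff
    return df.drop(columns=flagged[flagged].index.tolist())

# f1 contains an extreme value and gets dropped; metadata columns are untouched
df = pd.DataFrame({"f1": [1.0, 900.0], "f2": [0.1, 0.2], "Metadata_well": ["A01", "A02"]})
print(drop_outlier_features(df, ["f1", "f2"], outlier_cutoff=500).columns.tolist())
# ['f2', 'Metadata_well']
```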
Can you update the files in the consensus then so that people don't run into the same problems?
Yep, that is the plan in #63. I'll use 100 for the threshold.
alright, I tried 100 (and then bumped it up to 200). I remember now why I didn't originally do this!
Setting the threshold to 200 keeps only 15 features in one of the normalization schemes 😬
How about we use your approach instead (somehow it must be different)? Can you create a .txt file with a column header `outlier_features`, and each of those features in that screenshot as independent rows? I can easily remove them this way.
This is what I am using. For a threshold of 100, this drops 32 features. Do you want me to send those to you then?
```python
import numpy as np

def drop_bad_feats(df_old, features_old, threshold):
    """Drop every feature column containing any value with |value| > threshold."""
    drop_features = []
    for feat in features_old:
        if (np.abs(df_old[feat]) > threshold).any():
            drop_features.append(feat)
    df_out = df_old.drop(drop_features, axis="columns")
    print("dropped {} features".format(len(drop_features)))
    return df_out
```
yes, that would be great.
> Can you create a .txt file with a column header: outlier_features and each of those features in that screenshot as independent rows.
There is a pycytominer function to drop custom columns - I'll just need to be careful with documentation.
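A sketch of how that `outlier_features` .txt file could be consumed with plain pandas - the feature names are made up for illustration, and `io.StringIO` stands in for the real file:

```python
import io
import pandas as pd

# The proposed file: one "outlier_features" header, one feature name per row
outlier_file = io.StringIO(
    "outlier_features\nCells_AreaShape_Zernike_0_0\nNuclei_Texture_Foo\n"
)
outlier_features = pd.read_csv(outlier_file)["outlier_features"].tolist()

profiles = pd.DataFrame({
    "Metadata_broad_sample": ["cmpd_a"],
    "Cells_AreaShape_Zernike_0_0": [1.2],
    "Nuclei_Texture_Foo": [300.5],
    "Cells_Intensity_Bar": [0.7],
})
# errors="ignore" tolerates blocklisted features absent from a given profile set
profiles = profiles.drop(columns=outlier_features, errors="ignore")
print(profiles.columns.tolist())
# ['Metadata_broad_sample', 'Cells_Intensity_Bar']
```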
Voila
Addressed in #63 - thanks everyone!
MB said:
> This is almost definitely because the MAD of these features is zero in DMSO (at least for the plates that those compounds come from).
https://github.com/cytomining/pycytominer/blob/a04397d9cd7e25828d2f24f986a3386a79e6193d/pycytominer/operations/transform.py#L142
`drop_outliers` to https://github.com/broadinstitute/lincs-cell-painting/blob/master/profiles/profile_cells.py