broadinstitute / profiling-resistance-mechanisms

Predicting pharmacodynamic responses to cancer drugs using cell morphology
BSD 3-Clause "New" or "Revised" License

What is the best way to select Profiling Variables? #9

Closed gwaybio closed 4 years ago

gwaybio commented 5 years ago

In working through #1 and #8 I thought about how I would combine data from batches together. My two options seemed to be:

  1. Take feature union
  2. Take feature intersection

The problem with taking the feature intersection is that fewer features are selected. The problem with taking feature union is that some variables may not be "good" features.

What is the method for cell painting variable selection? If the features are removed b/c of lack of consistency across replicates, then they should not be included in the feature union. However, if the features are removed because they are deemed redundant, then they should be included in the feature union.
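To make the trade-off concrete, here is a minimal sketch of the two options, plus a "reason-aware" union that drops any feature removed for replicate inconsistency in at least one batch. The feature names, batch labels, and the `removed` bookkeeping are all invented for illustration; the thread does not say that removal reasons are currently recorded anywhere.

```python
# Hypothetical per-batch selection results (feature names are made up).
selected = {
    "batch_1": {
        "Cells_AreaShape_Area",
        "Nuclei_Intensity_MeanIntensity_DNA",
        "Cytoplasm_Texture_Entropy_RNA_5_0",
    },
    "batch_2": {
        "Cells_AreaShape_Area",
        "Cells_AreaShape_Perimeter",
        "Nuclei_Intensity_MeanIntensity_DNA",
    },
}

# Hypothetical record of why each dropped feature was removed in each batch.
removed = {
    "batch_1": {"Cells_AreaShape_Perimeter": "redundant"},
    "batch_2": {"Cytoplasm_Texture_Entropy_RNA_5_0": "inconsistent"},
}

# Option 1: any feature selected in at least one batch.
feature_union = set.union(*selected.values())

# Option 2: only features selected in every batch.
feature_intersection = set.intersection(*selected.values())

# Features flagged as inconsistent across replicates in any batch.
inconsistent = {
    feature
    for reasons in removed.values()
    for feature, reason in reasons.items()
    if reason == "inconsistent"
}

# Union that keeps "redundant" removals but excludes "inconsistent" ones.
reason_aware_union = feature_union - inconsistent

print(sorted(feature_intersection))
print(sorted(reason_aware_union))
```

With this toy input the intersection keeps two features, the plain union keeps four, and the reason-aware union keeps three (the texture feature is excluded because one batch flagged it as inconsistent).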

@shntnu is there any way to track the decisions for feature selection?

gwaybio commented 5 years ago

I have realized that a lot of this information can be found in the profiling handbook.

shntnu commented 5 years ago

> I have realized that a lot of this information can be found in the profiling handbook.

Yep, albeit terse :)

To actually apply variable selection, we compute the intersection of all these variable lists.
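As a rough illustration of that last step, assuming each selection operation produces its own list of variables to keep, the final set is whatever survives every operation. The operation names and variable names below are placeholders, not the handbook's actual lists.

```python
from functools import reduce

# Hypothetical "variables to keep" lists, one per selection operation.
kept_by_operation = {
    "variance_filter": ["f1", "f2", "f3", "f4"],
    "correlation_filter": ["f1", "f3", "f4"],
    "replicate_consistency": ["f1", "f2", "f4"],
}


def intersect_variable_lists(lists):
    """Return variables present in every list, in the order of the first list."""
    lists = list(lists)
    common = reduce(set.intersection, (set(variables) for variables in lists))
    return [variable for variable in lists[0] if variable in common]


final_variables = intersect_variable_lists(kept_by_operation.values())
print(final_variables)
```

Preserving the order of the first list is only a convenience so that the output is deterministic; a plain set would work equally well.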

shntnu commented 5 years ago

> The problem with taking the feature intersection is that fewer features are selected. The problem with taking feature union is that some variables may not be "good" features.

Our current practice is to do one of the following, both of which have problems:

(1) Use a "standard" variable list that was generated on a much larger dataset, e.g. the drug repurposing set (unpublished) or the CDRP dataset (the selected features are not publicly available, but could easily be made so)

(2) Redo variable selection when combining batches

For now, I would recommend doing (2) until we figure out a better way to do this. I would not recommend union or intersection because either is probably worse than doing (2).
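A minimal sketch of what option (2) could look like: pool the profiles from all batches, then rerun selection on the pooled data. The two filters used here (a variance cutoff and a greedy pairwise-correlation cutoff), their thresholds, and the helper names `combine_batches` and `select_variables` are assumptions made for illustration; they are not the lab's actual selection pipeline. The sketch also assumes every batch measures the same raw features.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def combine_batches(batches):
    """Pool values per feature across batches, keeping features present in all."""
    shared = set.intersection(*(set(batch) for batch in batches))
    return {
        feature: [value for batch in batches for value in batch[feature]]
        for feature in batches[0]
        if feature in shared
    }


def select_variables(profiles, min_variance=1e-3, max_abs_corr=0.9):
    """Drop near-constant features, then greedily drop highly correlated ones."""
    varying = [f for f, values in profiles.items() if pvariance(values) > min_variance]
    selected = []
    for feature in varying:
        if all(
            abs(pearson(profiles[feature], profiles[other])) <= max_abs_corr
            for other in selected
        ):
            selected.append(feature)
    return selected


# Toy batches: feature -> one value per profile (all values are invented).
batch_1 = {
    "feat_a": [1.0, 2.0, 3.0],
    "feat_b": [2.0, 4.0, 6.1],
    "feat_c": [5.0, 5.0, 5.0],
    "feat_d": [1.0, -1.0, 0.5],
}
batch_2 = {
    "feat_a": [4.0, 5.0, 6.0],
    "feat_b": [8.0, 10.1, 12.0],
    "feat_c": [5.0, 5.0, 5.0],
    "feat_d": [-0.5, 1.5, -1.0],
}

# Selection is rerun on the pooled data each time a batch is added.
combined = combine_batches([batch_1, batch_2])
selected = select_variables(combined)
print(selected)
```

In this toy data `feat_c` is constant and `feat_b` is nearly a multiple of `feat_a`, so both are dropped. Because selection runs on the pooled profiles, the selected list can change whenever a new batch is added.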

gwaybio commented 5 years ago

> (2) Redo variable selection when combining batches
>
> For now, I would recommend doing (2) until we figure out a better way to do this. I would not recommend union or intersection because either is probably worse than doing (2).

Gotcha! So this step will be run when each batch of new data comes in?

shntnu commented 5 years ago

> Gotcha! So this step will be run when each batch of new data comes in?

Yep