cognoma / machine-learning

Machine learning for Project Cognoma

What covariates should we include as features? #21

Open dhimmel opened 8 years ago

dhimmel commented 8 years ago

In addition to gene expression, we probably should include other information on samples. This discussion will focus on identifying potential covariates and evaluating whether they make sense to include in models. If we don't include the right covariates, confounding is likely to be an issue.

See #8 as a potential example of confounding that may be addressable by adding a mutation load feature.

dhimmel commented 8 years ago

@stephenshank began work at the last meetup on creating a covariates.tsv from samples.tsv. @stephenshank any updates?

stephenshank commented 8 years ago

My first attempt at this can be found at #46, where I simply try to process the samples data to begin using these as features. Immediate issues are...

  1. Are we comfortable with the way categorical NaNs are handled?
  2. How do we want to treat numeric NaNs? The current pipeline breaks when data contains NaNs.
  3. Standardizing column names... perhaps all as lowercase/underscore, which seems to be consistent?

As a longer-term issue, I was eager to do some actual machine learning, but my naive attempt at using the existing classifier failed. So it would be great to get some discussion going about what tweaks we expect to enhance performance. Now that I've read through more issues, I realize I should've tried this on some of the Hippo pathway genes. But I am worried that munging the features together as I have is not an effective approach.

Of course I would also welcome discussion on anything I may have missed... @dhimmel, @gwaygenomics any thoughts?

dhimmel commented 7 years ago

> Are we comfortable with the way categorical NaNs are handled?

As per my review comment on #46: yes, although imputation is definitely an option here.

> How do we want to treat numeric NaNs? The current pipeline breaks when data contains NaNs.

I know of three options: impute, filter observations, or remove the variable. Since we don't want to start hemorrhaging samples, I don't think we should filter many observations. So maybe we can assess imputation versus removal on a per-variable basis: if a variable can be imputed reasonably, impute it and keep it; otherwise, remove it.
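To make the per-variable idea concrete, here is a minimal sketch, not what #46 does. It assumes a pandas DataFrame of covariates, uses the fraction of missing values as the keep-or-remove criterion, and fills with the median; the `max_missing` threshold, the function name, the toy values, and the `days_survived` and `disease` columns are all hypothetical.

```python
import numpy as np
import pandas as pd


def impute_or_drop(df, max_missing=0.3):
    """Decide per numeric column: impute with the median or drop.

    A column is dropped when its fraction of missing values exceeds
    max_missing (a hypothetical threshold); otherwise its NaNs are
    filled with the column median. No rows are filtered.
    """
    numeric_cols = df.select_dtypes(include="number").columns
    missing_fraction = df[numeric_cols].isna().mean()
    dropped = list(missing_fraction[missing_fraction > max_missing].index)
    kept = df.drop(columns=dropped)
    to_impute = [col for col in numeric_cols if col not in dropped]
    if to_impute:
        kept[to_impute] = kept[to_impute].fillna(kept[to_impute].median())
    return kept, dropped


# Toy data for illustration only; values and most column names are made up.
covariates = pd.DataFrame({
    "age_diagnosed": [61.0, np.nan, 47.0, 70.0],
    "days_survived": [np.nan, np.nan, np.nan, 410.0],
    "disease": ["LUAD", "BRCA", "BRCA", "LUAD"],
})
result, dropped = impute_or_drop(covariates, max_missing=0.5)
```

Median filling is only one possible imputation; the point of the sketch is that every sample is retained and the keep-or-remove decision is made one variable at a time.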

> Standardizing column names... perhaps all as lowercase/underscore, which seems to be consistent?

Personally, I make all_lowercase_underscore_separated variable names. However, I do see a benefit in not messing with Xena names unless we store a reversible mapping. In other words, if using foreign data, it's sometimes better to use dirty column names than to break interoperability. However, lots of these variables have already been changed or recoded in cancer-data, so interoperability is less of a worry for those variables.
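A reversible mapping could look roughly like the sketch below. It assumes a pandas DataFrame; the function name and the example column names are hypothetical and are not actual Xena names.

```python
import re

import pandas as pd


def standardize_columns(df):
    """Rename columns to lowercase_underscore style.

    Returns the renamed frame plus the original -> clean mapping so the
    source names can be restored later.
    """
    mapping = {}
    for original in df.columns:
        # Collapse every run of non-alphanumeric characters into one underscore.
        clean = re.sub(r"[^0-9a-zA-Z]+", "_", str(original)).strip("_").lower()
        mapping[original] = clean
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("cleaned names collide, so the mapping is not reversible")
    return df.rename(columns=mapping), mapping


# Hypothetical column names, for illustration only.
samples = pd.DataFrame({"Sample ID": ["a", "b"], "AGE_at.Diagnosis": [61, 47]})
clean, mapping = standardize_columns(samples)

# Invert the mapping to get the source names back.
reverse = {new: old for old, new in mapping.items()}
restored = clean.rename(columns=reverse)
```

This sketch does not split camelCase names, and it refuses to proceed when two source names collapse to the same cleaned name, since that would make the mapping irreversible.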

stephenshank commented 7 years ago

> Personally, I make all_lowercase_underscore_separated variable names

I'm tempted to do this, just because I consider this part of clean data for developers. It can get frustrating trying to autocomplete variables that you know are there, only to remember that THOSE variables are upper case.

I hope to have some progress on imputing the numeric variables for the next PR. The only strategies I know for this are either 1) filling in the most common values among similar cases, or 2) exploring correlations. For instance, we should be able to impute the missing age_diagnosed with the correlation that you found with the number of mutations, perhaps through a linear regression. Any other strategies are welcome.
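As a rough sketch of the second strategy, the snippet below fits a straight line to the samples where both values are known and uses it to fill in missing `age_diagnosed`. It assumes a pandas DataFrame; the `n_mutations` column name, the function name, and the toy values are hypothetical, and a plain least-squares line from `np.polyfit` stands in for whatever model we end up choosing.

```python
import numpy as np
import pandas as pd


def impute_by_regression(df, target, predictor):
    """Fill missing values of target from a linear fit on predictor."""
    known = df[target].notna() & df[predictor].notna()
    # Least-squares line fitted on rows where both columns are observed.
    slope, intercept = np.polyfit(
        df.loc[known, predictor].to_numpy(dtype=float),
        df.loc[known, target].to_numpy(dtype=float),
        deg=1,
    )
    # Only rows with a missing target and an observed predictor can be filled.
    fillable = df[target].isna() & df[predictor].notna()
    out = df.copy()
    out.loc[fillable, target] = slope * out.loc[fillable, predictor] + intercept
    return out


# Toy data for illustration only; these values are made up and exactly linear.
toy = pd.DataFrame({
    "n_mutations": [10.0, 20.0, 30.0, 40.0],
    "age_diagnosed": [40.0, 50.0, np.nan, 70.0],
})
imputed = impute_by_regression(toy, target="age_diagnosed", predictor="n_mutations")
```

Rows where the predictor is also missing stay as NaN, so they would still need a fallback such as the first strategy.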

stephenshank commented 7 years ago

@dhimmel I made the proposed changes and started to do some exploratory visualization for imputation, which I've pushed for review. Also note that the notebook and the script are in two separate commits, since I forgot to convert before I pushed :grimacing:. Please let me know if there are any more revisions; I am happy to continue working on this.

Also, any suggestions for how to carry out the imputation are welcome... I've made some of my own in the notebook.

dhimmel commented 7 years ago

@stephenshank, let's keep discussion related to PR #46 on the actual pull request. I didn't see your latest two comments until after my most recent review ):

Let's start a new issue for covariate imputation and deal with it in a future pull request.