Closed stephenshank closed 8 years ago
In my previous review #46 should be #47. Oops!
covariates.tsv is < 3 MB, so we should be able to track it. If there is any strong preference for where it should reside, let me know; otherwise I will put it at the top level in the next PR.
Edit: On second thought, I will create a covariates directory with this notebook and put the tsv file there.
I would suggest perhaps making this notebook 1 in this repo and writing covariates.tsv to a data directory.
However, I'm starting to think that this work belongs in cancer-data. @gwaygenomics and @stephenshank, what do you think about adding it to the notebook pipeline in cancer-data?
I don't know if it applies here, but I often replace missing values with predictions from the other variables.
Is this information supposed to be used for mutation classification? If this is the case, perhaps it is better to handle the NaNs at the classification stage since each algorithm may have different preference?
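For anyone following along, replacing missing values with predictions from other variables could look roughly like the sketch below. It is only an illustration: the column names `age_diagnosed` and `n_mutations_log1p`, the helper `impute_from_other_columns`, and the choice of plain linear regression are all assumptions, not what the notebook in this PR does. It also assumes the predictor columns themselves have no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def impute_from_other_columns(df, target, predictors):
    """Fill NaNs in `target` with linear-regression predictions from `predictors`.

    Hypothetical helper: assumes the predictor columns contain no NaNs.
    """
    out = df.copy()
    known = out[target].notna()
    if known.all():
        return out  # nothing to impute
    # Fit only on rows where the target is observed.
    model = LinearRegression().fit(out.loc[known, predictors], out.loc[known, target])
    # Predict the target for the rows where it is missing.
    out.loc[~known, target] = model.predict(out.loc[~known, predictors])
    return out


# Toy data with hypothetical column names (not the real covariates).
toy = pd.DataFrame({
    "n_mutations_log1p": [1.0, 2.0, 3.0, 4.0, 5.0],
    "age_diagnosed": [40.0, 50.0, np.nan, 70.0, 80.0],
})
filled = impute_from_other_columns(toy, "age_diagnosed", ["n_mutations_log1p"])
```

The original frame is left untouched, so the raw values remain available for anyone who prefers a different strategy.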
@dhimmel I was actually thinking the same thing. I would be happy to move this over to cancer-data as the latest notebook, with covariates.tsv being placed in data so that it is tracked. Your suggestion above reduced file size to ~ 1.2 MB. Does this sound okay?
Is this information supposed to be used for mutation classification? If this is the case, perhaps it is better to handle the NaNs at the classification stage since each algorithm may have different preference?
@yl565, are there any sklearn models that handle missing values? What do you mean by "handle the NaNs at the classification stage"?
file size to ~ 1.2 MB. Does this sound okay?
stephenshank, sounds great.
I would be happy to move this over to cancer-data as the latest notebook, with covariates.tsv being placed in data so that it is tracked.
Okay, let's move this to cancer-data. You should open a new PR there that references this pull request.
@dhimmel sklearn.preprocessing.Imputer handles missing values and can be included in the classification pipeline. What I mean is that people doing mutation classification may prefer to process the missing values themselves rather than work with already preprocessed data (at least I do...). There are different ways to deal with the missing data which could interact with classification algorithms differently.
A recommended way to deal with missing values could be set as the default, and it would be nice to have the option to get the raw, unpreprocessed data.
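As a rough sketch of what "handling the NaNs at the classification stage" could mean, the imputer can be the first step of an sklearn pipeline, so each user picks a strategy alongside their model. Note that `sklearn.preprocessing.Imputer` was deprecated and later removed from scikit-learn; the sketch uses its replacement, `sklearn.impute.SimpleImputer`. The toy arrays, the median strategy, and the logistic regression classifier are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy feature matrix with missing values (made up, not the real covariates).
X = np.array([
    [1.0, np.nan],
    [2.0, 0.0],
    [np.nan, 1.0],
    [8.0, 1.0],
    [9.0, np.nan],
    [10.0, 0.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Imputation is fitted inside the pipeline, so under cross-validation the
# medians are learned from the training folds only.
clf = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
clf.fit(X, y)
pred = clf.predict(X)
```

Swapping `strategy="median"` for another strategy, or replacing the imputer step entirely, changes nothing else in the pipeline, which is the main appeal of keeping the data unimputed upstream.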
There are different ways to deal with the missing data which could interact with classification algorithms differently.
Interesting... how about we perform imputation in cancer-data, resulting in two possible covariate datasets: covariates.tsv and covariates-imputed.tsv (which would use a single imputation method we decide on). Then machine learning users would be free to either select the imputed covariates or perform imputation themselves.
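A minimal sketch of the two-file idea, assuming per-column median imputation as a stand-in for whatever method ends up being chosen. The sample IDs, column names, and the temporary output directory are placeholders for illustration; in cancer-data the files would go wherever the notebook pipeline writes its data.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy covariates with hypothetical sample IDs and column names.
covariates = pd.DataFrame(
    {
        "age_diagnosed": [40.0, np.nan, 60.0],
        "n_mutations_log1p": [1.5, 2.5, np.nan],
    },
    index=pd.Index(["sample-a", "sample-b", "sample-c"], name="sample_id"),
)

# Placeholder method: fill each column's NaNs with that column's median.
imputed = covariates.fillna(covariates.median())

# Write both versions side by side; a temporary directory stands in for data/.
data_dir = Path(tempfile.mkdtemp())
covariates.to_csv(data_dir / "covariates.tsv", sep="\t")
imputed.to_csv(data_dir / "covariates-imputed.tsv", sep="\t")
```

Missing values are written as empty fields in covariates.tsv, so `pd.read_csv(..., sep="\t")` reads them back as NaN.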
That would be nice!
Really nice analysis! I thought the age vs. mutation burden by tissue is very interesting... However, not sure it would be safe to impute age this way.
In general, my thoughts on imputation are as follows:
That being said, it could still be an interesting exercise and research question to try to impute all missing info.
All feedback is welcome. Also going to get some discussion going in #21.