Defining features and labels

brankaj commented 8 years ago

This issue is a follow-up to the results obtained for different genes in #52. It is still not clear why a few oncogenes produced such bad results. Before analyzing the genes themselves, I was puzzled by one thing in the code.

If we want to run the classifier for a different gene, the only part that currently changes is y, i.e., the vector of labels y=Y[GENE]. Matrix X, which contains our feature values, remains the same. This means that one set of feature values can belong to class '0' in one iteration, while in another iteration the same set is denoted as class '1'. Even though each iteration corresponds to a different gene, the classifier sees it as just another combination of '0' and '1' for which a model has to be built.

If the matrix X is static, i.e., its values are completely reliable, I guess the main question is how reliable are the labels given in matrix Y and would it be possible to measure that reliability.

dhimmel commented 8 years ago

It is still not clear why a few oncogenes produced such bad results.

I'm happy to see mediocre results for modeling some mutations. With gene expression, universally positive results are usually a good indication that you're overlooking something. @gwaygenomics or @cgreene would know better, but here's why it may be totally acceptable that a mutation doesn't have an expression signature:

the gene doesn't do much so whether it's mutated or not doesn't affect cellular function.
the mutations are mostly passenger mutations rather than driver mutations. Basically, they're along for the ride, but aren't in the driver's seat.
our mutation measure isn't fine grained enough to be biologically meaningful. See https://github.com/cognoma/cancer-data/issues/15

I think @cgreene suspects most mutations will be difficult to classify. The ones that classify well are truly special and may point to eventual therapeutic targets.

X, which contains our feature values, remains the same

Yes, good observation. In practice X could be subsetted if a user selects only samples with a certain cancer. But if you're using all samples, X will be the same for every model.

This means that one set of feature values can belong to class '0' in one iteration, while in another iteration the same set is denoted as class '1'.

This is expected and makes sense to me. When you change your mutation, you are asking a different question that requires a different model.
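
A minimal sketch of that setup, assuming scikit-learn and synthetic data. The gene names, the matrix sizes, and the use of LogisticRegression are placeholders for illustration, not the project's actual pipeline, and Y is a plain dict here rather than the real label matrix. X stays fixed while each gene contributes its own label vector and gets its own model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for the expression matrix: 200 samples x 5 features.
# The same X is shared by every model.
X = rng.normal(size=(200, 5))

# Hypothetical label vectors, one per gene (1 = mutated, 0 = not mutated).
Y = {
    # Mutation status tied to feature 0, so an expression signature exists.
    "GENE_A": (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int),
    # Mutation status unrelated to X, so no signature exists.
    "GENE_B": rng.integers(0, 2, size=200),
}

models = {}
for gene, y in Y.items():
    # Same X every time; only the label vector changes.
    models[gene] = LogisticRegression().fit(X, y)
    print(gene, "training accuracy:", round(models[gene].score(X, y), 2))
```

The same row of X can carry label 0 for one gene and label 1 for another without conflict, because the two models never see each other's labels. The second gene is expected to score near chance, which mirrors the case of a mutation with no expression signature.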

I guess the main question is how reliable are the labels given in matrix Y and would it be possible to measure that reliability.

Regarding the reliability of Y, I sort of feel like this is an upstream issue. We take what we get. However, the raw Xena mutation data does contain some sequencing replicates, which could allow you to estimate the sequencing fidelity. My impression is that the mutation calling is decent, right @gwaygenomics?

brankaj commented 8 years ago

Thank you for your comments. I had misconceptions regarding the use of matrix X. Regarding the sequencing replicates, if I understood correctly, this would correspond to the multi-label issue that I mentioned earlier. Unfortunately, I am not aware of many papers that deal with it. The common approach is majority voting.
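
For reference, majority voting over replicate labels could look roughly like the sketch below. The helper name `majority_label` and the choice to return None on a tie (or on empty input) are assumptions made for illustration:

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label, or None if there is no clear winner."""
    top_two = Counter(labels).most_common(2)
    if not top_two:
        return None  # no replicates at all
    if len(top_two) == 2 and top_two[0][1] == top_two[1][1]:
        return None  # the two most common labels are tied
    return top_two[0][0]


# Hypothetical replicate calls for one gene in one sample (1 = mutated).
replicate_calls = [1, 1, 0]
consensus = majority_label(replicate_calls)
```

With only two replicates per sample, any disagreement is a tie, so the vote cannot resolve it and the sample would have to be dropped or flagged.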

dhimmel commented 8 years ago

According to sklearn's docs:

Multilabel classification assigns to each sample a set of target labels. This can be thought as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these.

I'm not sure how multilabel classification would fit in with sequencing replicates. I was thinking that the replicates would be most useful as a way of examining the reliability of the sequencing. Do two independent sequencing runs of the same sample yield the same mutations?
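
One rough way to put a number on that question is a per-sample concordance score between the two runs. The sketch below uses the Jaccard index on sets of mutated genes; the sample IDs and gene sets are made up, and the real Xena replicate format may differ:

```python
def jaccard(a, b):
    """Fraction of mutated genes shared by two runs (1.0 = identical calls)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # both runs agree that nothing is mutated
    return len(a & b) / len(a | b)


# Hypothetical replicate calls: sample ID -> genes called as mutated.
run_1 = {"S1": {"TP53", "KRAS"}, "S2": {"PIK3CA"}}
run_2 = {"S1": {"TP53", "KRAS"}, "S2": {"PIK3CA", "TP53"}}

concordance = {sample: jaccard(run_1[sample], run_2[sample]) for sample in run_1}
mean_concordance = sum(concordance.values()) / len(concordance)
```

A mean close to 1.0 would suggest the labels in Y are reproducible; a low value would point to noise upstream of the classifier.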

To keep things simple, we probably don't want to venture into multilabel classification, ... but we could fit a model where each mutation was a separate "label". Maybe there would be certain benefits to fitting all models together, but I'm not sure.
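
As a sketch of what "each mutation as a separate label" might look like, assuming scikit-learn and synthetic data (the two label columns stand in for two hypothetical genes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Hypothetical expression matrix: 200 samples x 5 features.
X = rng.normal(size=(200, 5))

# Hypothetical label matrix: one binary column per gene.
Y = np.column_stack([
    (X[:, 0] > 0).astype(int),  # gene 1 status depends on feature 0
    (X[:, 1] > 0).astype(int),  # gene 2 status depends on feature 1
])

# One call to fit handles every label column.
clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X[:3])  # one row per sample, one column per gene
```

Note that MultiOutputClassifier fits one independent estimator per column, so this is a convenience wrapper rather than joint learning. Any benefit from fitting the models together would need a method that actually shares information across labels.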

cgreene commented 8 years ago

Agree as a future interest. Transfer learning approaches should be very well suited here (and transferability is also interesting scientifically). Not sure how much we want to dig in at this time.

gwaybio commented 8 years ago

My impression is that the mutation calling is decent, right @gwaygenomics?

These are state-of-the-art exome mutation calls. Right now, six mutation-calling algorithms are applied to the data, and variants are removed if they are called only once.
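
A minimal sketch of that kind of filter, assuming "called only once" means a variant reported by a single caller. The caller names, the variant IDs, and the `min_callers` parameter are hypothetical:

```python
from collections import defaultdict


def filter_variants(calls, min_callers=2):
    """Keep variants reported by at least `min_callers` distinct callers."""
    callers_by_variant = defaultdict(set)
    for caller, variant in calls:
        callers_by_variant[variant].add(caller)
    return {
        variant
        for variant, callers in callers_by_variant.items()
        if len(callers) >= min_callers
    }


# Hypothetical raw calls as (caller, variant) pairs.
calls = [
    ("caller_1", "variant_A"),
    ("caller_2", "variant_A"),
    ("caller_3", "variant_A"),
    ("caller_1", "variant_B"),  # reported by one caller only, so dropped
]

kept = filter_variants(calls)
```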

cognoma / machine-learning

Defining features and labels #59