cognoma / frontend

Frontend for Project Cognoma
http://cognoma.org/

Jobs fail due to too few positive samples #102

Closed rdvelazquez closed 7 years ago

rdvelazquez commented 7 years ago

Can the frontend catch queries with too few positive samples and tell the user that the query can't be submitted because there aren't enough positive samples? @bdolly and @dcgoss: is this doable?

There are two motivating factors for this:

  1. Some queries fail because there are too few positive samples (samples with a mutation in at least one of the selected diseases). This obviously happens when there are no mutated samples, but it also happens when there are just too few. As an example, I tried a query with NF1 {entrez id 4763} and Glioblastoma {acronym GBM}, which has 152 total samples and 9 mutated samples, and got the following error email:
    
    An error has occurred and your classifier could not be processed.
    Error: An error occurred while executing the following cell:
    ------------------
    y_pred_dict = {
        model: {
            'train': pipeline.decision_function(X_train),
            'test':  pipeline.decision_function(X_test)
        } for model, pipeline in cv_pipelines.items()
    }

    def get_threshold_metrics(y_true, y_pred):
        roc_columns = ['fpr', 'tpr', 'threshold']
        roc_items = zip(roc_columns, roc_curve(y_true, y_pred))
        roc_df = pd.DataFrame.from_items(roc_items)
        auroc = roc_auc_score(y_true, y_pred)
        return {'auroc': auroc, 'roc_df': roc_df}

    metrics_dict = {
        model: {
            'train': get_threshold_metrics(y_train, y_pred_dict[model]['train']),
            'test': get_threshold_metrics(y_test, y_pred_dict[model]['test'])
        } for model in y_pred_dict.keys()
    }

    ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
    Support is available at https://github.com/cognoma.


We could make this error go away for queries with at least 5 mutations by making a few edits to the notebook (such as increasing the test fraction of the train/test split from 10% to 20%), but there is another reason to limit queries with few positives (item 2 below).
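To make the 10% versus 20% reasoning concrete, here is a rough sketch of the arithmetic. It assumes the split places positives in the test set roughly in proportion to the test fraction (as a stratified split would); the function name is hypothetical and not taken from the notebook.

```python
def expected_test_positives(n_positives, test_percent):
    """Expected positives in the test set under a proportional (stratified) split.

    Assumption: positives land in the test set in proportion to its size.
    A real split rounds to whole samples, so treat this as a rough guide.
    """
    return n_positives * test_percent / 100


# The NF1 / GBM query from this issue: 9 mutated samples.
print(expected_test_positives(9, 10))  # fewer than one positive on average
print(expected_test_positives(9, 20))

# Five positives with a 20% test split average one positive in the test set.
print(expected_test_positives(5, 20))
```

When the expected count drops below one, the test set can end up with no positives at all, which is the situation `roc_auc_score` rejects with the "Only one class present in y_true" error.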

  2. As currently set up, the notebook has trouble handling queries with a low number of positives. Some things could improve this, such as using `StratifiedShuffleSplit` with a large number of splits, as @htcai did in [#71](https://github.com/cognoma/machine-learning/pull/71), but I don't think we are planning to implement that right now. So we would like to set a lower bound on the number of positives in the queries that cognoma will process.
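For anyone unfamiliar with `StratifiedShuffleSplit`, here is a minimal sketch of the idea on toy labels that mirror the NF1 / GBM query (152 samples, 9 positives). The number of splits, the test size, and the placeholder features are assumptions chosen for illustration; this is not the code from #71.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels: 9 positives and 143 negatives (assumed layout, for illustration).
y = np.array([1] * 9 + [0] * 143)
X = np.zeros((len(y), 1))  # placeholder features; only y drives stratification

# Many random splits, each preserving the class ratio as closely as possible.
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

test_positive_counts = [
    int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)
]
print(min(test_positive_counts), max(test_positive_counts))
```

Because every split is stratified, each test set keeps some positives, and averaging a metric over many splits smooths out the noise that comes from having so few of them.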

The numbers discussed at last night's meetup (8/15) for the lower bound on the number of positives were somewhere between 20 and 50, and it seemed like we were leaning toward the higher end of this range. One downside of setting the lower bound too high is that it will limit the queries users can request. Of the ~20,000 genes, ~8,000 do not have more than 20 samples with mutations in the data set, and ~16,000 do not have more than 50.
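As a rough sketch of the check being asked for, the logic could look something like the following. It is written in Python to match the notebook code quoted above, not the frontend's own language; `MIN_POSITIVES`, `check_query`, and `excluded_share` are hypothetical names, and the threshold is a placeholder until one is agreed on.

```python
MIN_POSITIVES = 20  # hypothetical value; the thread discusses 20 to 50


def check_query(n_positives, min_positives=MIN_POSITIVES):
    """Return (ok, message) for a query with the given number of positive samples."""
    if n_positives < min_positives:
        return False, (
            f"This query has {n_positives} positive samples; "
            f"at least {min_positives} are required to submit it."
        )
    return True, ""


def excluded_share(n_excluded_genes, n_total_genes):
    """Fraction of genes that a given lower bound would rule out."""
    return n_excluded_genes / n_total_genes


ok, message = check_query(9)  # the NF1 / GBM query: 9 mutated samples
print(ok, message)
print(check_query(60, min_positives=50))

# Approximate gene counts quoted above.
print(excluded_share(8_000, 20_000))   # bound near 20
print(excluded_share(16_000, 20_000))  # bound near 50
```

By these approximate counts, a bound near 20 would rule out about 40% of genes and a bound near 50 about 80%, which is the trade-off to weigh when picking the threshold.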

rdvelazquez commented 7 years ago

Closed by #104