cognoma / machine-learning

Machine learning for Project Cognoma

Machine Learning Punch List for Launch #110

Closed rdvelazquez closed 7 years ago

rdvelazquez commented 7 years ago

Here's the general punch list that we discussed at tonight's meetup for getting the machine learning part of Cognoma launch-ready.

To be completed at a later date: templating for Jupyter notebooks (@wisygig)

patrick-miller commented 7 years ago

Once we are happy with the final notebook, we should do a final cleanup (there is still a warning in the import section), and then I will open a PR into mlworkers.

patrick-miller commented 7 years ago

@rdvelazquez @dhimmel @wisygig

Because the launch party is coming up, I was going to open a PR to push the notebook to production this week/weekend. I think Ryan has made very good progress, so our focus should probably be on making sure that his changes work on our AWS instances.

dhimmel commented 7 years ago

Because the launch party is coming up, I was going to open a PR to push the notebook to production this week/weekend

I agree & fully support!

rdvelazquez commented 7 years ago

I was going to open a PR to push the notebook to production this week/weekend.

Sounds good to me!

patrick-miller commented 7 years ago

Using the updated notebook and a small sample set, I got reasonable results back in production. We are getting the following warning for a pandas import:

[screenshot: pandas import warning]

I'm going to run a larger query to see if we run into memory issues.

patrick-miller commented 7 years ago

The larger query for TP53 as the gene and all diseases selected runs into a memory error.

@dhimmel @rdvelazquez

dhimmel commented 7 years ago

The larger query for TP53 as the gene and all diseases selected runs into a memory error.

Okay @kurtwheeler and I will discuss our options tomorrow and let you know what we're thinking.

patrick-miller commented 7 years ago

Any updates on this? Should we reduce the hyperparameter search space? We could probably cut it in half by stepping alpha every 0.2 instead of every 0.1.
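
Roughly what I have in mind (a sketch only; the bounds are placeholders, and I'm assuming alpha is built from a numpy range of exponents):

```python
import numpy as np

# Illustrative only: if alpha is currently generated with an exponent step of 0.1,
# doubling the step to 0.2 roughly halves the number of candidates.
alphas_current = 10 ** np.arange(-3, 1.01, 0.1)
alphas_reduced = 10 ** np.arange(-3, 1.01, 0.2)
print(len(alphas_current), len(alphas_reduced))  # roughly 41 vs. 21 candidates
```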

dhimmel commented 7 years ago

Any updates on this?

I'm hoping to get to this this afternoon. What I'm thinking is to run the pathological query (TP53 as the gene and all diseases selected) locally and see how much memory is consumed. I would use this technique, which @yl565 previously implemented.

This will let us know whether our AWS image size is too small or there is another issue where memory isn't fully allocated.

Should we reduce the hyperparameter search space?

Does this affect memory usage now that we're using dask-searchcv?

patrick-miller commented 7 years ago

Does this affect memory usage now that we're using dask-searchcv?

Oh, I'm not sure; I just assumed that's where the memory burden still was. I think @rdvelazquez would have a better answer.

rdvelazquez commented 7 years ago

I assume that the hyperparameter space affects memory usage; I'm not positive, but I don't see how it wouldn't. I also assume that the number of PCA components has a much greater effect on memory than alpha_range, so we may also need to evaluate the number of PCA components.

Once we have the notebook code set up to be profiled, we can easily adjust the hyperparameter space and re-profile to quantify how changes to it affect memory (if at all).
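
For concreteness, here's roughly how both knobs enter the grid. This is a sketch, not the notebook's exact code: I'm assuming a PCA → SGDClassifier pipeline searched with dask-searchcv, and the values are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from dask_searchcv import GridSearchCV

# Illustrative pipeline; the real notebook may differ (e.g. it may also standardize features).
pipeline = Pipeline([
    ('pca', PCA()),
    ('classify', SGDClassifier(loss='log', penalty='elasticnet')),
])

param_grid = {
    # Each extra n_components value multiplies the number of fits *and* changes
    # the size of the transformed matrices held in memory during the search.
    'pca__n_components': [30, 60, 90],
    # Each extra alpha value only multiplies the number of fits.
    'classify__alpha': 10 ** np.arange(-3, 1.01, 0.2),
}

search = GridSearchCV(pipeline, param_grid, n_jobs=1)
```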

dhimmel commented 7 years ago

So I installed memory profiler (pip install memory_profiler) and then used the %%memit notebook magic. Here's the HTML export of the notebook: 2.mutation-classifier-1-job.html.txt. Reading the files consumed 4.5 GB, which increased to 6.5 GB after making the training/testing dataset. Fitting the default models peaked at 11.7 GB (using a rough 1000-to-1 mebibytes-to-gigabytes conversion, which is slightly off). So @kurtwheeler, it looks like we need to upgrade the AWS instance size and limit the workers to 1 job at a time.
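
For anyone who wants to reproduce the measurement, the setup looked roughly like this (the file path is a placeholder; the exported notebook above has the actual cells):

```python
# Cell 1: imports plus the memory_profiler extension (pip install memory_profiler).
import pandas as pd
%load_ext memory_profiler
```

```python
%%memit
# %%memit reports the peak memory consumed while this cell runs.
expression_df = pd.read_table('download/expression-matrix.tsv.bz2', index_col=0)  # placeholder path
```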

Increasing to n_jobs=4 raised peak memory to ~16 GB and produced this warning repeatedly:

/home/dhimmel/anaconda3/envs/cognoma-machine-learning/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1
  **self._backend_args)

So we can stick with n_jobs=1 for now.

kurtwheeler commented 7 years ago

I've changed the cognoma EC2 instances from m4.large to r4.large, which increased the RAM from 8 GB to 15.25 GB.

rdvelazquez commented 7 years ago

FYI: I'm getting an error on cognoma.org when I try to search for diseases: "Failed to load diseases." appears in a pink bar across the top.

dhimmel commented 7 years ago

Me too! https://api.cognoma.org/diseases/ returns a 503 code.

Failed to load resource: the server responded with a status of 503 (Service Unavailable: Back-end server is at capacity)
disease-type:1 Failed to load https://api.cognoma.org/diseases/: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://cognoma.org' is therefore not allowed access. The response had HTTP status code 503.
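
The 503 is reproducible outside the browser too; here's a quick check with the requests library, independent of the CORS layer:

```python
import requests

# Quick server-side check of the API endpoint from the console log above.
response = requests.get('https://api.cognoma.org/diseases/')
print(response.status_code)  # currently returns 503 (Service Unavailable)
```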

@kurtwheeler and I will look into what failed tomorrow!

rdvelazquez commented 7 years ago

Sounds good. If the issue seems difficult to track down or fix, you could consider reverting #9 and just changing the EC2 size for now. The other changes from #9 could then be troubleshot after the launch party... with a little less pressure :wink:

dhimmel commented 7 years ago

@rdvelazquez https://api.cognoma.org should now be back up. @kurtwheeler fixed it this morning. We had changed the instance type, but had not destroyed and recreated the instances (which ECS apparently requires).

patrick-miller commented 7 years ago

Awesome! I can confirm that it works on TP53 with all diseases included:

[screenshot: notebook results for the TP53 / all-diseases query]

rdvelazquez commented 7 years ago

It's pretty fast too!