Once we are happy with the final notebook, we should do a final cleanup (there is still a warning in the import section) and then I will do a PR into mlworkers.
@rdvelazquez @dhimmel @wisygig
Because the launch party is coming up, I was going to open a PR to push the notebook to production this week/weekend. I think Ryan has made very good progress, and our focus should probably be on making sure that his changes work on our AWS instances.
> Because the launch party is coming up, I was going to open a PR to push the notebook to production this week/weekend.
I agree & fully support!
> I was going to open a PR to push the notebook to production this week/weekend.
Sounds good to me!
Using the updated notebook and a small sample set, I got reasonable results back in production. We are getting the following warning for a pandas import:
I'm going to run a larger query, to see if we get memory issues.
The larger query for TP53 as the gene and all diseases selected runs into a memory error.
@dhimmel @rdvelazquez
> The larger query for TP53 as the gene and all diseases selected runs into a memory error.
Okay @kurtwheeler and I will discuss our options tomorrow and let you know what we're thinking.
Any updates on this? Should we reduce the hyperparameter search space? We could probably cut it in half by stepping alpha by 0.2 instead of 0.1.
> Any updates on this?
I'm hoping to get to this this afternoon. What I'm thinking is to run the pathological query (TP53 as the gene and all diseases selected) locally and see how much memory is consumed. I would use this technique, which @yl565 previously implemented.
This will let us know whether our AWS image size is too small or whether there is another issue where memory isn't fully allocated.
> Should we reduce the hyperparameter search space?
Does this affect memory usage now that we're using `dask-searchcv`?
> Does this affect memory usage now that we're using `dask-searchcv`?
Oh, I'm not sure, I just assumed that's where the memory burden still was. I think @rdvelazquez would have a better answer.
I assume that the hyperparameter space affects memory usage. I'm not positive, but I don't see how it wouldn't. I also assume that the number of PCA components has a much greater effect on memory than `alpha_range`, so we may also need to evaluate the number of PCA components.
Once we have the notebook code set up to be profiled, we can easily adjust the hyperparameter space and re-profile to quantify how changes to the hyperparameter space affect memory (if at all).
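For illustration, here is a minimal sketch of what shrinking the search space could look like with `dask-searchcv`. The pipeline steps, parameter names, and grid values below are placeholders I made up for this example, not the notebook's actual settings:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from dask_searchcv import GridSearchCV  # drop-in replacement for sklearn's GridSearchCV

# Hypothetical pipeline; step names and grids are illustrative only.
pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('pca', PCA()),
    ('classify', SGDClassifier(loss='log', penalty='elasticnet', random_state=0)),
])

param_grid = {
    # Trying fewer PCA sizes should matter most for memory.
    'pca__n_components': [30, 100],
    # Stepping alpha by 0.2 in log-space instead of 0.1 halves the candidates.
    'classify__alpha': 10.0 ** np.arange(-3, 1, 0.2),
}

searcher = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=1)
# searcher.fit(X_train, y_train)
```

Re-profiling with and without the smaller grid would tell us whether the candidate count or the PCA dimensionality drives peak memory.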
So I installed memory profiler (`pip install memory_profiler`) and then used the `%%memit` notebook magic. Here's the HTML export of the notebook: 2.mutation-classifier-1-job.html.txt. Reading the files consumed 4.5 GB, which increased to 6.5 GB after making the training / testing dataset. Fitting the default models peaked at 11.7 GB (assuming a 1000-to-1 mebibytes-to-gigabytes conversion, which is slightly off). So @kurtwheeler, looks like we must upgrade the AWS instance size and limit them to 1 job at a time.
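For anyone reproducing this: after `pip install memory_profiler`, load the extension with `%load_ext memory_profiler` in one cell, then put `%%memit` at the top of the cell you want to measure. A sketch of a measured cell (the file path and read call are illustrative, not necessarily the notebook's actual code):

```python
%%memit
# Report peak memory and increment (in MiB) for this cell,
# e.g. reading the expression matrix.
import pandas as pd
expression_df = pd.read_table('download/expression-matrix.tsv.bz2', index_col=0)
```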
Increasing to `n_jobs=4` increased peak memory to ~16 GB and gave the repeated warning:
/home/dhimmel/anaconda3/envs/cognoma-machine-learning/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1
**self._backend_args)
So we can keep `n_jobs=1` for now.
I've changed the cognoma EC2 instances from `m4.large` to `r4.large`, which increased the RAM from 8 GB to 15.25 GB.
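A quick sanity check I'd suggest (assuming `psutil` is installed in the environment, which is an assumption) to confirm the running process actually sees the larger instance's memory:

```python
import psutil

# Total and available physical memory in GiB; on an r4.large this should
# report roughly 15.25 GiB total.
mem = psutil.virtual_memory()
print('total: {:.2f} GiB, available: {:.2f} GiB'.format(
    mem.total / 2**30, mem.available / 2**30))
```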
FYI - I'm getting an error on cognoma.org when I try to search for diseases: "Failed to load diseases." in a pink bar across the top.
Me too! https://api.cognoma.org/diseases/ returns a 503 code.
Failed to load resource: the server responded with a status of 503 (Service Unavailable: Back-end server is at capacity)
disease-type:1 Failed to load https://api.cognoma.org/diseases/: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://cognoma.org' is therefore not allowed access. The response had HTTP status code 503.
@kurtwheeler and I will look into what failed tomorrow!
@rdvelazquez https://api.cognoma.org should now be back up. @kurtwheeler fixed it this morning. We had changed the instance type, but had not destroyed and recreated the instances (which ECS apparently requires).
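A quick way to verify the endpoint from Python (using `requests`, which is just my assumption about what's handy; curl would work the same way):

```python
import requests

# Expect HTTP 200 with a JSON list of diseases now that the ECS instances
# have been recreated; a 503 here would mean the backend is still down.
response = requests.get('https://api.cognoma.org/diseases/')
print(response.status_code)
response.raise_for_status()
```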
Awesome! I can confirm that it works on TP53 with all diseases included.
It's pretty fast too!
Here's the general punch list that we discussed at tonight's meetup for getting the machine learning part of cognoma launch ready.
To be completed at a later date: templating for Jupyter notebooks (@wisygig)