galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

[RFC] Runtime predictions #5873

Open atyryshkina opened 7 years ago

atyryshkina commented 7 years ago

Hello,

I want to give everyone an update on the runtime prediction project, get some feedback, and present a sample of how we can expect the tool to behave on Galaxy.

This past week and a half I've been working on improving the prediction model and making it more robust. I've settled on using a model called a Quantile Random Forest, which allows calculation of prediction intervals based on the variability in the training data. The main difference between a Quantile Random Forest and a Random Forest is that a QRF stores all of the training data in the leaves of its trees, while a regular RF only stores the means of the training data. This lets us calculate the prediction quantiles for every individual run. (We would be able to say something like, "90% of all jobs that made it to these leaves took between x and y seconds to complete.")
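For anyone who wants to poke at this, here is a minimal sketch of the idea using scikit-garden's RandomForestQuantileRegressor (the feature matrix and runtimes below are random placeholders, not real job data):

```python
# Minimal sketch of quantile prediction with scikit-garden; X_train/y_train are
# placeholder arrays standing in for job features (e.g. input sizes) and runtimes.
import numpy as np
from skgarden import RandomForestQuantileRegressor

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 3)               # placeholder job features
y_train = rng.rand(1000) * 3600           # placeholder runtimes in seconds
X_new = rng.rand(1, 3)                    # a "new" job to predict

qrf = RandomForestQuantileRegressor(n_estimators=100, random_state=0)
qrf.fit(X_train, y_train)

lower = qrf.predict(X_new, quantile=5)    # 5th percentile of the leaf samples
upper = qrf.predict(X_new, quantile=95)   # 95th percentile
mean = qrf.predict(X_new)                 # mean prediction
print("90%% prediction interval: %.0f - %.0f seconds (mean %.0f)"
      % (lower[0], upper[0], mean[0]))
```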

Because of the amount of data it stores in the prediction object, a Quantile Random Forest takes up significantly more space than a regular Random Forest. The ones I've been working with typically take about 70 MB of memory per tool, compared to about 20 MB for a regular Random Forest. The QRF takes ~1 millisecond to make a prediction.

Marten and I have discussed how this would work on the Galaxy Project Main.

We see two approaches:

The next bit is important. I want to make sure everyone knows how the tool performs, and that everyone is happy with the performance.

I trained a QRF on 20,000 instances of bwa_mem. Then I made predictions on input data that it had not seen before, all instances from March 24 - April 5, and I used prediction intervals of 90% and 80%.

The coverage of the model on the previously unseen data is 50% (50% of the actual runtimes fell within the 90% prediction interval).
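For reference, that coverage number is just the fraction of runs whose actual runtime falls inside its predicted interval; something like this, with hypothetical array names:

```python
# Sketch of the coverage check; actual_minutes, pred_lower, pred_upper would be
# arrays with one entry per job in the test window.
import numpy as np

def interval_coverage(actual, lower, upper):
    """Fraction of observed runtimes that fall inside their prediction interval."""
    actual, lower, upper = map(np.asarray, (actual, lower, upper))
    return float(np.mean((actual >= lower) & (actual <= upper)))
```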

Attached are the sample model predictions. -> bwa_mem_prediction_sample.txt

I think it's fairly self-explanatory. The first column is the id of the run. The second column is the number of minutes the run took. The third column is the model's 90% prediction interval. The fourth is the model's 80% prediction interval. And the last column is the model's mean prediction.

And here are some graphs of the data for your convenience.

[three images: graphs of the prediction results]

We are unsure whether these results are useful enough to implement on Galaxy. Let us know your thoughts on whether we should continue down this path or drop it.

Thanks, Anastasia

martenson commented 7 years ago

Is this something that usegalaxy.eu would be interested in building on their data? @erasche @bgruening

@atyryshkina managed to squeeze the tool-predict pickle down to 6 MB now and is trying to do more. The process still has deps on numpy, scikit, sklearn, and scikit-garden, which we probably don't want to introduce to Galaxy, but maybe as conditional deps for this feature it would be fine? We are trying to decide whether this could live in the instance's memory or whether the prediction should be provided as a service from a different program/Galaxy. @natefoo @jmchilton
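For illustration, the conditional-deps option could look roughly like this (the module and function names are hypothetical, not existing Galaxy code):

```python
# Hypothetical sketch: keep the prediction libraries optional by checking for
# them at load time, so Galaxy itself does not grow hard deps on them.
import logging
import pickle

log = logging.getLogger(__name__)

def load_runtime_model(pickle_path):
    """Load a per-tool prediction model, or return None if the optional deps are missing."""
    try:
        import skgarden  # noqa: F401 - verifies the optional dep needed to unpickle the QRF
    except ImportError:
        log.debug("scikit-garden not installed; runtime predictions disabled")
        return None
    with open(pickle_path, "rb") as f:
        return pickle.load(f)
```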

edit: this is what we are trying to give to the user on job submission:

95% of jobs that look like yours took between 30 and 45 minutes to complete

bgruening commented 7 years ago

Yes this is something we are interested in.

Indeed we have a webhook that does something similar. But we decided to stay out of the prediction business, as my gut feeling said we are missing a lot of data to make this accurate. I'm worried about different CPUs, different parameters, different versions, etc. So what we do is plot a very simple graph of already observed run times. This means we do not predict, but we give the user an indication of how long (and how often) this tool has run before; very, very simple. This data is even pre-calculated, so the webhook just reads a JSON file.
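Roughly the shape of it; this is a simplified sketch rather than the actual webhook code, and the field names are made up:

```python
# Sketch only: pre-compute a per-tool summary of observed runtimes so the UI
# just reads a static JSON file instead of querying the database.
import json
import numpy as np

def summarize_runtimes(tool_id, runtimes_seconds, out_path):
    """Write a small JSON summary (job count and runtime percentiles) for one tool."""
    runtimes = np.asarray(runtimes_seconds, dtype=float)
    summary = {
        "tool_id": tool_id,
        "n_jobs": int(runtimes.size),
        "percentiles_seconds": {
            "p10": float(np.percentile(runtimes, 10)),
            "p50": float(np.percentile(runtimes, 50)),
            "p90": float(np.percentile(runtimes, 90)),
        },
    }
    with open(out_path, "w") as f:
        json.dump(summary, f)
```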

I'm happy to be proven wrong!

A few things to consider:

martenson commented 7 years ago

Thanks for the feedback @bgruening

we could use GRT for the data

would be great down the road, but for now we are estimating feasibility, so we access data directly from Main's DB

we could pre-calculate the model/predictions in a separate process

Gaining what? The pickle is the representation of the prediction object; if we can afford to lose precision, we can make it smaller (which @atyryshkina is investigating now).

we could do this for workflows as well

Phase x+1, I think. Nice idea, though I would expect the training data to be scarce.

martenson commented 7 years ago

But we decided to stay out of the prediction business as my gut feeling said we are missing a lot data to make this accurate.

@atyryshkina can give you more info, but for the cases she is looking into (bwa_mem, groomer, bowtie) the really dominant predictor was the file size, afaik.

jmchilton commented 7 years ago

@atyryshkina has done really awesome work; I'm super impressed and excited about the predictive power.

Rather than rushing something into the GUI, where we don't even have a representation of jobs, I'd rather see a foundation built for taking advantage of this data in different ways, with a focus on admins first. If we were simply adding a little noise to our submission request parameters (core count mostly) and recording the memory jobs consumed, this would be hugely useful in analyzing our CPU usage; based on the preliminary data it seemed like we are allocating too many cores for multiple tools. Better allocation could speed up everyone's jobs.
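To make the "little noise" idea concrete, a toy sketch (not tied to Galaxy's actual job configuration; the names are made up):

```python
# Toy illustration: occasionally perturb the requested core count so the
# recorded runtimes and memory usage cover a range of allocations instead of
# one fixed default, which gives the model something to learn from.
import random

def choose_core_count(default_cores, alternatives=(2, 4, 8), explore_rate=0.1):
    """Return the default allocation most of the time, a random alternative otherwise."""
    if random.random() < explore_rate:
        return random.choice(alternatives)
    return default_cores
```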

I think the next stage of this work should focus on:

If we target admins, I think we can do a really awesome job building an abstract, pluggable framework for actionable intelligence that would extend beyond Galaxy while also providing concrete, sharable information about bioinformatics apps people actually run today. Like @bgruening, I'm less convinced we can provide really great insights to end users at this stage, but I understand that is the hope everyone has, so I'm fine being called a pessimist and I really do hope to be proven wrong. This is great work; thanks again @atyryshkina.

atyryshkina commented 7 years ago

Yep, for bwa_mem, groomer, and bowtie the file size is the most important feature the Random Forest considers.

Here are the feature importances for bwa_mem -> bwa_mem_feature_importances.txt. The feature importances are calculated by counting the number of times each feature is used to split branches of the trees in the forest.
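In code, the counting looks roughly like this; a sketch of the approach rather than the exact script I used:

```python
# Count how often each feature appears as a split variable across the forest;
# works on any fitted sklearn-style forest with an estimators_ attribute.
import numpy as np

def split_count_importances(forest, feature_names):
    """Count, per feature, the number of internal nodes that split on it."""
    counts = np.zeros(len(feature_names), dtype=int)
    for estimator in forest.estimators_:
        tree = estimator.tree_
        split_features = tree.feature[tree.feature >= 0]  # negative values mark leaf nodes
        counts += np.bincount(split_features, minlength=len(feature_names))
    return dict(zip(feature_names, counts))
```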

One interesting variable we do not have access to is the error rate allowed for the bwa_mem alignment. "BWA will be very slow if r is high because in this case BWA has to visit hits with many differences and looking for these hits is expensive" (from here). This is one reason two jobs that look the same can have vastly different runtimes. And the error rate is something that would not be available to us until after the job is finished.

I haven't been able to find similar info for other tools, but I'm sure many have these types of hidden variables that would not be immediately available to us.

afgane commented 7 years ago

I feel separating this out into its own service would be more desirable than integrating it within Galaxy. Here are a few reasons:

martenson commented 7 years ago

@afgane But the data is always project-specific and so are the predictions. Are you suggesting we aim to run this as a service for anybody that supplies data?

afgane commented 7 years ago

Right, probably resource-specific too. Still, this seems like a nice way to encourage data contributions from other Galaxy instances and gather more data. For example, smaller instances may not have sufficient data on their own to generate useful info, but when inspected in the context of more data, perhaps something comes out of it. All along, if filtering is enabled to limit the predictions to a specific server or some other factor, it won't hurt those that want to operate on a reduced/focused set of data.

As mentioned higher up, in combination with the GRT, yes anyone could then both contribute and extract data.

jmchilton commented 7 years ago

But the data is always project-specific and so are the predictions.

@afgane cedes this point but I absolutely do not. My whole point is that this data is more valuable to developers and admins than to Galaxy users. The particular predictions aren't as interesting as the factors used in the models. The predictions will become dated and are hard to utilize currently, given the flexibility admins have in Galaxy; the data, the factors, and the coefficients used to generate the predictions are really quite interesting.

Let's say I'm a bioinformatics application developer. Maybe Galaxy isn't particularly interesting to me because I like Makefiles or Toil or bash ... still, this data and the model for predictions sure might be interesting. Who has the time or resources to collect all of this data and then try to figure out what is useful in predicting runtimes and memory usage? I'd guess the typical bioinformatics developer does not. If I see across many different resources, many different tool versions, and even tool types (e.g. CWL vs Galaxy) that core count isn't useful as a predictor past a certain point, or if I see some parameter is hugely important that I didn't expect, those are both really actionable, interesting things in very different ways.

If instead I'm an application support specialist at an institute like my former employer MSI, I again may not be particularly interested in Galaxy, but I have even less time to figure out how to predict the runtime of an application like bwa that I am deploying and documenting, and to convey that to my users. This resource gives them a serious place to start. This resource is something we could be contributing back to the broader bioinformatics community. Beyond the general point that providing community resources should be our mission, application developers and support personnel getting this information from Galaxy tools is really great for us: it raises our profile and makes it clear there are serious upsides to contributing to tool wrapping efforts and creating applications that map more cleanly to Galaxy tools. We want application developers to want their users to use Galaxy; that is really important, and we aren't currently doing anything to incentivize that.

I'm trying to understand why I'm pushing to go big and general here when I usually push for smaller, more application-specific, pragmatic solutions. I think the answer is that we don't have a competitor in this space, unless I'm missing something (very possible). Who has all this random data, at this scale, over so many tools, and keeps track of it in such a structured way? I think only us, right? If we become useful to application developers and admins, we could become indispensable, and then funding agencies have to give us money ... and money to do something broadly useful and maximally open is the best kind of money!

afgane commented 7 years ago

As you point out, John, specificity to a project or resource or whatever is not necessarily bad, but it is a likely influencing factor. Some people may want to focus on their own data/resource/whatever (hence the filtering option), but, yes, others may just want more data. To me, the idea of a (standalone) service for fetching the data and allowing (arbitrary) operations on it feels like a higher-level service on top of something like GRT, making that data more accessible and generally usable. If we can pull off appealing to people other than Galaxy here, all the better.

martenson commented 7 years ago

Just to be clear: I am not trying to squeeze the scope or discourage big thinking, but rather to stay focused on the [RFC] Runtime predictions topic. I think GRT is the project we should be pushing so that the community can share interesting data in a useful structure.

However, a little bit of pragmatism could be a good choice here, since GRT did not make it onto the roadmap but prediction engines did (in the form of @atyryshkina). If we commit to something too big, we might be putting constraints on what she can do 'for the greater good', and I would like to prevent that. Hopefully I am wrong.

Many ideas mentioned here are great, keep them coming.