Benjamin-Lee / deep-rules

Ten Quick Tips for Deep Learning in Biology
https://benjamin-lee.github.io/deep-rules/

Future-proof: Plan how you will support and share your model #63

Open smsaladi opened 5 years ago

smsaladi commented 5 years ago

Have you checked the list of proposed rules to see if the rule has already been proposed?

Did you add yourself as a contributor by making a pull request if this is your first contribution?

Much of the discussion thus far has concerned creating/crafting a model, but as far as I can tell, there hasn't been much discussion of what happens afterwards (say, after you publish the model). Since lots of deep learning is done to solve problems, making the model available as a tool seems just as important as crafting it.

This conversation is not unique to deep learning, but I think it's especially relevant here because of the hard work being done to make the application layer (e.g., TensorFlow, PyTorch) runnable across machines/architectures and to support it.

In the past, the question has been "can you get everything running on your machine?", say, to process the input data into the format necessary for inference (in my experience, this is not trivial) and then actually run the inference code. With deep learning, however, the input is often the raw data itself (perhaps with a bit of processing). In my experience with sequence bioinformatics, people are building models that use the DNA or protein sequence itself, without any featurization calculations/pre-processing.
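For instance, a minimal sketch of running inference directly on a raw DNA sequence might look like the following (the model file name and the helper are hypothetical, and PyTorch is just one possible framework):

```python
import numpy as np
import torch

def one_hot_encode(seq, alphabet="ACGT"):
    """One-hot encode a DNA sequence into a (length, 4) float array."""
    idx = {base: i for i, base in enumerate(alphabet)}
    encoding = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in idx:  # ambiguous bases (e.g., N) stay all-zero
            encoding[pos, idx[base]] = 1.0
    return encoding

# Hypothetical serialized model shared alongside the paper.
model = torch.jit.load("trained_model.pt")
model.eval()

x = torch.from_numpy(one_hot_encode("ACGTACGTTTGACA")).unsqueeze(0)  # add batch dimension
with torch.no_grad():
    prediction = model(x)
print(prediction)
```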

With a deep learning model, if it's built on a "standard" platform, sharing the model and running it oneself is significantly simplified.
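As one illustration (not prescribed here), saving both the weights and the architecture in a framework's standard serialization format makes the model straightforward to redistribute; the file names below are placeholders:

```python
import torch

# Assume `model` is a trained torch.nn.Module.
# Option 1: share the weights alone (the recipient needs the model-definition code).
torch.save(model.state_dict(), "model_weights.pt")

# Option 2: export an architecture-plus-weights artifact via TorchScript,
# which can be loaded for inference without the original training code.
scripted = torch.jit.script(model)
scripted.save("trained_model.pt")

# The recipient can then run:
#   model = torch.jit.load("trained_model.pt"); model.eval()
```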

When building a deep learning model (or perhaps shortly after), my suggestion is that the authors think about how they will share it with the user community. Here are some ideas:

And how will they maintain the application going forward? What if they (or someone else) put out a model that performs better on the same benchmarks? Will they turn their service off if the second model is made available? Will they host this second model?

PS. This is an awesome effort!

rasbt commented 5 years ago

This is interesting. I think these are general questions someone might ask about any sort of machine learning in any field, though. So maybe we want to highlight that in DL (compared to other forms of modeling), it's especially worth thinking about this because of

a) the typically large resources required for retraining (assuming code and instructions are available)

b) the results are usually not deterministic (non-convex optimization, plus non-deterministic behavior even for fixed random seeds due to heuristics such as Winograd-based convolution implementations; see the sketch below)

So there should be more incentive here to share the model weights along with the model code than for, e.g., linear regression, where retraining is trivial and someone could just provide the dataset or the weights.
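On point (b), a minimal sketch of the seed-setting and determinism flags involved, using PyTorch as an example (the exact flags vary by framework and version, and some operations still have no deterministic implementation):

```python
import random

import numpy as np
import torch

# Fix the seeds of every RNG involved in training.
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

# Ask PyTorch/cuDNN to prefer deterministic kernels; even with these flags,
# exact bitwise reproducibility across hardware is not guaranteed.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True, warn_only=True)
```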

Now the question is how to make this more specific to DL in bio. Maybe because datasets are smaller and noisier, the initial random weights matter even more for converging to a useful local minimum? In that case, someone should think "even more" about sharing the model.

agitter commented 5 years ago

Will they provide the model, weights, and a code example of how to use it for inference (least convenient)?

Kipoi is intended to address this subset of your questions. They provide conda-based infrastructure and specifications to run trained models, do post-processing, benchmark models, and many other related things. See also #21.
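For reference, pulling and running a published model through Kipoi looks roughly like this (the model name is just an example from the model zoo, and the exact API should be checked against the Kipoi docs):

```python
import kipoi

# "Basset" is one example model available in the Kipoi model zoo.
model = kipoi.get_model("Basset")

# Run the model on its bundled example files to confirm the setup works.
predictions = model.pipeline.predict_example()
print(predictions)
```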

@evancofer may be able to comment on how Selene fits in here.

Benjamin-Lee commented 5 years ago

By the way, @smsaladi, do you mind making a PR adding your name to contributors.md so we can make sure to acknowledge your contributions?

blengerich commented 5 years ago

@smsaladi @rasbt Nice points. I agree that it's important for MLers to define the intended use case of their models and "future-proof" accordingly.

Another aspect which makes "future-proofing" important for scientific/biological inquiries is that our experimental goal is often to understand phenomena. This means that our job is not necessarily finished after training an accurate model. Instead, we can consider the model itself as experimental data for research parasites (@cgreene et al) to analyze. Because DL models have many parameters, there can be a lot of fruit for these follow-on studies to harvest (#36), so designing an experimentally-guided plan for dissemination is important here. As @agitter mentioned, Kipoi can ease the logistics.

evancofer commented 5 years ago

@agitter I think they have slightly orthogonal goals. Whereas Kipoi eases maintainability/reuse of a user's implementations, Selene eases the implementation. As such, it may be more relevant to a discussion of tools for deep learning in genomics than this discussion of tools for maintainability.

fmaguire commented 5 years ago

Mentioned loosely in tip 3; not sure that's enough to close this, though.