dotmesh-io / roadmap

Roadmap: where we take the product in the next 1-6 months in terms of features

Datadots API/SDK to make pipelines (CI, data science) faster & more reproducible #6

Open lukemarsden opened 6 years ago

lukemarsden commented 6 years ago

More description coming soon

Design doc

mhausenblas commented 6 years ago

I'm very interested in this and would be applying it in the context of KAML-D.

lukemarsden commented 6 years ago

Thanks @mhausenblas! Updated the title to be more generic and include data science as a class of pipeline :)

lukemarsden commented 6 years ago

@mhausenblas could you write a little about your use case here please? Would be awesome if you could phrase it as user stories 🙏

lukemarsden commented 6 years ago

In particular, why are snapshots important to kamld?

mhausenblas commented 6 years ago

OK, here goes:

So the problem KAML-D tries to tackle is the double divide, and there especially the data scientist -> data engineer/developer gap. Imagine a team that works on an app (yeah, there's an app for that) that has a machine learning feature, for example a face or voice recognition task. Now, the awesome data scientists develop their model, using, for example, R or Python. How do they go about it? Glad you asked. They get some training data (let's say some CSV dump or maybe a ZIP with a gazillion .png images) and start screwing around, erm, choosing a good machine learning approach (like unsupervised or reinforcement learning).

Every time the data scientists iterate, they adjust the training data: maybe cleaning it up, adding more data, or whatever. Then, of course, they need to split out a part, maybe 30%, as test data. So, guess what? They're literally copying the original dataset (say myawesomedata/), removing/adding stuff as they need, and storing it under a new name, maybe myawesomedata1/ or even fancy stuff like myawesomedata_2018-03-02_10am/. Congrats, you've just (re-)invented versioning. In a really bad way, FWIW.
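(For the record, that ad-hoc scheme boils down to something like the following — hypothetical paths, plain Python stdlib, just to make the cost explicit: every iteration pays for a full duplicate of the dataset.)

```python
import shutil
from datetime import date
from pathlib import Path

def copy_version(dataset: Path) -> Path:
    """The ad-hoc 'versioning' scheme: duplicate the whole dataset
    under a date-stamped name. Every iteration costs a full copy,
    and nothing records what changed or why."""
    versioned = dataset.with_name(f"{dataset.name}_{date.today().isoformat()}")
    shutil.copytree(dataset, versioned)
    return versioned
```

With a multi-TB dataset this is obviously untenable, which is exactly where real snapshots come in.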

And this is just the learning/training phase. Once they have the model, they need to serialise it to make it available for the data engineers/devs to actually (re-)implement it in another language and/or environment such as Spark or Flink (Scala) to make it production ready. Whether they're using a proprietary format such as TensorFlow's checkpoint files or an interchange format such as ONNX or CoreML, the model can and will change (drivers may include new data, a better model or algorithm, etc.), and that updated model needs versioning, again, same as the dataset above.
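(One cheap way to at least identify serialized model versions unambiguously, whatever the format, is to content-address the artifact. A minimal sketch — not part of any existing SDK, just an illustration of the idea:)

```python
import hashlib
from pathlib import Path

def model_version_id(model_file: Path) -> str:
    """Derive a stable version id from the serialized model's bytes:
    identical artifacts always get the same id, any change gets a new one.
    Reads in 1 MiB chunks so multi-GB models don't need to fit in memory."""
    h = hashlib.sha256()
    with model_file.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:12]
```

This gives you identity, but not history — you still need something to record which ids follow which, which is the versioning problem again.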

Now, one could argue that they could use GitHub to capture the respective dataset at any given point in time, but did you know GitHub has a limit of 100 MB per file? Ah, OK, so I'll just put it on S3 and enable versioning on the bucket! Sure, you can do that, if you'd like to be locked into S3 ;)

So, what we're looking for is a system that meets the following requirements:

  1. Programmatically take snapshots of datasets and models in a portable way
  2. The snapshots should not be limited in terms of size (let's be pragmatic, say at least a couple of TB)
  3. The ability to roll back and forward in time for both datasets and serialized models
  4. A nice UX (CLI and/or UI) for managing the snapshots

I believe the Datadots API can help at least with 1. to 3., so that's why it would be very dope to have it available.
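(To make requirements 1. to 3. concrete, here's a minimal sketch of what such a snapshot API could look like. All names are hypothetical — this is not the Datadots SDK — and a real implementation would use copy-on-write snapshots rather than the full directory copies used here for simplicity.)

```python
import shutil
from pathlib import Path

class SnapshotStore:
    """Hypothetical sketch of requirements 1-3: programmatic snapshots
    of a working directory, plus roll-back / roll-forward through them."""

    def __init__(self, workdir: Path, store: Path):
        self.workdir = workdir
        self.store = store
        self.history: list[str] = []   # snapshot ids, oldest first
        self.cursor = -1               # index of the currently checked-out snapshot

    def snapshot(self, message: str) -> str:
        snap_id = f"{len(self.history):04d}-{message}"
        shutil.copytree(self.workdir, self.store / snap_id)
        # Taking a new snapshot discards any roll-forward history.
        self.history = self.history[: self.cursor + 1] + [snap_id]
        self.cursor = len(self.history) - 1
        return snap_id

    def _restore(self, snap_id: str) -> None:
        shutil.rmtree(self.workdir)
        shutil.copytree(self.store / snap_id, self.workdir)

    def rollback(self) -> str:
        self.cursor -= 1
        self._restore(self.history[self.cursor])
        return self.history[self.cursor]

    def rollforward(self) -> str:
        self.cursor += 1
        self._restore(self.history[self.cursor])
        return self.history[self.cursor]
```

The same interface works for datasets and for serialized models, since both are just bytes on disk; requirement 2 (multi-TB snapshots) is exactly why the backing store would need copy-on-write rather than copies.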

lukemarsden commented 6 years ago

This may also be useful in serverless contexts. We should investigate.

deitch commented 6 years ago

@mhausenblas the link to kamld.com appears to be dead? I get the GitHub Pages 404.

I like how you painted the data scientist and data engineer "we reinvented versioning" picture, and the divide. That really makes it clear. (did I say we are lucky to have Michael explaining things to us?)

@lukemarsden so is the API a programmatic way to do all of the things that you can do with dm CLI? Is it the dm CLI plus interacting with dothub? Or is it digging deeper into the contents of dots and subdots themselves?

deitch commented 6 years ago

This may also be useful in serverless contexts.

It very well may. Since serverless is still young, lots of people are still figuring things out, hence the many open and fast-moving conferences. We could check out a number of them just to learn, or to present.

mhausenblas commented 6 years ago

@deitch oooops, moved to http://design.kamld.com

deitch commented 6 years ago

Looks good @mhausenblas. I definitely have a better understanding now. Thanks!

mhausenblas commented 6 years ago

Any updates here? Also, in today's OpenShift ML SIG we saw a Pachyderm demo, they have this functionality ;)