dotmesh-io / dotscience

dotscience Python library

[1d] publish a python client library for printing dotscience metadata #1

Closed lukemarsden closed 6 years ago

lukemarsden commented 6 years ago

Context

We want to make it easier for people to emit the DOTSCIENCE_ annotations, and the print statements have the following problems:

Requirements

Develop a trivial Python library that solves the above problems and publish it on PyPI. Usage should look like this:

import dotscience as ds
# The following two methods throw exceptions if agent1/2 or model isn't a mountpoint
ds.input("agent1", "agent2")
ds.output("model")

# The following methods can either copy the values at call time, or keep the
# reference for completion - probably a copy is better as the user will probably
# expect the value _right now_ to be captured.
ds.metric("f-score", f_score)
ds.parameter("batch-size", batch_size)
ds.label("frobrinator", "off")

# They also return the result for handy use like this:
tensorflow.setBatchSize(ds.parameter("batch-size", 0.3))

# Multiple stats, params or labels can be passed, as long as the number of
# positional arguments is even (name/value pairs)
ds.metric("f-score", f_score, "batch_size", batch_size)

# Alternate calling style with **kwargs
ds.metric(a=1, b=2)
ds.parameter(c=3, d=4)

# Preview the metrics in human-readable form without publishing them to
# dotscience even if the notebook is saved
ds.debug()

# Report the metrics
ds.report()

# Report data changes, but no metrics (summary stats)
ds.report(plot=False)

The final method, ds.report(), will print:

---
DOTSCIENCE_INPUTS=["agent1", "agent2"]
DOTSCIENCE_OUTPUTS=["model"]
DOTSCIENCE_SUMMARY={"f-score": 0.1, "batch-size": 0.9}
DOTSCIENCE_PARAMETERS={"c": 3, "d": 4}
---
Note to Jupyter users: don't forget to save your notebook in order to publish
these results to dotscience.
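
For concreteness, here's a minimal sketch of a module that could satisfy this spec (the internal structure is an assumption for illustration, not a design decision):

import json
import os

_inputs, _outputs = [], []
_summary, _params = {}, {}

def input(*paths):
    # Per the spec: raise if a path isn't a mountpoint
    for p in paths:
        if not os.path.ismount(p):
            raise ValueError("%s is not a mountpoint" % p)
    _inputs.extend(paths)

def output(*paths):
    for p in paths:
        if not os.path.ismount(p):
            raise ValueError("%s is not a mountpoint" % p)
    _outputs.extend(paths)

def metric(name, value):
    _summary[name] = value  # capture the value as it is right now
    return value            # returned so calls can be used inline

def parameter(name, value):
    _params[name] = value
    return value

def report():
    print("---")
    print("DOTSCIENCE_INPUTS=" + json.dumps(_inputs))
    print("DOTSCIENCE_OUTPUTS=" + json.dumps(_outputs))
    print("DOTSCIENCE_SUMMARY=" + json.dumps(_summary))
    print("DOTSCIENCE_PARAMETERS=" + json.dumps(_params))
    print("---")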

Open questions

satsumas commented 6 years ago

I think this is a good alternative to the print statement annotations, and I like the .report() method.

The alternate calling styles felt very different to me. In the first, you pass a mixed tuple, ("f-score", f_score); in the second, you give it a key=value pair, "f-score" = f_score. I prefer the second style as it generalises better to many params/scores, since it's easier to read/debug n kwargs than a single 2n-length tuple.

RE Open Questions: I think summary is better than metric.

Does ds.report() publish to dotscience?

Further features

Append params, inputs, outputs on the fly

One thing I would like as a user is the flexibility to add annotations throughout the notebook, instead of having just one opportunity to define all my DOTSCIENCE_LABELS, DOTSCIENCE_PARAMS, INPUTS, and OUTPUTS. I frequently want to add more as I go, and I rarely know all my inputs up front. If we enable adding more annotations, then instead of DOTSCIENCE_LABELS = <all_my_labels> it would be more intuitive to use some kind of append-like method -- something like `dotscience_labels.append(just_thought_of_new_label)` for each new label to be added. The same goes for params, inputs, and outputs. See the usage sketch below.
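
For example (a usage sketch; the accumulate-as-you-go semantics here are the proposal, not current behaviour):

import dotscience as ds

ds.input("raw_data")               # known at the start
ds.parameter("batch-size", 32)

# ...much later in the notebook, when a new input turns up:
ds.input("extra_features")         # appended to DOTSCIENCE_INPUTS
ds.label("stage", "experimental")  # merged into DOTSCIENCE_LABELS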

Multiple runs per notebook

Suppose I want to train my model multiple times, each with different parameterisations, in the same notebook. I want to compare each run using the Dotscience UI. I can't do this right now, because each notebook can contain at most one run's annotations. To address this, I'd like to be able to do something like:

  1. Add a run, give it a name/number
  2. Pass in that name/number to each subsequent annotation of params and summary stats
  3. See each run plotted individually on the experiment tracker

Note that this needn't be the primary use case: if you don't give your run an ID, then all your summary stats and params are presumed to be from the same run, which is defined implicitly by the notebook when the user saves it.
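
One hypothetical shape for that API (the ds.run method and its arguments are invented here purely for illustration; the thread below ends up preferring an explicit ds.clear() instead):

import dotscience as ds

ds.run("lr-0.1")                    # hypothetical: start a named run
ds.parameter("learning-rate", 0.1)
ds.summary("accuracy", 0.91)

ds.run("lr-0.01")                   # subsequent annotations attach to this run
ds.parameter("learning-rate", 0.01)
ds.summary("accuracy", 0.94)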

lukemarsden commented 6 years ago

Does ds.report() publish to dotscience?

As it says in the sample output above, it will do so after the user clicks save (or presses cmd/ctrl+S) on the notebook, yes.

lukemarsden commented 6 years ago

something like dotscience_labels.append

Calling any of the "information gathering" methods (e.g. .params()) more than once should probably merge the new values in with the existing ones. In other words, it should build up the report as you go, allowing you to call each method multiple times. We should probably add a clear() method as well, to let you reset them.
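
Building on the module sketch earlier in this thread, the merge-and-reset behaviour could be as simple as this (an assumption about internals, not a committed design):

_labels = {}  # alongside _inputs/_outputs/_summary/_params from the earlier sketch

def label(**kwargs):
    _labels.update(kwargs)  # repeated calls merge rather than replace

def clear():
    # Reset all accumulated state, ready for the next run
    for store in (_inputs, _outputs, _summary, _params, _labels):
        store.clear()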

lukemarsden commented 6 years ago

I want to train my model multiple times, each with different parameterisations, in the same notebook

It should be fine if you do it like this (without IDs or named runs):

<do first model>
ds.input("a"); ds.output("b')
ds.param(a=1); ds.summary(b=2)

ds.clear()
<do second model>
ds.input("b"); ds.output("c")
ds.param(a=2); ds.summary(b=3)

and we could include some good examples in the dotscience python library docs :)

lukemarsden commented 6 years ago

RE Open Questions: I think summary is better than metric.

Good, let's use summary then.

lukemarsden commented 6 years ago

The alternate calling styles felt very different to me. In the first, you pass a mixed tuple, ("f-score", f_score); in the second, you give it a key=value pair, "f-score" = f_score. I prefer the second style as it generalises better to many params/scores, since it's easier to read/debug n kwargs than a single 2n-length tuple.

The only downside to this is that you can't pass params whose names aren't valid Python identifiers.

e.g. you can't say

ds.summary(f-score=0.9)

because hyphens aren't valid in Python identifiers.

Technically you could do this if you really wanted hyphens or spaces:

ds.summary(**{"f-score": 0.9, "metric with spaces": 3.14159})

But I don't think that's a major worry. Let's start by just supporting the kwargs calling form.

satsumas commented 6 years ago

<do first model>
ds.input("a"); ds.output("b")
ds.param(a=1); ds.summary(b=2)

ds.clear()
<do second model>
ds.input("b"); ds.output("c")
ds.param(a=2); ds.summary(b=3)

Yes, that would work and avoids introducing new names.

lukemarsden commented 6 years ago

I think this issue is ready to begin implementing!

satsumas commented 6 years ago

updated library spec and use cases here: https://docs.google.com/document/d/1Bk7cHMu1J2PwbShVHY6QchZ_l2k7fFxglMEk4hqiYKw/edit?usp=sharing

lukemarsden commented 6 years ago

The library should generate UUIDs for runs as well, so we can reliably distinguish runs even when the stats haven't changed.

I also think we should switch to ds.publish generating a single JSON document so that the committer can be simplified.
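
A sketch of what ds.publish() could emit, using the module state from the earlier sketch (the marker format here is an assumption; anything the committer can find unambiguously would do):

import json
import uuid

def publish():
    run_id = str(uuid.uuid4())  # distinguishes runs even if the stats are identical
    doc = {
        "id": run_id,
        "input": _inputs,
        "output": _outputs,
        "summary": _summary,
        "parameters": _params,
    }
    # One atomic JSON document instead of many separate prints
    print("[[DOTSCIENCE-RUN:%s]]%s[[/DOTSCIENCE-RUN:%s]]"
          % (run_id, json.dumps(doc), run_id))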

alaric-dotmesh commented 6 years ago

@lukemarsden has a brain dump:

The Python library is a global object you call methods on. It keeps state in the object, tracking the params/stats/etc. as they are built up in memory. When you call ds.publish() it will generate a UUID for the run and output the in-memory state along with it.

When the user saves the file, the committer scans it and notices new runs that it hasn't previously committed, and creates a new commit containing them all.

The output format that the committer reads should be a single JSON document, with suitable markers so it can be found amongst other content; that makes it "atomic" compared to our current lots-of-different-prints. Pretty-printing it for human readability would be a desirable optional feature.
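
On the committer side, finding those documents could then be a simple scan (Python used here for illustration, with the same assumed marker format as above):

import json
import re

RUN_RE = re.compile(
    r"\[\[DOTSCIENCE-RUN:([0-9a-f-]+)\]\](.*?)\[\[/DOTSCIENCE-RUN:\1\]\]",
    re.DOTALL,
)

def find_new_runs(text, already_committed):
    # Yield each run document that hasn't been committed before
    for run_id, body in RUN_RE.findall(text):
        if run_id not in already_committed:
            yield json.loads(body)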

lukemarsden commented 6 years ago

Also need to release this on PyPI and burn it into the Jupyter image we ship!

lukemarsden commented 6 years ago

Relates: https://github.com/dotmesh-io/frontend-ng/issues/240

alaric-dotmesh commented 6 years ago

I'm going to read up on PyPI and set up a team account for us there (credentials in LastPass) so I can publish the thing.

Also: Repo is private for now. I'll make it public when it's been reviewed by people more familiar with Pythonic conventions than I am.

alaric-dotmesh commented 6 years ago

Repo is now public, and CI is set up.

I'm building a Docker image based on python:3 with it installed.

Per-commit images are called quay.io/dotmesh/dotscience-python3:GITHASH.

There's a manual job in the CI on GitLab to push a copy to quay.io/dotmesh/dotscience-python3:latest.

There's another manual job to publish it on pypi, but before doing that, make sure to change the version number in setup.py or it'll complain that that version already exists there!
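
That is, something like this in setup.py (field values illustrative, not the actual file):

from setuptools import setup

setup(
    name="dotscience",
    version="0.1.1",  # bump before each PyPI publish, or the upload is rejected
    packages=["dotscience"],
)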

Still to do: burn it into the Jupyter image.

alaric-dotmesh commented 6 years ago

Ok, it's now in the jupyter image as of https://github.com/dotmesh-io/jupyterlab-tensorflow/commit/ec7b1f53bc22758d90e77d9777332a8ff5b9b901