Design the new pipeline - Githubissues

e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.

https://e-mission.readthedocs.io/en/latest

BSD 3-Clause "New" or "Revised" License

15 stars 34 forks source link

Design the new pipeline #262

Open shankari opened 7 years ago

shankari commented 7 years ago

This issue documents the design discussion and choices for updating the mode inference pipeline to the new data model.

shankari commented 7 years ago

Design goals:

Separate model building and model application steps
Seed model building from old-style (Moves) data
Support multiple models in parallel

shankari commented 7 years ago

Both 1. and 2. from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314966304 imply that we should save the model and re-load it. We can use either standard pickle or jsonpickle to convert the model to json (with the caveat that it is custom to a particular version of scikit-learn)

But how should we store these models?

Store to disk
Store to database

Once we move to user-specific models, we can store them to the database. In fact, if we generate models periodically, we might want to store the history of them in the database. But right now, since we don't allow users to edit/confirm modes, we will start with the seed model built from old-style (Moves) data in the Backup and Section databases. Since we don't generate any more Moves-style data, this model will never change.

So should it be stored on disk?

shankari commented 7 years ago

In order to answer that question, we need to think about how the seed database will be

Another thing to note is that we may want to continue to use the old-style (Moves) data until we build up a critical mass of new data. I just checked, and it is 14104 + 7439 = 21543 entries, which is pretty good. It looks like we can combine random forest models https://stackoverflow.com/questions/28489667/combining-random-forest-models-in-scikit-learn Not sure about other models from scikit-learn.

So the options for seeding are:

Read saved model + combine with newly created model = combined model
Read old data + combine with new data = combined model

Both of those seem reasonable, but 1. seems slightly better because then we can export the model as a seed for other server instances without sharing the raw data.

shankari commented 7 years ago

ok, so given that we are going with 1. from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314968818, we will really have a static seed model and can store it however we want. I'm tempted to store it in a file in the current directory, just to keep things simple.

shankari commented 7 years ago

Saved it as 'seed_model.json' (https://github.com/e-mission/e-mission-server/commit/1149631d80bd967dcf85f4a30fc998c6e01fe16d)

shankari commented 7 years ago

Note also that the current pipeline only loads the backup data if there is insufficient "new" data. But the backup data is ~ 1/3rd of the total (7439 / 21543 = .3453) and it seems sad to lose it. Seems like it would be good to support both.

Again, there are two ways of doing this:

Copy all confirmed sections into one "training set" database
Support loading from two databases

Option 2. appears to be the least work at this time, although we may want to revisit this once we start experiments on the analysis.

shankari commented 7 years ago

Now we move on to the real pipeline. The current code can be broadly divided into:

extract features
model
infer

We can split these into three files corresponding to those stages, but really, they are specific to this algorithm (aggregate random forest with a specific set of features). Other algorithms will have potentially have different implementations of each of them.

Let's just do this for one algorithm first and then extend to multiple algorithms in a later step. Note that we will need to generate seed models for each of the other algorithms as well, so its good to do that in a separate step.

shankari commented 7 years ago

ok so now that we are building the real pipeline, we need to figure out how to store the newly created models. The plan from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314966462 was to store periodically generated models to the database.

It is fairly clear how to do this for user-specific models, but not for generic models that are built on aggregate data. Our timeseries is all focused on users, and our aggregate is simply a query across users.

shankari commented 7 years ago

Basically, we need to create a new non-user-specific timeseries. We can do this in at least three ways:

use the None user_id
create a specific tag that represents no user
create a separate database for non-user-specific entries

The easiest option is probably (2). The most principled option is probably (3). As long as we can get the interface right, there isn't much to choose between the options.

shankari commented 7 years ago

Last thing we need to figure out to finish the new model building part is to decide how we will store the confirmed values. That will allow us to query sections that have been confirmed.

shankari commented 7 years ago

We haven't yet figured out what kinds of edits we want to do, so this is a bit tricky. But in order to move past this really complicated issue, I am going to assume that we only support mode edits/confirmations. We will represent this using a manual/confirm_mode type entry.