shankari opened this issue 7 years ago
Design goals:
Both 1. and 2. from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314966304 imply that we should save the model and re-load it.
We can use either standard pickle (which serializes to a binary blob) or jsonpickle (which converts the model to JSON), with the caveat that the serialized model is tied to the particular version of scikit-learn that produced it.
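As a sketch of the save/re-load step, here is a round-trip through standard pickle. The toy data and model parameters are placeholders, not the real pipeline's features:

```python
# Serialize a trained scikit-learn model and load it back with pickle.
# Toy data only; the real pipeline's features and labels differ.
import pickle
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 1, 0]

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# The pickled blob is tied to the scikit-learn version that produced it,
# which is the versioning caveat mentioned above.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

assert list(restored.predict(X)) == list(model.predict(X))
```

jsonpickle follows the same pattern (`jsonpickle.encode` / `jsonpickle.decode`) but produces a JSON string instead of bytes.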
But how should we store these models?
Once we move to user-specific models, we can store them to the database. In fact, if we generate models periodically, we might want to store the history of them in the database. But right now, since we don't allow users to edit/confirm modes, we will start with the seed model built from old-style (Moves) data in the Backup and Section databases. Since we don't generate any more Moves-style data, this model will never change.
So should it be stored on disk?
In order to answer that question, we need to think about how the seed model will be used.
Another thing to note is that we may want to continue using the old-style (Moves) data until we build up a critical mass of new data. I just checked, and there are 14104 + 7439 = 21543 entries, which is pretty good. It looks like we can combine random forest models:
https://stackoverflow.com/questions/28489667/combining-random-forest-models-in-scikit-learn
Not sure about other models from scikit-learn.
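Per the Stack Overflow answer above, two fitted forests trained on the same classes can be merged by concatenating their trees. A toy sketch (data and parameters are illustrative only):

```python
# Combine two fitted random forests by concatenating their estimators,
# as suggested in the Stack Overflow answer linked above.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]] * 5
y = [0, 1, 1, 0] * 5

rf_old = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
rf_new = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)

# Merge: append the new trees and update the estimator count.
# This only makes sense when both forests saw the same set of classes.
rf_old.estimators_ += rf_new.estimators_
rf_old.n_estimators = len(rf_old.estimators_)
```

This is exactly why the trick is forest-specific: it relies on a forest prediction being an average over independent trees, which has no analogue for most other scikit-learn models.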
So the options for seeding are:
1. build the model from the old data once, and distribute the serialized model to server instances
2. distribute the old data itself, and build the model on each server instance
Both of those seem reasonable, but 1. seems slightly better because it lets us export the model as a seed for other server instances without sharing the raw data.
ok, so given that we are going with 1. from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314968818, we will really have a static seed model and can store it however we want. I'm tempted to store it in a file in the current directory, just to keep things simple.
Saved it as 'seed_model.json' (https://github.com/e-mission/e-mission-server/commit/1149631d80bd967dcf85f4a30fc998c6e01fe16d).
Note also that the current pipeline only loads the backup data if there is insufficient "new" data. But the backup data is roughly a third of the total (7439 / 21543 = 0.3453), and it seems a shame to lose it. It would be good to support both.
Again, there are two ways of doing this:
1. combine the backup and new data and train a single model
2. train separate models on the backup and new data and combine the resulting forests
Option 2. appears to be the least work at this time, although we may want to revisit this once we start experimenting with the analysis.
Now we move on to the real pipeline. The current code can be broadly divided into:
1. feature extraction
2. model building
3. prediction
We can split these into three files corresponding to those stages, but really, they are specific to this algorithm (an aggregate random forest with a specific set of features). Other algorithms will potentially have different implementations of each of them.
Let's just do this for one algorithm first and then extend to multiple algorithms in a later step. Note that we will need to generate seed models for each of the other algorithms as well, so it's good to do that in a separate step.
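A minimal skeleton of that three-file split might look like the following. All function names and the feature set are hypothetical, not the actual e-mission interfaces:

```python
# Hypothetical skeleton of the three pipeline stages for one algorithm
# (aggregate random forest). Names and features are illustrative only.

def extract_features(section):
    """Turn one section into the feature vector this algorithm consumes."""
    return [section["distance"], section["duration"]]

def build_model(feature_matrix, labels):
    """Train (or re-train) the algorithm-specific model."""
    ...  # e.g. fit a RandomForestClassifier and serialize it

def predict_modes(model, feature_matrix):
    """Apply the trained model to unlabeled sections."""
    ...  # e.g. return model.predict(feature_matrix)
```

Keeping the three stages behind a common interface like this is what later lets other algorithms plug in their own implementations.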
ok so now that we are building the real pipeline, we need to figure out how to store the newly created models. The plan from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314966462 was to store periodically generated models to the database.
It is fairly clear how to do this for user-specific models, but not for generic models that are built on aggregate data. Our timeseries is all focused on users, and our aggregate is simply a query across users.
Basically, we need to create a new non-user-specific timeseries. We can do this in at least three ways:
1. store the entries with a `user_id` of `None`
2. store the entries under a special dummy `user_id`
3. create a separate, genuinely non-user-specific timeseries

The easiest option is probably (2). The most principled option is probably (3). As long as we can get the interface right, there isn't much to choose between the options.
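For illustration, a stored aggregate-model entry with a placeholder `user_id` might look like this. Every field name here (`metadata`, `data`, `inference/model`, the `None` sentinel) is hypothetical, not the actual e-mission schema:

```python
import time

# Hypothetical document shape for a periodically regenerated aggregate model.
# AGGREGATE_UUID and all key names are illustrative, not the real schema.
AGGREGATE_UUID = None  # sentinel identity for non-user-specific entries

model_entry = {
    "user_id": AGGREGATE_UUID,
    "metadata": {
        "key": "inference/model",     # made-up key for model entries
        "write_ts": time.time(),      # lets us keep a history of models
    },
    "data": {
        "model_type": "random_forest",
        "serialized_model": "<jsonpickle blob>",  # placeholder payload
    },
}
```

Because each entry carries a `write_ts`, storing the history of periodically generated models (as planned above) falls out for free: querying for the latest entry gives the current model.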
Last thing we need to figure out to finish the new model building part is to decide how we will store the confirmed values. That will allow us to query sections that have been confirmed.
We haven't yet figured out what kinds of edits we want to do, so this is a bit tricky. But in order to move past this really complicated issue, I am going to assume that we only support mode edits/confirmations. We will represent this using an entry of type manual/confirm_mode.
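A hypothetical shape for such an entry — all field names besides the manual/confirm_mode key are illustrative, not the actual schema:

```python
import time

# Hypothetical manual/confirm_mode entry: the user confirms (or corrects)
# the inferred mode for one section. Field names are illustrative only.
confirm_entry = {
    "metadata": {
        "key": "manual/confirm_mode",
        "write_ts": time.time(),
    },
    "data": {
        "start_ts": 1500000000.0,  # matched against the confirmed section
        "end_ts": 1500000600.0,
        "label": "bicycling",      # the user-confirmed mode
    },
}
```

With this in place, "sections that have been confirmed" becomes a query for manual/confirm_mode entries whose time range overlaps the section, which is what the model-building step needs.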
This issue documents the design discussion and choices for updating the mode inference pipeline to the new data model.