Closed: lauralorenz closed this issue 8 years ago.
My proposal is as follows:
These three tasks seem to meet the specification of the requirement.
This will work for now, so long as the corpora are small; there is no memory issue for reads (it's streaming) but too many database queries can slow down performance.
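As a rough sketch of the read pattern (model names and the related fields are assumptions, not the project's actual API), streaming with iterator() keeps memory flat, and select_related() keeps the query count down by joining the related rows up front:

from corpus.models import Annotation  # hypothetical import path

def stream_labeled_documents():
    # iterator() streams rows instead of caching the whole queryset;
    # select_related() fetches document and label in the same query,
    # avoiding one extra query per annotation.
    queryset = Annotation.objects.select_related('document', 'label')
    for annotation in queryset.iterator():
        yield annotation.document, annotation.label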
@lauralorenz -- will this issue block you in any way? I probably won't be able to get to it until next week ...
Nope, it shouldn't block me; I can use the disk read models. This one blocks #15 (which is more cosmetic anyway), and otherwise I think this milestone can be worked on without it.
Figuring out labels according to votes shouldn't be tough ...
select d.id, d.title, l.name, count(l.id) from annotations a
join documents d on a.document_id = d.id
join labels l on a.label_id = l.id
group by d.title, l.name, d.id;
But how to do this in Django?
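One way to express that query with the ORM (a sketch, assuming Annotation has document and label foreign keys as in the SQL above; adjust names to the real models):

from django.db.models import Count
from corpus.models import Annotation  # hypothetical import path

# Count annotations per (document, label) pair, mirroring the SQL;
# ordering by votes descending makes the winning label per document
# easy to pick off.
label_votes = (
    Annotation.objects
    .values('document_id', 'document__title', 'label__name')
    .annotate(votes=Count('id'))
    .order_by('document_id', '-votes')
)

Calling values() before annotate() makes Django group by those fields, which has the same effect as the GROUP BY clause above.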
Ok, models can now be built as follows:
from corpus.models import *
from corpus.reader import *
from corpus.learn import *
from django.contrib.auth.models import User
from sklearn.linear_model import LogisticRegression as model
# Create a corpus for a specific user (database operation)
user = User.objects.get(username='bbengfort')
corpus = Corpus.objects.create(user=user)
# Instantiate a query corpus reader from the corpus model object
# As well as the loader that can use that reader
reader = CorpusModelReader(corpus)
loader = CorpusLoader(reader, folds=2)
# Build the model
(clf, scores), total_time = build_model(loader, model)
That is, we're now building models from the documents that are in the database. Note that we still need something to save the built estimator into the database, and I haven't wired this into a view because it takes minutes to run; but I think this issue is complete.
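For the missing persistence step, one option is simply to pickle the fitted classifier onto the existing arbiter Estimator model (a sketch only; the field names here are assumptions about the schema, not the real one):

import pickle
from arbiter.models import Estimator  # app per the issue description; fields assumed

def save_estimator(clf, corpus):
    # Serialize the fitted scikit-learn classifier and store it on an
    # Estimator row, keeping a reference to the corpus it was built from.
    return Estimator.objects.create(estimator=pickle.dumps(clf), corpus=corpus)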
@lauralorenz discuss then move to done?
@bbengfort Sure. Am I missing something, or where is the branch at?
I merged my branch into develop
@lauralorenz put some inline comments into the commit.
At this point we need to:
In order to do a build in the view (e.g. the user clicks a button) we'd need Celery, and ideally have #13 in place so that we could specify progress (and have a place for that button).
So I'd suggest that after expanding the Django management command, we simply create a new issue for that and call this one good?
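If we do take the Celery route later, the task would presumably just wrap the same build shown above (a sketch, assuming a configured Celery app; the explicit import paths mirror the earlier wildcard imports and may not match the real modules):

from celery import shared_task

@shared_task
def build_corpus_model(corpus_id):
    # Run the build off the request/response cycle so the view can
    # return immediately and #13 can report progress.
    from corpus.models import Corpus
    from corpus.reader import CorpusModelReader
    from corpus.learn import CorpusLoader, build_model
    from sklearn.linear_model import LogisticRegression

    corpus = Corpus.objects.get(pk=corpus_id)
    loader = CorpusLoader(CorpusModelReader(corpus), folds=2)
    (clf, scores), total_time = build_model(loader, LogisticRegression)
    return scores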
Yeah I agree with all of that. Yes let's punt on the view/Celery version for model builds for now.
@lauralorenz -- ok just pushed the release with this. Things should be working but more testing is required. I'll move this to done for right now; let me know if you have any trouble with the CLI.
Right now we have a management command, manage.py train, that supports building a model against a corpus from disk and saving it to the database using the arbiter Estimator and Score models. With this issue, we want to be able to create models against documents in the database, and support the ability for multiple models to use the same documents and know which documents they used.

I think the intended implementation strategy for this is to:

- add a TranscriptCorpusReader that can pull a corpus from the database via a queryset as opposed to from disk
- extend the Estimator model to be able to track an Estimator's dependent documents
- decide how documents relate to each Estimator instance; do we version the documents or each Estimator's input data? Do we care about reproducibility of each Estimator at this granularity?

This issue is closed when a model build process can be run against the stored documents in the database and track which documents were used for each model.
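The document-tracking half could be as small as a many-to-many relation from Estimator to Document that the build process populates (one possible shape, not the actual migration; field and app names are assumptions):

from django.db import models

class Estimator(models.Model):
    # ... existing fields (pickled model, scores, timestamps) elided ...
    # Which documents this estimator was trained on; several estimators
    # can share documents while each still knows its own inputs.
    documents = models.ManyToManyField('corpus.Document', related_name='estimators')

After a build, the loader (or the management command) would add the documents it actually read to estimator.documents, which is what lets us answer "which documents were used for each model" later.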