Closed: lauralorenz closed this issue 8 years ago.
My proposal is as follows:
These three tasks seem to meet the specification of the requirement.
This will work for now, so long as the corpora are small; there is no memory issue for reads (it's streaming) but too many database queries can slow down performance.
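As a rough sketch of the read pattern (model names and the related fields are assumptions, not the project's actual API), streaming with iterator() keeps memory flat, and select_related() keeps the query count down by joining the related rows up front:

from corpus.models import Annotation  # hypothetical import path

def stream_labeled_documents():
    # iterator() streams rows instead of caching the whole queryset;
    # select_related() fetches document and label in the same query,
    # avoiding one extra query per annotation.
    queryset = Annotation.objects.select_related('document', 'label')
    for annotation in queryset.iterator():
        yield annotation.document, annotation.label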
@lauralorenz -- will this issue block you in any way? I probably won't be able to get to it until next week ...
Nope, it shouldn't block me; I can use the disk read models. This one blocks #15 (which is more cosmetic anyway), and otherwise I think this milestone can be worked on without it.
Figuring out labels according to votes shouldn't be tough ...
select d.id, d.title, l.name, count(l.id) from annotations a
join documents d on a.document_id = d.id
join labels l on a.label_id = l.id
group by d.title, l.name, d.id;
But how to do this in Django?
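One way to express that query with the ORM (a sketch, assuming Annotation has document and label foreign keys as in the SQL above; adjust names to the real models):

from django.db.models import Count
from corpus.models import Annotation  # hypothetical import path

# Count annotations per (document, label) pair, mirroring the SQL;
# ordering by votes descending makes the winning label per document
# easy to pick off.
label_votes = (
    Annotation.objects
    .values('document_id', 'document__title', 'label__name')
    .annotate(votes=Count('id'))
    .order_by('document_id', '-votes')
)

Calling values() before annotate() makes Django group by those fields, which has the same effect as the GROUP BY clause above.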
Ok, models can now be built as follows:
from corpus.models import *
from corpus.reader import *
from corpus.learn import *
from django.contrib.auth.models import User
from sklearn.linear_model import LogisticRegression as model
# Create a corpus for a specific user (database operation)
user = User.objects.get(username='bbengfort')
corpus = Corpus.objects.create(user=user)
# Instantiate a query corpus reader from the corpus model object
# As well as the loader that can use that reader
reader = CorpusModelReader(corpus)
loader = CorpusLoader(reader, folds=2)
# Build the model
(clf, scores), total_time = build_model(loader, model)
That is, we're now building models from the documents that are in the database. Note that we still need something to save the built estimator into the database, and I haven't wired this into a view because it takes minutes to run; but I think this issue is complete.
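For the missing persistence step, one option is simply to pickle the fitted classifier onto the existing arbiter Estimator model (a sketch only; the field names here are assumptions about the schema, not the real one):

import pickle
from arbiter.models import Estimator  # app per the issue description; fields assumed

def save_estimator(clf, corpus):
    # Serialize the fitted scikit-learn classifier and store it on an
    # Estimator row, keeping a reference to the corpus it was built from.
    return Estimator.objects.create(estimator=pickle.dumps(clf), corpus=corpus)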
@lauralorenz discuss then move to done?
@bbengfort Sure. Am I missing something, or where is the branch at?
I merged my branch into develop
@lauralorenz put some inline comments into the commit.
At this point we need to:
In order to do a build in the view (e.g. the user clicks a button) we'd need Celery, and ideally have #13 in place so that we could specify progress (and have a place for that button).
So I'd suggest that after expanding the Django management command, we simply create a new issue for that and call this one good?
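If we do take the Celery route later, the task would presumably just wrap the same build shown above (a sketch, assuming a configured Celery app; the explicit import paths mirror the earlier wildcard imports and may not match the real modules):

from celery import shared_task

@shared_task
def build_corpus_model(corpus_id):
    # Run the build off the request/response cycle so the view can
    # return immediately and #13 can report progress.
    from corpus.models import Corpus
    from corpus.reader import CorpusModelReader
    from corpus.learn import CorpusLoader, build_model
    from sklearn.linear_model import LogisticRegression

    corpus = Corpus.objects.get(pk=corpus_id)
    loader = CorpusLoader(CorpusModelReader(corpus), folds=2)
    (clf, scores), total_time = build_model(loader, LogisticRegression)
    return scores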
Yeah I agree with all of that. Yes let's punt on the view/Celery version for model builds for now.
@lauralorenz -- ok just pushed the release with this. Things should be working but more testing is required. I'll move this to done for right now; let me know if you have any trouble with the CLI.
Right now we have a management command, manage.py train, that supports building a model against a corpus from disk and saving it to the database using the arbiter Estimator and Score models. With this issue, we want to be able to create models against documents in the database, and support the ability for multiple models to use the same documents and know which documents they used.

I think the intended implementation strategy for this is to:

- add a TranscriptCorpusReader that can pull a corpus from the database via a queryset as opposed to from disk
- extend the Estimator model to be able to track an Estimator's dependent documents
- decide how documents relate to each Estimator instance; do we version the documents or each Estimator's input data? Do we care about reproducibility of each Estimator at this granularity?

This issue is closed when a model build process can be run against the stored documents in the database and track which documents were used for each model.
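The document-tracking half could be as small as a many-to-many relation from Estimator to Document that the build process populates (one possible shape, not the actual migration; field and app names are assumptions):

from django.db import models

class Estimator(models.Model):
    # ... existing fields (pickled model, scores, timestamps) elided ...
    # Which documents this estimator was trained on; several estimators
    # can share documents while each still knows its own inputs.
    documents = models.ManyToManyField('corpus.Document', related_name='estimators')

After a build, the loader (or the management command) would add the documents it actually read to estimator.documents, which is what lets us answer "which documents were used for each model" later.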