Wordseer / wordseer

The WordSeer text analysis tool, written in Flask.
http://wordseer.berkeley.edu/

Lots of calls in preprocessor aren't limited to a specific project #154

Closed: abendebury closed this issue 10 years ago

abendebury commented 10 years ago

Basically, every query should be limited to the correct project.

keien commented 10 years ago

How do you feel about separate databases for each collection? It's the simplest and easiest solution.

keien commented 10 years ago

Actually, there are still issues with creating a database on the fly, with users, etc., but it's a viable option.

abendebury commented 10 years ago

I don't know, that sounds like a lot of trouble... didn't you mention something about limiting queries to a project?

keien commented 10 years ago

Yeah, so I'm thinking something like this: we add a project_id column to things like word_in_sentence, dependency_in_sentence, sequence_in_sentence, etc. Then we either use alternate joins or just write methods to scope the queries. Either way, the scoping would use an active_project variable, accessible across requests (probably in the Project model), that stores the active project ID.
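A minimal sketch of the "project_id on the association table plus a scoping method" idea, for readers following along. It uses the standard-library sqlite3 module rather than the project's actual models, and the column layout, the helper name `sentences_for_word`, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: only the table names come from the discussion.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE sentence (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE word_in_sentence (
        word_id INTEGER REFERENCES word(id),
        sentence_id INTEGER REFERENCES sentence(id),
        project_id INTEGER NOT NULL  -- the extra scoping column
    );
""")

def sentences_for_word(conn, word_id, project_id):
    """Return the sentences containing a word, limited to one project."""
    rows = conn.execute(
        "SELECT s.text FROM sentence AS s "
        "JOIN word_in_sentence AS wis ON wis.sentence_id = s.id "
        "WHERE wis.word_id = ? AND wis.project_id = ? "
        "ORDER BY s.id",
        (word_id, project_id),
    )
    return [text for (text,) in rows]

# The same word appears in two projects (ids 10 and 20 are made up).
conn.execute("INSERT INTO word VALUES (1, 'whale')")
conn.executemany(
    "INSERT INTO sentence VALUES (?, ?)",
    [(1, "The whale dived."), (2, "The whale surfaced.")],
)
conn.executemany(
    "INSERT INTO word_in_sentence VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 20)],
)
```

Here the project ID is passed in explicitly, so `sentences_for_word(conn, 1, 10)` only sees the sentence linked under project 10 even though the word row is shared.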

abendebury commented 10 years ago

active_project doesn't seem like a good idea if we end up processing multiple projects concurrently. I think the alternate joins are a much better solution.

keien commented 10 years ago

Well, the alternate joins still need an active_project variable to scope with. What do you mean by processing multiple projects simultaneously? Like doing comparisons across projects? Because that could turn ugly very quickly.

abendebury commented 10 years ago

No, I mean if we're preprocessing two projects at once, wouldn't it be an issue that there are two "active" projects?
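To make the concern concrete, here is a small sketch assuming the two preprocessing runs are threads sharing one process. The module-level `active_project` variable and both worker functions are hypothetical stand-ins, not code from the repository. A barrier forces both workers to write the shared variable before either reads it, so the clash shows up every time instead of intermittently.

```python
import threading

active_project = None  # hypothetical shared "active project" state

def preprocess_with_global(project_id, barrier, seen):
    """Worker that records its project in shared state, then reads it back."""
    global active_project
    active_project = project_id
    barrier.wait()  # both workers have written the variable by now
    seen[project_id] = active_project

def preprocess_with_argument(project_id, barrier, seen):
    """Worker that keeps the project ID as a local argument instead."""
    barrier.wait()
    seen[project_id] = project_id

def run(worker):
    """Run the worker for projects 1 and 2 at the same time."""
    barrier = threading.Barrier(2)
    seen = {}
    threads = [
        threading.Thread(target=worker, args=(pid, barrier, seen))
        for pid in (1, 2)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return seen

shared = run(preprocess_with_global)
explicit = run(preprocess_with_argument)
```

With the shared variable, both workers end up reading the same ID, so one of them is scoped to the wrong project; passing the ID around explicitly avoids that.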

keien commented 10 years ago

I'm not exactly sure how we'd even preprocess two projects at once. I haven't implemented any kind of threading to do anything like that.

abendebury commented 10 years ago

I did, from the front end.

keien commented 10 years ago

Oh I see. Well, as long as we note it somewhere, it shouldn't be a problem because we don't make any queries that concern this issue in the preprocessor. I just need to change the sentence.add_[stuff] methods to also take in a project, and pass around the project ID wherever necessary.
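A rough sketch of what "add methods that also take a project" could look like, using plain dataclasses instead of the real models. The class names, the `position` field, and the `words_in` helper are assumptions for illustration; only the idea of passing the project into the `add_` method comes from the comment above.

```python
from dataclasses import dataclass, field

@dataclass
class WordInSentence:
    """Hypothetical association record linking a word to a sentence."""
    word: str
    sentence_id: int
    project_id: int
    position: int

@dataclass
class Sentence:
    id: int
    text: str
    word_links: list = field(default_factory=list)

    def add_word(self, word, position, project_id):
        """Link a word to this sentence, recording the owning project."""
        link = WordInSentence(word, self.id, project_id, position)
        self.word_links.append(link)
        return link

    def words_in(self, project_id):
        """Return this sentence's words for one project only."""
        return [
            link.word
            for link in self.word_links
            if link.project_id == project_id
        ]

sentence = Sentence(1, "The whale surfaced.")
first = sentence.add_word("whale", position=1, project_id=10)
sentence.add_word("whale", position=1, project_id=20)
```

Since every link needs a project (and here a position), it is built through the method rather than by appending a bare word to a list.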

abendebury commented 10 years ago

What do you mean when you say that we don't make queries that concern this in the preprocessor? Aren't we discussing the queries in the preprocessor?

keien commented 10 years ago

As far as I know, we never use word.sentences, word.sequences and the like, because we mostly only do writes in the preprocessor. It's more of a concern in the main application.

keien commented 10 years ago

Also, alternate joins aren't working out too well, so I might just write them as methods if I can't get them to work soon.

abendebury commented 10 years ago

Yeah, that sounds fine.

keien commented 10 years ago

Fixed in f4d7de29aa2a1b1dc7881c411b03a4764487daf0, but unit tests are failing because StringProcessor and SequenceProcessor now require a project in their constructors. I tried and failed to update the unit tests to properly register the change, so @PlasmaSheep, if you could take a look, it'd be appreciated.

On a side note, since the relevant calls (word.sentences, word.sequences) are no longer relationships, we can't treat them like lists anymore; I've updated testmodels to reflect this change. In any case, because the association objects have additional fields, we should never have been using the object.relationship = [items] and object.relationship.append(item) syntax anyway.

abendebury commented 10 years ago

Fixed everything except for the CollectionProcessor error. You removed the line that called counter, but presumably we still need to do that, so I'll leave the unit test failing as a reminder.

keien commented 10 years ago

I changed the way counts are computed; sentence counts are now done on the fly. Aditi did say that counting afterwards was faster for her, but because of our implementations it's actually much faster to count on the fly. I've yet to figure out how to do document counts more quickly, though.
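For illustration, the two counting strategies can be sketched as below. This assumes "sentence count" means the number of sentences each word appears in, which the thread doesn't spell out, and both function names are made up. It only shows that the two approaches agree on the result; it makes no claim about the real preprocessor's timings.

```python
from collections import Counter

def count_on_the_fly(sentences):
    """Update the counts once, as each sentence is processed."""
    counts = Counter()
    for words in sentences:
        # Use a set so a word repeated within a sentence counts once.
        counts.update(set(words))
    return counts

def count_afterwards(sentences):
    """Recount from scratch after all sentences have been stored."""
    vocabulary = {word for words in sentences for word in words}
    return Counter({
        word: sum(1 for words in sentences if word in words)
        for word in vocabulary
    })

sentences = [["the", "whale", "the"], ["the", "ship"]]
on_the_fly = count_on_the_fly(sentences)
afterwards = count_afterwards(sentences)
```

The on-the-fly version touches each sentence once, while the after-the-fact version in this sketch rescans every sentence per word.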