allenai / ike

Build tables of information by extracting facts from indexed text corpora via a simple and effective query language.
http://allenai.org/software/interactive-knowledge-extraction/
Apache License 2.0

Implement batch mode via the CLI. #224

Closed: ckarenz closed this 8 years ago

ckarenz commented 8 years ago

Adding @sbhaktha, @schmmd, @rodneykinney as FYI.

rodneykinney commented 8 years ago

Rock on!

dirkgr commented 8 years ago

Tests failed?

ckarenz commented 8 years ago

The tests failed on code formatting, but the whole project suffers from formatting issues. The next change will be pretty noisy (pure formatting changes) and can safely be ignored.

dirkgr commented 8 years ago

The main concern I have with this is that it doesn't know whether it wants to run on Spark or locally. If it's local-only, it needs no Spark at all; everything could be done with .par. If it needs to be distributed, it needs some more work.
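For illustration, here is a minimal sketch of the local-only option using Scala parallel collections. The Sentence and Extraction types and the extractFromSentence function are hypothetical stand-ins, not ike's actual API; it only shows the .par idea.

```scala
// Minimal sketch of a local-only batch run using Scala parallel collections.
// Sentence, Extraction, and extractFromSentence are hypothetical placeholders,
// not ike's real types.
object LocalBatch {
  case class Sentence(text: String)
  case class Extraction(subject: String, relation: String, obj: String)

  // Hypothetical per-sentence extraction step.
  def extractFromSentence(s: Sentence): Seq[Extraction] = Seq.empty

  def run(sentences: Seq[Sentence]): Seq[Extraction] =
    // .par spreads the flatMap across all cores of a single machine.
    sentences.par.flatMap(extractFromSentence).seq
}
```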

ckarenz commented 8 years ago

I agree with the principle. Our line of thinking was as follows:

I can remove Spark, but it will probably require a significant rework.

rodneykinney commented 8 years ago

Using .par is not enough because there is also a groupBy that causes an out-of-memory error in some cases.

dirkgr commented 8 years ago

> Using .par is not enough because there is also a groupBy that causes an out-of-memory error in some cases.

You can give it more memory. Efficiency-wise, it's competing with a whole Spark cluster, right? That's a lot of memory.

To put it in Spark terms, you're doing some fairly heavy lifting on the driver, which you aren't supposed to do, according to Spark doctrine.

I realize it's hard to switch in either direction. If you want to merge anyway, I guess that's fine. I can also see how you would end up with this solution during a hackathon. But for production, that's a pretty big design problem.

rodneykinney commented 8 years ago

This isn't running on a cluster; it's running in local mode. I think it's a fair use of Spark to do grouping, sorting, and aggregation on a single machine for datasets that don't fit in memory.
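For reference, a rough sketch of that local-mode usage, assuming a hypothetical Extraction type and key field: Spark runs on one machine with local[*] and aggregates by key instead of grouping everything in memory on the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of Spark in local mode doing a groupBy-style aggregation on a single
// machine. Extraction and its fields are hypothetical; the point is that Spark
// can spill to disk when the grouped data does not fit in memory.
object LocalSparkBatch {
  case class Extraction(key: String, value: String)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ike-batch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val extractions = sc.parallelize(Seq(
      Extraction("a", "x"), Extraction("a", "y"), Extraction("b", "z")))

    // reduceByKey aggregates per key without materializing whole groups at
    // once, unlike a plain in-memory groupBy on a Scala collection.
    val counts = extractions
      .map(e => (e.key, 1L))
      .reduceByKey(_ + _)

    counts.collect().foreach { case (k, n) => println(s"$k\t$n") }
    sc.stop()
  }
}
```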

ckarenz commented 8 years ago

I can't get these tests to pass, and it looks to be an issue with Semaphore. I've even tried commenting out all the tests in the failing TestQuerySuggester, and Semaphore still reports that test as "failed". It sounds like we're pretty much out of ideas at this point, so I'm going to give up on this PR.

@sbhaktha: I've pushed all these changes to the hackathon branch so we at least have them somewhere. Can you take a look when you're back?