Closed ckarenz closed 8 years ago
Adding @sbhaktha, @schmmd, @rodneykinney as FYI.
Rock on!
Tests failed?
The tests failed on code formatting, but the whole project suffers from formatting issues. The next change will be pretty noisy (pure formatting), but can safely be ignored.
The main concern I have with this is that it doesn't know whether it wants to run in Spark or locally. If it's local-only, it needs no Spark at all. Everything could be done with par
. If it needs to be distributed, it needs some work.
I agree with the principle. Our line of thinking follows:
I can remove the Spark, but it will probably be a significant rework.
Using .par
is not enough because there is also a groupBy
that causes an out-of-memory error in some cases.
Using
.par
is not enough because there is also agroupBy
that causes an out-of-memory error in some cases.
You can give it more memory. Efficiency wise, it's competing with a whole spark cluster, right? That's a lot of memory.
To put it in Spark terms, you're doing some fairly heavy lifting on the driver, which you aren't supposed to do, according to Spark doctrine.
I realize it's hard to do a switch in either direction. If you want to merge anyways, I guess that's fine. Also, I can see how you would end up with this solution during a hackathon. But for production, that's a pretty big design problem.
This isn't running on a cluster; it's running in local mode. I think it's a fair use of Spark to do grouping, sorting, and aggregation on a single machine for datasets that don't fit in memory.
I can't get these tests to pass, and it looks to be an issue with Semaphore. I've even tried commenting out all the tests in the failing TestQuerySuggester
and Semaphore still reports that test as "failed". It sounds like we're pretty much out of ideas at this point, so I'm going to give up on this PR.
@sbhaktha: I've pushed all these changes to the hackathon
branch so we at least have them somewhere. Can you take a look when you're back?