commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Spark-submit uses only 1 core. #46

Closed IvRRimum closed 8 years ago

IvRRimum commented 8 years ago

So I recently deployed the project on AWS and was surprised by the low performance of the indexer. I investigated and found that the Spark indexer only uses 1 of all the available cores (at 100%). Why is that?

I can't seem to figure out how to fix this. Any ideas?

IvRRimum commented 8 years ago

It turns out it's Python :O

It needs multiprocessing to use more than one core:

http://stackoverflow.com/questions/5784389/using-100-of-all-cores-with-python-multiprocessing
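For reference, the linked answer boils down to using worker processes instead of threads: CPython's GIL lets only one thread execute Python bytecode at a time, so CPU-bound threads stay on one core, while a process pool can occupy all of them. A minimal sketch (`parse_page` is a hypothetical stand-in for CPU-bound page analysis, not a cosr-back function):

```python
from multiprocessing import Pool

def parse_page(n):
    # Hypothetical stand-in for CPU-bound page analysis work.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Each worker is a separate process with its own GIL,
    # so the map below can use up to 4 cores at once.
    with Pool(processes=4) as pool:
        results = pool.map(parse_page, [200_000] * 8)
    print(len(results))
```

(In Spark's case the parallelism normally comes from scheduling multiple tasks rather than from doing this by hand, as discussed below.)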

sylvinus commented 8 years ago

Spark should be able to run multiple processes on the same machine; there must be an issue in the configuration.

sylvinus commented 8 years ago

@IvRRimum are you indexing multiple Common Crawl files?

`spark-submit jobs/spark/index.py --source commoncrawl:limit=1` will only use 1 core because it will generate only 1 Spark task.

However, I can confirm that with the local Docker image, `spark-submit jobs/spark/index.py --source commoncrawl:limit=4` does use more than 1 core at once.

If your AWS deployment is only using 1 core, I'd be curious to know more about the Spark config you're using. I'm planning on testing https://github.com/nchammas/flintrock and possibly integrating it into cosr-ops to make deployment easy for everyone.
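To illustrate the two knobs discussed above, here is a hedged sketch of how to rule out the configuration side (the `--master` value is an assumption about a local/standalone setup, not the project's documented deployment):

```shell
# Ask Spark for one worker thread per local core explicitly;
# 'local[*]' means "use all available cores" in local mode.
spark-submit --master 'local[*]' jobs/spark/index.py --source commoncrawl:limit=4

# Even with all cores available, parallelism is capped by the task count:
# limit=1 produces a single task, so it can never use more than 1 core.
spark-submit --master 'local[*]' jobs/spark/index.py --source commoncrawl:limit=1
```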

sylvinus commented 8 years ago

Closing this until further confirmation, feel free to reopen if necessary.