commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Error running index job #37

Closed chaconnewu closed 8 years ago

chaconnewu commented 8 years ago

When I run

```
spark-submit jobs/spark/index.py --warc_limit 1 --only_homepages --profile
```

as described in README.md, the following error appears:

```
16/03/15 07:15:18 INFO BlockManagerMaster: Registered BlockManager
Traceback (most recent call last):
  File "/cosr/back/jobs/spark/index.py", line 174, in <module>
    spark_main()
  File "/cosr/back/jobs/spark/index.py", line 145, in spark_main
    warc_filenames = list_warc_filenames()
  File "/cosr/back/jobs/spark/index.py", line 72, in list_warc_filenames
    warc_files = list_commoncrawl_warc_filenames(limit=args.warc_limit, skip=args.warc_skip)
  File "/cosr/back/cosrlib/webarchive.py", line 22, in list_commoncrawl_warc_filenames
    with open(warc_paths, "r") as f:
IOError: [Errno 2] No such file or directory: '/cosr/back/local-data/common-crawl/warc.paths.txt'
```

sylvinus commented 8 years ago

@chaconnewu sorry about that! there's a step missing in the docs:

```
./scripts/import_commoncrawl.sh 0
```

Can you confirm it fixes the issue?

chaconnewu commented 8 years ago

Thanks @sylvinus! I confirm it fixes the issue. However, another error appears after the job has been running for a while:

```
Caught Python exception in generator!
Traceback (most recent call last):
  File "/cosr/back/cosrlib/utils.py", line 26, in wrapped
    for x in fn(*args, **kwargs):
  File "/cosr/back/jobs/spark/index.py", line 87, in iter_records
    warc_file = open_warc_file(filename, from_commoncrawl=(not args.warc_files))
  File "cosrlib/webarchive.py", line 39, in open_warc_file
    pds = conn.get_bucket('aws-publicdatasets')
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 503, in get_bucket
    return self.head_bucket(bucket_name, headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 522, in head_bucket
    response = self.make_request('HEAD', bucket_name, headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 665, in make_request
    retry_handler=retry_handler
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1071, in make_request
    retry_handler=retry_handler)
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1030, in _mexe
    raise ex
gaierror: [Errno -2] Name or service not known
```
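The `gaierror` here means DNS resolution failed while boto tried to reach the `aws-publicdatasets` S3 bucket, which is usually a transient network or container-DNS problem rather than a code bug. One generic way to harden such a call is to retry on lookup failures; a sketch, assuming nothing about cosr-back's actual fix (the helper name and retry policy are purely illustrative):

```python
import socket
import time

def retry_on_dns_failure(fn, attempts=3, delay=1.0):
    """Call fn(), retrying if DNS resolution fails transiently.

    socket.gaierror is the exception behind "Name or service not known".
    """
    for attempt in range(attempts):
        try:
            return fn()
        except socket.gaierror:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
```

For example, the bucket lookup could be wrapped as `retry_on_dns_failure(lambda: conn.get_bucket('aws-publicdatasets'))`.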

sylvinus commented 8 years ago

I've pushed a tentative fix: https://github.com/commonsearch/cosr-back/commit/30f7aff9252a96dd34e7d2a8d5fdc5acd56687e4

Can you pull and see if it works?

chaconnewu commented 8 years ago

Yes, it worked!

This is the output after indexing:

```
16/03/16 03:06:57 INFO PythonRunner: Times: total = 437675, boot = 144, init = 329, finish = 437202
16/03/16 03:06:57 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 118012 bytes result sent to driver
16/03/16 03:06:57 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 437930 ms on localhost (1/1)
16/03/16 03:06:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/03/16 03:06:57 INFO DAGScheduler: ResultStage 0 (count at /cosr/back/jobs/spark/index.py:165) finished in 437.994 s
16/03/16 03:06:57 INFO DAGScheduler: Job 0 finished: count at /cosr/back/jobs/spark/index.py:165, took 438.235379 s
Indexed 695 WARC records
```

sylvinus commented 8 years ago

Awesome :)