Closed Andyccs closed 8 years ago
Currently, the recrawler does not always work correctly. Here is the log from a failed run:
[03/Apr/2016 08:01:31] "GET /recrawler-service/recrawl HTTP/1.0" 200 0
recrawling tweets from NBATV
crawling NBATV tweeter timeline
crawling page 1 of NBATV
recrawling tweets from TheNBACentral
crawling TheNBACentral tweeter timeline
crawling page 1 of TheNBACentral
recrawling tweets from espn
crawling espn tweeter timeline
crawling page 1 of espn
recrawling tweets from ESPNNBA
crawling ESPNNBA tweeter timeline
crawling page 1 of ESPNNBA
recrawling tweets from SimpleNBAScores
crawling SimpleNBAScores tweeter timeline
crawling page 1 of SimpleNBAScores
Exception in thread Thread-15:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/src/app/recrawler/recrawl/views.py", line 53, in background_process
    classify_data.classify_data()
  File "/usr/src/app/classifier/classify_data.py", line 16, in classify_data
    with open('data/' + filename + '_data.json') as json_file:
IOError: [Errno 2] No such file or directory: 'data/espn_data.json'
Apparently, the recrawler tries to classify the data right away instead of waiting for the crawler to finish, so data/espn_data.json does not exist yet when classify_data opens it.
This happens most of the time, but I have had some successful recrawls before.
I am pretty sure it is not working at all now, since we added a new field called label to the Solr schema.
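One possible way to remove the race, as a rough sketch only: start the crawls, join every crawl thread, and only then call the classifier. I am assuming here that each account is crawled in its own thread; the names crawl_all, crawl_one and classify below are hypothetical stand-ins, not the actual functions in views.py or classify_data.py.

```python
import threading


def crawl_all(usernames, crawl_one):
    """Run crawl_one(username) in one thread per username and wait for all.

    crawl_one is a hypothetical callable standing in for the real crawler.
    """
    threads = [
        threading.Thread(target=crawl_one, args=(name,)) for name in usernames
    ]
    for thread in threads:
        thread.start()
    # join() blocks until the thread has finished, so once this loop ends
    # every crawl has completed (or raised inside its own thread).
    for thread in threads:
        thread.join()


def background_process(usernames, crawl_one, classify):
    """Crawl every account first, then classify.

    classify is a hypothetical stand-in for classify_data.classify_data().
    """
    crawl_all(usernames, crawl_one)
    classify()
```

With this ordering the classifier cannot start before the crawl threads have returned. It does not cover the case where a crawl fails and never writes its file, so classify_data would still need to handle a missing data file.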