Andyccs / sport-news-retrieval

MIT License

Update recrawler #10

Closed Andyccs closed 8 years ago

Andyccs commented 8 years ago

I am pretty sure the recrawler is not working now, since we added a new field called label to the Solr schema.

Andyccs commented 8 years ago

Currently, the recrawler does not always work correctly:

[03/Apr/2016 08:01:31] "GET /recrawler-service/recrawl HTTP/1.0" 200 0
recrawling tweets from NBATV
crawling NBATV tweeter timeline
crawling page 1 of NBATV
recrawling tweets from TheNBACentral
crawling TheNBACentral tweeter timeline
crawling page 1 of TheNBACentral
recrawling tweets from espn
crawling espn tweeter timeline
crawling page 1 of espn
recrawling tweets from ESPNNBA
crawling ESPNNBA tweeter timeline
crawling page 1 of ESPNNBA
recrawling tweets from SimpleNBAScores
crawling SimpleNBAScores tweeter timeline
crawling page 1 of SimpleNBAScores
Exception in thread Thread-15:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/src/app/recrawler/recrawl/views.py", line 53, in background_process
    classify_data.classify_data()
  File "/usr/src/app/classifier/classify_data.py", line 16, in classify_data
    with open('data/' + filename + '_data.json') as json_file:
IOError: [Errno 2] No such file or directory: 'data/espn_data.json'

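Apparently, the recrawler tried to classify the data before the crawler had finished crawling, instead of waiting for it. That is why data/espn_data.json did not exist yet when classify_data tried to open it.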
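One possible fix, as a minimal sketch only: start all the crawl threads, join every one of them, and only then run the classifier. The names `recrawl_then_classify`, `crawl`, and `classify` below are hypothetical stand-ins, not the actual functions in recrawl/views.py, and the sketch assumes each account is crawled in its own thread.

```python
import threading


def recrawl_then_classify(accounts, crawl, classify):
    """Crawl every account in its own thread, then classify.

    `crawl` is called once per account name and `classify` is called once
    with no arguments. Both are hypothetical placeholders for the real
    crawler and classifier entry points.
    """
    threads = [
        threading.Thread(target=crawl, args=(account,))
        for account in accounts
    ]
    for thread in threads:
        thread.start()
    # Block until every crawl has finished, so the data files exist
    # before the classifier tries to open them.
    for thread in threads:
        thread.join()
    return classify()
```

With this ordering, the classifier cannot start while any crawl is still running, which should avoid the IOError above as long as each crawl actually writes its data file before returning.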

Andyccs commented 8 years ago

The above situation happens most of the time, but I have had some successful recrawls before.