MichaelAquilina / Reddit-Recommender-Bot

Identifying Interesting Documents for Reddit using Recommender Techniques

post_extractor gets stuck downloading physics data #47

Closed MichaelAquilina closed 10 years ago

MichaelAquilina commented 10 years ago

Test Case:

post_extractor.py physics ../Reddit-Training-Data --limit 600 --threads 40 --period all --filter top

The tool gets stuck at around round 450 while downloading a massive file. This prevents the producer (main) thread from pushing any further data down the pipeline, so the whole tool hangs. There is some WIP in the alternative-post-extractor branch that attempts to detect when a file is too large, but this does not seem to solve the issue yet. There could be a more fundamental problem (possibly even within the requests module).
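For reference, one way to guard against oversized or stalled downloads is to stream the response and abort once a size cap is exceeded. The sketch below is only an illustration, not the code from the alternative-post-extractor branch: `read_capped`, `fetch_capped`, `TooLargeError`, the 5 MB cap and the timeout values are all hypothetical. Note that requests applies no timeout unless one is passed explicitly, so a stalled connection can block a worker indefinitely.

```python
class TooLargeError(Exception):
    """Raised when a download exceeds the configured size cap."""


def read_capped(chunks, max_bytes):
    """Join byte chunks, aborting as soon as the total exceeds max_bytes."""
    total = 0
    parts = []
    for chunk in chunks:
        total += len(chunk)
        if total > max_bytes:
            raise TooLargeError("body exceeded %d bytes" % max_bytes)
        parts.append(chunk)
    return b"".join(parts)


def fetch_capped(url, max_bytes=5 * 1024 * 1024, timeout=(5, 30)):
    """Hypothetical helper: stream a URL with requests and cap its size."""
    import requests  # imported lazily so read_capped works without it

    # stream=True defers downloading the body until it is iterated;
    # timeout=(connect, read) bounds how long each socket operation may block.
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        return read_capped(response.iter_content(chunk_size=8192), max_bytes)
```

The read timeout limits the wait between bytes, not the total download time, which is why the size cap is checked separately while streaming.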

MichaelAquilina commented 10 years ago

The same thing seems to be happening with android at around round 580:

post_extractor.py android ../Reddit-Training-Data --limit 600 --threads 40 --period all --filter top

MichaelAquilina commented 10 years ago

I can't seem to figure out the cause of this freezing!

MichaelAquilina commented 10 years ago

The same thing seems to happen with /r/bristol.

MichaelAquilina commented 10 years ago

/r/computerscience too

There seems to be a fundamental flaw in the code. Unfortunately, this is very hard to identify given its threaded nature.
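One generic way to narrow down a freeze in threaded code is to print where every thread is currently executing once the tool hangs. The sketch below is an assumption about how that could be done here, not something taken from the repository: `dump_thread_stacks` is a hypothetical helper built on the standard library's `sys._current_frames()`.

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return a text report of where every live thread is currently executing."""
    # Map thread identifiers to readable names for the report.
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append("Thread %s (%s):" % (names.get(ident, "?"), ident))
        lines.extend(line.rstrip() for line in traceback.format_stack(frame))
    return "\n".join(lines)
```

Calling this from a watchdog thread or a signal handler while the tool is stuck would show whether the workers are blocked inside a network read, on a queue, or on a lock.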

MichaelAquilina commented 10 years ago

Changing the number of threads does not seem to affect this (tested with --threads 4).

MichaelAquilina commented 10 years ago

Check out Reddit-Testing-Data/subreddits: there is a HUGE bunch of subreddit files in there. The issue is probably stemming from them!

MichaelAquilina commented 10 years ago

You should probably change the subreddit .json files so that only useful information is kept.

MichaelAquilina commented 10 years ago

The problem still seems to be occurring. There seems to be a more fundamental issue with the way data is being downloaded.

Test Case: /r/atheism