MichaelAquilina closed 10 years ago
Same thing seems to be happening with android at ~580:
post_extractor.py android ../Reddit-Training-Data --limit 600 --threads 40 --period all --filter top
I can't seem to figure out the cause of this freezing!
Same thing seems to happen with /r/bristol
/r/computerscience too
There seems to be a fundamental flaw in the code. Unfortunately, this is very hard to identify given its threaded nature.
Changing the number of threads does not seem to affect this (tested with --threads 4).
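One way to narrow down a freeze in threaded code is to dump the stack of every live thread once the tool stops making progress. The following is only a rough sketch using the standard library; `format_thread_stacks` is a hypothetical helper name and is not part of post_extractor.py.

```python
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a text dump of the current stack of every live thread."""
    # Map thread identifiers to readable names where they are known.
    names = {t.ident: t.name for t in threading.enumerate()}
    parts = []
    # sys._current_frames() maps each thread id to its topmost frame.
    for ident, frame in sys._current_frames().items():
        parts.append("Thread %s (%s)" % (names.get(ident, "?"), ident))
        parts.append("".join(traceback.format_stack(frame)))
    return "\n".join(parts)
```

Calling this from a watchdog (for example a `threading.Timer` that fires after a few minutes) and printing the result should show which call each worker and the main thread are blocked in.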
Check out Reddit-Testing-Data/subreddits: there is a HUGE bunch of subreddit files. The issue is probably stemming from there!
You should probably change the subreddit .json files so that only useful information is kept.
Problem still seems to be occurring. There seems to be a more fundamental issue with the way data is being downloaded.
Test Case /r/atheism
Test Case:
The tool gets stuck at round ~450 downloading a massive file. This prevents the producer (main) thread from pushing any further data down the pipeline and causes the tool to become stuck. There is some WIP in the alternative-post-extractor branch that attempts to detect when a file is too large, but this does not seem to solve the issue yet. There could be a more fundamental problem (possibly even within the requests module).
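For reference, a minimal sketch of one way to bound a download, assuming the files are fetched with requests. This is not the code in the alternative-post-extractor branch; `read_capped`, `fetch_capped`, `TooLarge` and the default limits are all hypothetical. Note that requests applies no timeout unless one is passed, so a stalled connection can block the calling thread indefinitely.

```python
class TooLarge(Exception):
    """Raised when a download exceeds the allowed size."""


def read_capped(chunks, max_bytes):
    """Join byte chunks, giving up as soon as the total exceeds max_bytes."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise TooLarge("response exceeded %d bytes" % max_bytes)
    return bytes(buf)


def fetch_capped(url, max_bytes=5 * 1024 * 1024, timeout=(5, 30)):
    """Download url with a (connect, read) timeout and a size cap."""
    # Third-party dependency; imported here so read_capped works without it.
    import requests

    # stream=True defers the body so it can be read chunk by chunk.
    with requests.get(url, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        return read_capped(resp.iter_content(chunk_size=64 * 1024), max_bytes)
```

The read timeout limits the wait between received bytes, not the total download time, so the size cap is still needed for a large file that keeps arriving slowly.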