sabjoslo closed 7 years ago
I did the same thing you suggested in the last version of the code. Please see the pull request.
What files/lines is the code for regex matching in?
Lines 26 and 40 in parsereddit.py
So what it looks like you're doing now is:
My suggestion would be a change in the step that creates the data files. I guess I'm actually making two suggestions: what is filtered and where it is filtered.

What: Filter posts instead of comments. E.g., if a post matches the regex, analyze all the comments on that post.

Where: Filter before saving anything to file, e.g. before the creation of a file like RC_2005-12.bz2. I would take the regex you've written and use it to select which of the API's responses we want to save.
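To make the "filter before saving" idea concrete, here is a minimal sketch. It assumes each API response has already been parsed into a list of post dicts with a `title` field; the pattern, the field name, and the helper names `filter_posts` and `save_posts` are all hypothetical and not taken from parsereddit.py.

```python
import bz2
import json
import re

# Hypothetical pattern for illustration; the real regex lives in parsereddit.py.
PATTERN = re.compile(r"\b(election|senate)\b", re.IGNORECASE)


def filter_posts(posts, pattern=PATTERN, field="title"):
    """Keep only the posts whose text field matches the regex."""
    # `or ""` guards against a missing field or an explicit None value.
    return [p for p in posts if pattern.search(p.get(field) or "")]


def save_posts(posts, path):
    """Write one JSON object per line to a bz2-compressed file."""
    with bz2.open(path, "wt", encoding="utf-8") as fh:
        for post in posts:
            fh.write(json.dumps(post) + "\n")
```

Calling `save_posts(filter_posts(posts), path)` would mean that only matching posts ever reach disk, so nothing downstream has to re-filter.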
In the end, the flow would look something like:
Let me know your thoughts and if any of that isn't clear.
Right. Sorry, I didn't read your original comment carefully enough. I'm not thick, believe me lol
OK, I'll write and post something that, for 9 news sources, does something like:

    for SOURCE in SOURCES:
        POSTS=<all the posts made by SOURCE that **match the regex**>
        for POST in POSTS:
            COMMENTS=<all comments on POST>
            saveToFile(COMMENTS)
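A runnable sketch of that loop might look like the following. The fetching steps are passed in as callables because the thread does not show the actual API calls; `collect_comments`, `get_posts`, `get_comments`, and the `title` field are assumptions for illustration only.

```python
import json
import re


def collect_comments(sources, get_posts, get_comments, pattern, out_path):
    """Save the comments on every post whose title matches `pattern`.

    `get_posts(source)` and `get_comments(post)` are hypothetical callables
    standing in for the real API requests. Returns the number of comments saved.
    """
    saved = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for source in sources:
            for post in get_posts(source):
                # Filter at the post level, before anything is written.
                if not pattern.search(post.get("title") or ""):
                    continue
                for comment in get_comments(post):
                    fh.write(json.dumps(comment) + "\n")
                    saved += 1
    return saved
```

Injecting the fetchers keeps the filtering logic testable without network access; the real script would pass in whatever functions wrap the API.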
From what I understand you're working on something that:
`json_obj` (JSON-formatted API response) to `data` (JSON data to be written to a file) in lines 42-43 of `get_all_posts.py`.

Does that make sense? If so, I can definitely implement that. Also, did you tell me that you've already done some work writing the regex we'll use to filter posts?