BabakHemmatian / Gay_Marriage_Corpus_Study

LDA and RNN for Reddit comments

When should we filter posts that mention gay marriage? #2

Closed: sabjoslo closed this issue 7 years ago

sabjoslo commented 7 years ago

From what I understand, you're working on something that:

  1. Reads JSON data from a file,
  2. Creates a Python object that stores the lines of the file that match a regex, and
  3. Then lemmatizes/analyzes the Python object.

I propose filtering posts before saving the API response to a file. To be more specific, we could filter the posts that get passed from json_obj (the JSON-formatted API response) to data (the JSON data to be written to a file) in lines 42-43 of get_all_posts.py.

Does that make sense? If so, I can definitely implement that. Also, did you tell me that you've already done some work writing the regex we'll use to filter posts?
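To make the proposal concrete, here is a minimal sketch of filtering between the API response and the write step. Everything in it is an assumption for illustration: the regex is a placeholder (the real one is whatever is already in the pull request), the function names `filter_posts` and `save_posts` are hypothetical, and `json_obj` is assumed to have the shape of a Reddit listing with posts under `data.children`.

```python
import json
import re

# Placeholder pattern (assumption); swap in the project's actual regex.
MARRIAGE_RE = re.compile(r"\b(gay|same[- ]sex)\s+marriage\b", re.IGNORECASE)


def filter_posts(json_obj, pattern=MARRIAGE_RE):
    """Return only the posts whose title or body matches the pattern.

    Assumes json_obj looks like a Reddit listing:
    {"data": {"children": [{"data": {"title": ..., "selftext": ...}}, ...]}}
    """
    children = json_obj.get("data", {}).get("children", [])
    kept = []
    for child in children:
        post = child.get("data", {})
        # Either field may be missing or None, so fall back to "".
        text = " ".join([post.get("title") or "", post.get("selftext") or ""])
        if pattern.search(text):
            kept.append(post)
    return kept


def save_posts(posts, path):
    """Write one JSON object per line so the file can be streamed later."""
    with open(path, "w", encoding="utf-8") as handle:
        for post in posts:
            handle.write(json.dumps(post) + "\n")
```

With something like this, only the matching posts would ever reach the file, so the later lemmatizing step would not need its own regex pass.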

BabakHemmatian commented 7 years ago

I did the same thing you suggested in the last version of the code. Please see the pull request.

sabjoslo commented 7 years ago

Which files/lines is the regex-matching code in?

BabakHemmatian commented 7 years ago

Lines 26 and 40 in parsereddit.py

sabjoslo commented 7 years ago

So what it looks like you're doing now is:

My suggestion would be a change in the step that creates the data files. I guess I'm actually making two suggestions: what is filtered and where it is filtered.

What: Filter posts instead of comments. E.g. if a post matches the regex, analyze all the comments on that post.

Where: Filter before saving anything to file, e.g. before the creation of a file like RC_2005-12.bz2. I would take the regex you've written, and use it to select which of the API's responses we want to save.

In the end, the flow would look something like:

Let me know your thoughts and if any of that isn't clear.

BabakHemmatian commented 7 years ago

Right. Sorry, I didn't read your original comment carefully enough. I'm not thick, believe me lol

sabjoslo commented 7 years ago

OK, I'll write and post something that, for each of 9 news sources, does something like:

for SOURCE in SOURCES:
    POSTS=<all the posts made by SOURCE that **match the regex**>
    for POST in POSTS:
        COMMENTS=<all comments on POST>
        saveToFile(COMMENTS)
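
A runnable version of that pseudocode might look like the sketch below. It is only an illustration under stated assumptions: `collect_comments` and its three callables (`fetch_posts`, `fetch_comments`, `save`) are hypothetical names standing in for whatever API wrappers and file-writing code the project ends up with, and each post is assumed to be a dict with `id` and `title` keys.

```python
import re


def collect_comments(sources, fetch_posts, fetch_comments, pattern, save):
    """For each source, save the comments on every post matching the pattern.

    The three callables are hypothetical stand-ins (assumptions):
      fetch_posts(source)      -> iterable of post dicts with "id" and "title"
      fetch_comments(post_id)  -> list of comment dicts
      save(post_id, comments)  -> None
    Returns the number of posts whose comments were saved.
    """
    saved = 0
    for source in sources:
        for post in fetch_posts(source):
            # Filter at the post level, before anything is written to file.
            if not pattern.search(post.get("title") or ""):
                continue
            comments = fetch_comments(post["id"])
            save(post["id"], comments)
            saved += 1
    return saved
```

Passing the fetch and save steps in as arguments keeps the loop testable without network access: in-memory dictionaries can stand in for the API while the regex is being tuned.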