When should we filter posts that mention gay marriage?

BabakHemmatian / Gay_Marriage_Corpus_Study

LDA and RNN for Reddit comments

When should we filter posts that mention gay marriage? #2

Closed sabjoslo closed 7 years ago

sabjoslo commented 7 years ago

From what I understand you're working on something that:

Reads JSON data from a file,
Creates a Python object that stores the lines of the file that match a regex, and
Then lemmatizes/analyzes the Python object.

I propose filtering posts before saving the response from the API to a file. To be more specific, we could filter the posts that get passed from json_obj (the JSON-formatted API response) to data (the JSON data to be written to a file) in lines 42-43 of get_all_posts.py.
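To make the proposal concrete, here is a rough sketch of filtering between json_obj and data. The regex, the `filter_posts` helper, the `"data"` list, and the `"message"` key are all placeholders I made up for illustration; the real response shape and the real regex may differ.

```python
import json
import re

# Hypothetical pattern; the project's actual regex may look different.
MARRIAGE_RE = re.compile(r"\b(gay|same[- ]sex)\s+marriage\b", re.IGNORECASE)


def filter_posts(json_obj, text_key="message"):
    """Return only the posts whose text matches the regex.

    Assumes the response holds its posts in a list under "data";
    both that key and `text_key` are assumptions, not the real API.
    """
    return [
        post
        for post in json_obj.get("data", [])
        if MARRIAGE_RE.search(post.get(text_key) or "")
    ]


# Stand-in for a JSON-formatted API response.
json_obj = {
    "data": [
        {"id": "1", "message": "Court rules on same-sex marriage"},
        {"id": "2", "message": "Local weather update"},
        {"id": "3"},  # a post with no text never matches
    ]
}

# Only the matching posts are kept and would be written to file.
data = filter_posts(json_obj)
serialized = json.dumps(data)
```

With this in place, whatever gets written to disk already contains only the posts we care about.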

Does that make sense? If so, I can definitely implement that. Also, did you tell me that you've already done some work writing the regex we'll use to filter posts?

BabakHemmatian commented 7 years ago

I did the same thing you suggested in the last version of the code. Please see the pull request.

sabjoslo commented 7 years ago

What files/lines is the code for regex matching in?

BabakHemmatian commented 7 years ago

26 and 40 in parsereddit.py

sabjoslo commented 7 years ago

So what it looks like you're doing now is:

Take a list of comments saved to file and filter by regex.
Parse/analyze/etc. the remaining comments.
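As I understand it, the first step amounts to something like the sketch below. I'm assuming the file is bz2-compressed with one JSON object per line and the comment text under a `"body"` key; the regex, the `matching_comments` helper, and the sample file are stand-ins, not what parsereddit.py actually contains.

```python
import bz2
import json
import os
import re
import tempfile

# Hypothetical pattern standing in for the project's regex.
PATTERN = re.compile(r"gay marriage", re.IGNORECASE)


def matching_comments(path, text_key="body"):
    """Yield comments from a bz2 JSON-lines file whose text matches PATTERN."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            comment = json.loads(line)
            if PATTERN.search(comment.get(text_key) or ""):
                yield comment


# Build a tiny sample file so the sketch runs on its own.
sample = [
    {"id": "a", "body": "I support gay marriage."},
    {"id": "b", "body": "Anyone watching the game tonight?"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_comments.bz2")
with bz2.open(sample_path, "wt", encoding="utf-8") as handle:
    for comment in sample:
        handle.write(json.dumps(comment) + "\n")

kept = list(matching_comments(sample_path))
```

Note that everything, matching or not, has already been saved to disk by the time this filter runs.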

My suggestion would be a change in the step that creates the data files. I guess I'm actually making two suggestions: what is filtered and where it is filtered.

What: Filter posts instead of comments. E.g. if a post matches the regex, analyze all the comments on that post.
Where: Filter before saving anything to file, e.g. before the creation of a file like RC_2005-12.bz2. I would take the regex you've written, and use it to select which of the API's responses we want to save.

In the end, the flow would look something like:

Query for posts.
Filter posts by regex.
Use the matching posts to query comments.
Save comments to file.
Parse/analyze, etc.
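A minimal sketch of that flow, with in-memory stand-ins for the API calls. `query_posts`, `query_comments`, the sample data, and the regex are all hypothetical; the real versions would hit the API and the last step would write to file.

```python
import re

# Hypothetical pattern standing in for the project's regex.
PATTERN = re.compile(r"gay marriage", re.IGNORECASE)

# Stand-in data; a real run would get these from the API.
POSTS = [
    {"id": "p1", "text": "Debate over gay marriage continues"},
    {"id": "p2", "text": "Sports roundup"},
]
COMMENTS = {
    "p1": ["I agree with the ruling", "Not sure about this"],
    "p2": ["Great game"],
}


def query_posts():
    """Placeholder for the API call that returns posts."""
    return POSTS


def query_comments(post_id):
    """Placeholder for the API call that returns comments on one post."""
    return COMMENTS.get(post_id, [])


def collect_comments():
    """Filter posts by regex, then gather every comment on the matches."""
    matching = [post for post in query_posts() if PATTERN.search(post["text"])]
    collected = []
    for post in matching:
        for body in query_comments(post["id"]):
            collected.append({"post_id": post["id"], "body": body})
    return collected


# These records are what would be saved to file and later parsed/analyzed.
collected = collect_comments()
```

The point to notice is that comments on a matching post are kept even when the comment text itself never mentions the topic.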

Let me know your thoughts and if any of that isn't clear.

BabakHemmatian commented 7 years ago

Right. Sorry, I didn't read your original comment carefully enough. I'm not thick, believe me lol

Uriel said that the Reddit data is organized by message, which means either giving up on filtering posts instead of comments or creating our own API, which might be very time-consuming. Considering how well the topics turned out even with a limited corpus and a less-than-perfect setup, I don't think it's worth it. But you said the Facebook API allows for separating posts. Maybe we should have lines in our code that filter posts, but comment them out when running the code on the Reddit corpus.
I was thinking last night about filtering the corpus before saving anything to file. I'm not comfortable with code that refers to online objects, though. Could you please implement these? In the meantime, I'll focus on implementing Uriel's suggestions about the stopword list and the LDA.
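One possible way to keep post-level filtering in the code without commenting lines in and out is a boolean switch. This is a different mechanism from the commenting-out idea above, offered only as a sketch; the `select_comments` function, the `filter_by_post` flag, the thread layout, and the regex are all assumptions.

```python
import re

# Hypothetical pattern standing in for the project's regex.
PATTERN = re.compile(r"gay marriage", re.IGNORECASE)


def select_comments(threads, filter_by_post):
    """Select comments either by matching the parent post or the comment itself.

    `threads` is assumed to be a list of dicts with a "post" string
    (possibly missing, as with message-organized data) and a "comments" list.
    """
    selected = []
    for thread in threads:
        if filter_by_post:
            # Post-level filter: keep every comment on a matching post.
            if PATTERN.search(thread.get("post") or ""):
                selected.extend(thread["comments"])
        else:
            # Comment-level filter: keep only comments that match.
            selected.extend(c for c in thread["comments"] if PATTERN.search(c))
    return selected


threads = [
    {"post": "Gay marriage ruling announced", "comments": ["About time", "I disagree"]},
    {"post": "Weather", "comments": ["gay marriage came up at dinner", "Sunny"]},
]

by_post = select_comments(threads, filter_by_post=True)
by_comment = select_comments(threads, filter_by_post=False)
```

Under this sketch, a source that exposes posts would run with `filter_by_post=True`, and a message-organized corpus would run with `filter_by_post=False`.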

sabjoslo commented 7 years ago

OK, I'll write and post something that, for each of 9 news sources, does something like:

for SOURCE in SOURCES:
    POSTS=<all the posts made by SOURCE that **match the regex**>
    for POST in POSTS:
        COMMENTS=<all comments on POST>
        saveToFile(COMMENTS)