BabakHemmatian / Gay_Marriage_Corpus_Study

LDA and RNN for Reddit comments

Raw comment files contain files from other months #24

Closed sabjoslo closed 5 years ago

sabjoslo commented 6 years ago

Just posting this for the sake of documentation: Sometimes raw comment files contain comments from other months, e.g. the file for 2007-12 contains two (relevant) comments from 2007-11. This explains discrepancies between RC_Count_List and RC_Count_Dict.

BabakHemmatian commented 6 years ago

Yeah, I noticed that when I was trying to piece together parsed data from four different computers. For the same 2 comments actually. It seemed very uncommon to me though. Do you have numbers to gauge the seriousness of the discrepancy?

sabjoslo commented 6 years ago

Here are counts of mismatches for 2006--2008 (am only including non-zero counts):

2007 2: 1
2007 3: 1
2007 7: 1
2007 11: 2
2007 12: 2
2008 6: 1
2008 9: 2
2008 11: 5
2008 12: 2
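For documentation, here is a minimal sketch of how per-file mismatch counts like these could be computed. It is not the repository's actual code. It assumes each comment is a dict with a Unix-epoch `created_utc` field (as in the public Reddit comment dumps) and that months are compared in UTC. The names `comment_month`, `count_mismatches`, and `comments_by_file` are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def comment_month(created_utc):
    """Return the (year, month) of a Unix timestamp, interpreted in UTC."""
    dt = datetime.fromtimestamp(int(created_utc), tz=timezone.utc)
    return (dt.year, dt.month)


def count_mismatches(comments_by_file):
    """Count comments whose timestamp month differs from their file's month.

    comments_by_file (hypothetical structure): maps the (year, month) a raw
    file is named for to the list of comment dicts read from that file.
    """
    mismatches = Counter()
    for file_month, comments in comments_by_file.items():
        for comment in comments:
            if comment_month(comment["created_utc"]) != file_month:
                mismatches[file_month] += 1
    return mismatches
```

Only months with at least one mismatch end up as keys in the result, which matches reporting non-zero counts only.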

BabakHemmatian commented 6 years ago

Thank you so much for the counts! So we have 17 mismatches in around 4000 comments, about 0.4% of the documents. This certainly has to be reported, but I don't think it poses a huge problem for our analysis. What do you think? If it is problematic, I could help restructure the code so that it uses the timestamps on the comments, rather than the separate files, as the indicator of month.

sabjoslo commented 6 years ago

I don't feel like it's terribly problematic, but since it seems relatively trivial to make that change, I guess I feel like we should. What do you think?

BabakHemmatian commented 6 years ago

Yeah, the change to the code is trivial. We don't even need to parse the data again. Just write a single loop that iterates through the filtered comments and creates a new RC_Count_List. I can take care of that.
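The single loop described here might look roughly like the sketch below. It makes several assumptions: the already-filtered comments are available as dicts with a Unix-epoch `created_utc` field, months are taken in UTC, and RC_Count_List is a list of per-month counts in chronological order. The function names are hypothetical, not the repository's.

```python
from collections import Counter
from datetime import datetime, timezone


def rebuild_month_counts(comments):
    """Count comments per (year, month) using each comment's own timestamp."""
    counts = Counter()
    for comment in comments:
        # Assumed field: Unix-epoch seconds, interpreted in UTC.
        dt = datetime.fromtimestamp(int(comment["created_utc"]), tz=timezone.utc)
        counts[(dt.year, dt.month)] += 1
    return counts


def counts_as_list(counts):
    """Flatten the per-month counts into a chronologically ordered list."""
    # (year, month) tuples sort chronologically, so sorted() gives month order.
    return [counts[month] for month in sorted(counts)]
```

Because the month comes from the timestamp, the file a comment was stored in no longer matters. Months with no relevant comments are absent from the result and would need explicit zero entries if the list must cover every month.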