BabakHemmatian / Gay_Marriage_Corpus_Study

LDA and RNN for Reddit comments

Sample training data for the LDA evenly across years #9

Closed: sabjoslo closed this issue 6 years ago

sabjoslo commented 6 years ago

Add a function to Utils.py that writes n randomly sampled comments to a file, drawing from the write_original files for a given set of years (when year==2008, include data from 2006 and 2007, too).

sabjoslo commented 6 years ago

@BabakHemmatian, to do this we're going to need to have some way to identify which comments in the original_comm file come from which year. What do you think about adding the year to the end of the filename (i.e. 'original_comm-{}'.format(year) in line 251 of Utils.py)?

BabakHemmatian commented 6 years ago

@sabjoslo That's totally doable, but might not be necessary. There's a function called Yearly_Counts in Utils.py that calculates the number of relevant comments for each year and can be used for the sampling. It receives as input the monthly counts stored on disk when you run the parser and outputs both the regular and cumulative yearly relevant counts.
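As a rough illustration of what per-year and cumulative counts look like, here is a minimal sketch. It is not the actual Yearly_Counts code: the function name `yearly_counts_sketch` and the input format (a chronological list of `(year, count)` pairs, one per month) are assumptions made for the example, and the real on-disk format produced by the parser may differ.

```python
from itertools import accumulate


def yearly_counts_sketch(monthly_counts):
    """Aggregate monthly counts into per-year and cumulative yearly counts.

    monthly_counts: iterable of (year, count) pairs, one per month.
    This input format is hypothetical, chosen only for illustration.
    """
    yearly = {}
    for year, count in monthly_counts:
        yearly[year] = yearly.get(year, 0) + count

    years = sorted(yearly)
    regular = [yearly[y] for y in years]
    # cumulative[i] is the number of comments up to and including years[i]
    cumulative = list(accumulate(regular))
    return years, regular, cumulative
```

The cumulative list is the piece that matters for sampling: each entry marks where one year's comments end and the next year's begin, provided the comments were written to disk in chronological order.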

sabjoslo commented 6 years ago

Do you mean the function Yearly_Counts? I'm not understanding how that would help identify which year a given comment was from. How are you able to identify that?

BabakHemmatian commented 6 years ago

Yeah, that's the function. You can use the cumulative count to determine which year you're sampling from. So for example, if the cumulative count is 152 for 2006 and 786 for 2007, and you have a for loop going over original_comm with a zero-based counter, then indices 0 through 151 belong to 2006 and indices 152 through 785 belong to 2007.
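The index-to-year mapping can be sketched as below. This is only an illustration under stated assumptions: `year_of_index` and `sample_indices_per_year` are hypothetical helpers, not functions in Utils.py; `years` and `cumulative` are assumed to be parallel lists in chronological order; and line indices in original_comm are assumed to be zero-based and ordered by date. The special handling of 2006 and 2007 for year 2008 is not covered here.

```python
import random
from bisect import bisect_right


def year_of_index(index, years, cumulative):
    """Return the year that a zero-based comment index falls in."""
    if index < 0:
        raise IndexError("index must be non-negative")
    # First position whose cumulative count is strictly greater than index
    pos = bisect_right(cumulative, index)
    if pos >= len(years):
        raise IndexError("index is beyond the last cumulative count")
    return years[pos]


def sample_indices_per_year(years, cumulative, n, seed=None):
    """Pick up to n random comment indices from each year's index range."""
    rng = random.Random(seed)
    samples = {}
    start = 0
    for year, end in zip(years, cumulative):
        population = range(start, end)  # indices belonging to this year
        k = min(n, len(population))     # a year may have fewer than n comments
        samples[year] = sorted(rng.sample(population, k))
        start = end
    return samples
```

With the numbers from the example, `year_of_index(151, [2006, 2007], [152, 786])` gives 2006 and `year_of_index(152, [2006, 2007], [152, 786])` gives 2007. The sampled indices could then be matched against the counter in a single pass over original_comm, writing out only the selected lines.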

sabjoslo commented 6 years ago

Oh, I see. Would you be able to send me your RC_Count_List file, since you said the parser takes a couple of days to run?