sabjoslo closed 6 years ago
@BabakHemmatian, to do this we're going to need some way to identify which comments in the original_comm file come from which year. What do you think about adding the year to the end of the filename (i.e. 'original_comm-{}'.format(year) in line 251 of Utils.py)?
@sabjoslo That's totally doable, but might not be necessary. There's a function called Yearly_Counts in Utils.py that calculates the number of relevant comments for each year and can be used for the sampling. It takes as input the monthly counts stored on disk when you run the parser and outputs both the regular and cumulative yearly relevant counts.
Do you mean the function Yearly_Counts? I'm not understanding how that would help identify which year a given comment was from. How are you able to identify that?
Yeah, that's the function. You can use the cumulative count to determine which year you're sampling from. So for example, if the cumulative count is 152 for 2006 and 786 for 2007, and you have a for loop going over original_comm with a counter that starts at 0, then the indices from 152 through 785 belong to the year 2007.
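A minimal sketch of that lookup, assuming the cumulative counts are available as a plain list ordered by year and that the first year is 2006. The function name `year_for_index` and its parameters are hypothetical and are not part of Utils.py:

```python
from bisect import bisect_right


def year_for_index(index, cumulative_counts, first_year=2006):
    """Return the year of the comment at a zero-based index in original_comm.

    cumulative_counts[i] is assumed to hold the total number of relevant
    comments up to and including year first_year + i.
    """
    if index < 0 or index >= cumulative_counts[-1]:
        raise IndexError("index is outside the range covered by the counts")
    # bisect_right counts how many cumulative totals are <= index,
    # which is the number of whole years that end before this comment.
    return first_year + bisect_right(cumulative_counts, index)


# Using the numbers from the example above: 152 through 2006, 786 through 2007.
counts = [152, 786]
print(year_for_index(151, counts))  # last comment of 2006
print(year_for_index(152, counts))  # first comment of 2007
```

A binary search keeps the lookup cheap even when it is called once per comment, but a simple loop over the cumulative counts would work just as well.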
Oh, I see. Would you be able to send me your RC_Count_List file, since you said the parser takes a couple of days to run?
Add a function to Utils.py that writes to file n random comments from write_original files for a set of given years (when year==2008, include data from 2006 and 2007, too).
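One way such a function could look, as a rough sketch only. It assumes the comments file holds one comment per line, that cumulative yearly counts are passed in as a list starting at 2006, and that n comments are drawn per requested year. All names here (`year_ranges`, `sample_indices`, `write_sample`) are hypothetical and do not exist in Utils.py:

```python
import random

FIRST_YEAR = 2006  # assumption: the earliest year covered by the counts


def year_ranges(cumulative_counts, first_year=FIRST_YEAR):
    """Map each year to the range of zero-based line indices it covers."""
    ranges = {}
    start = 0
    for offset, end in enumerate(cumulative_counts):
        ranges[first_year + offset] = range(start, end)
        start = end
    return ranges


def sample_indices(year, n, cumulative_counts, rng=random):
    """Pick up to n random line indices for a year.

    For 2008 the pool also includes 2006 and 2007, as the task describes.
    """
    ranges = year_ranges(cumulative_counts)
    years = [2006, 2007, 2008] if year == 2008 else [year]
    pool = [i for y in years for i in ranges[y]]
    return sorted(rng.sample(pool, min(n, len(pool))))


def write_sample(comm_path, out_path, years, n, cumulative_counts, seed=None):
    """Write the sampled comments for the given years to out_path."""
    rng = random.Random(seed)
    wanted = set()
    for year in years:
        wanted.update(sample_indices(year, n, cumulative_counts, rng))
    with open(comm_path) as src, open(out_path, "w") as dst:
        for index, line in enumerate(src):
            if index in wanted:
                dst.write(line)
    return len(wanted)
```

Because the sampled indices are collected in a set, a comment drawn for more than one requested year (for example when both 2006 and 2008 are requested) is written only once, so the output can hold fewer than n comments per year in that case.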