Closed sabjoslo closed 6 years ago
RC_Count_List was supposed to store the cumulative monthly counts. It still does in my version of the parser (it's based on a simple counter named processed_counter in the parser function). This is the first two years of the output I get:
[0,0,3,14,18,35,65,73,85,99,137,152,166,194,220,247,270,304,361,429,528,623,692,786]
In that case indices[-1] would be the total count of the comments in the dataset and line 68 would create a set of all those comments.
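To make the cumulative reading concrete, here is a minimal sketch of how a train/eval split could be drawn when the last entry is the total comment count. The helper name split_indices, the training_fraction parameter, and the seed are hypothetical and not taken from the repository; only the list of counts comes from the output quoted above.

```python
import random

# First two years of cumulative monthly counts, as quoted above.
rc_count_list = [0, 0, 3, 14, 18, 35, 65, 73, 85, 99, 137, 152,
                 166, 194, 220, 247, 270, 304, 361, 429, 528, 623, 692, 786]


def split_indices(cumulative_counts, training_fraction=0.8, seed=0):
    """Hypothetical helper: split all comment indices into train and eval sets.

    Assumes cumulative_counts[-1] is the total number of comments, which
    only holds if the list is cumulative.
    """
    total = cumulative_counts[-1]
    all_indices = list(range(total))  # every comment index, 0 .. total - 1
    rng = random.Random(seed)         # seeded so the split is reproducible
    rng.shuffle(all_indices)
    cut = int(training_fraction * total)
    return set(all_indices[:cut]), set(all_indices[cut:])


train_set, eval_set = split_indices(rc_count_list)
```

With a non-cumulative list, the same code would only cover as many indices as the final month's count, which is one way this kind of mismatch could go unnoticed until sampling.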
I realize that a non-cumulative count would have been more intuitive, but this is what I came up with at the time, and since the non-cumulative version is trivially retrievable from the cumulative version, I stuck with it. More importantly, a number of other functions also rely on RC_Count_List being the cumulative count, including Yearly_Counts, which is used for the topic contribution calculation.
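As a rough illustration of why the two forms are interchangeable, the sketch below recovers per-month counts from the cumulative list and rebuilds the cumulative list from them. The variable names are illustrative, and the yearly grouping is only an assumption (it presumes the list starts at the beginning of a year); it is not the actual Yearly_Counts implementation.

```python
from itertools import accumulate

# First two years of cumulative monthly counts, as quoted above.
cumulative = [0, 0, 3, 14, 18, 35, 65, 73, 85, 99, 137, 152,
              166, 194, 220, 247, 270, 304, 361, 429, 528, 623, 692, 786]

# Per-month counts: the first entry, then the difference between neighbours.
monthly = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

# Going back the other way is a running sum.
rebuilt = list(accumulate(monthly))

# Hypothetical yearly totals, assuming 12 entries per year from the start.
yearly = [sum(monthly[i:i + 12]) for i in range(0, len(monthly), 12)]
```

Here rebuilt equals cumulative, and yearly comes out as [152, 634] for the two years shown.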
My guess is that you rewrote the RC_Count_List production to the non-cumulative version while streamlining the parser, and that's why you're getting this error. Hope this clarifies things! Let me know if I can help with anything :)
That's really helpful, thanks! I'll see if I can figure out where/how the change was made. What is the last commit that's been integrated into your local repository (you can see the commit tree by typing git log)?
I'm ashamed to say it's from before Utils.py was broken up into different files lol. I need to catch up
I'm glad you're running an older version of the code: it means you're able to diagnose bugs like this one, which it sounds like I introduced while messing around with Utils. What's the first commit hash that git log outputs when you run it from your local repo?
The last commit to my code is 2e373db. I must have made uncommitted changes to the code afterwards though. I'm unfortunately a very disorganized coder. Working on that.
Thanks! BTW you can see any changes you've made that haven't been staged yet by running git diff.
The variable indices, created in Define_Sets and passed to Create_New_Sets, is read from RC_Count_List (which is a list of counts per month IIRC):
However, it seems to be treated as though it's a list of all comment indices. In Create_New_Sets it's used to determine the comment indices to sample the train and eval sets from: