BabakHemmatian / Gay_Marriage_Corpus_Study

LDA and RNN for Reddit comments

Why are we getting comment indices from RC_Count_List during indexing for training and test sets? #27

Closed sabjoslo closed 6 years ago

sabjoslo commented 6 years ago

The variable indices created in Define_Sets and passed to Create_New_Sets is read from RC_Count_List (which is a list of counts per month IIRC):

ipdb> indices
[0, 0, 3, 11, 4, 17, 30, 8, 12, 14, 38, 15, 14, 28, 26, 27, 23, 34, 57, 68, 99, 95, 69, 94, 76, 92, 71, 79, 178, 185, 104, 166, 177, 366, 1394, 371]

However, it seems to be treated as though it were a list of all comment indices. In Create_New_Sets, it is used to determine the comment indices from which the train and eval sets are sampled:

67    num_comm = indices[-1]
68    indices = range(num_comm)
...
126    LDA_sets['eval'] = sample(indices,num_eval)
127    LDA_sets['train'] = set(indices).difference(set(LDA_sets['eval']))

BabakHemmatian commented 6 years ago

RC_Count_List was supposed to store the cumulative monthly counts. It still does in my version of the parser (it's based on a simple counter named processed_counter in the parser function). This is the first two years of the output I get:

[0, 0, 3, 14, 18, 35, 65, 73, 85, 99, 137, 152, 166, 194, 220, 247, 270, 304, 361, 429, 528, 623, 692, 786]

In that case, indices[-1] would be the total count of the comments in the dataset, and line 68 would create the full range of indices for those comments. I realize that a non-cumulative count would have been more intuitive, but this is what I came up with at the time, and since the non-cumulative version is trivially retrievable from the cumulative version, I stuck with it. More importantly, a number of other functions also rely on RC_Count_List being the cumulative count, including Yearly_Counts, which is used for the topic contribution calculation.

My guess is that you rewrote the RC_Count_List production to the non-cumulative version while streamlining the parser, and that's why you're getting this error. Hope this clarifies things! Let me know if I can help with anything :)
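To make the difference concrete, here is a minimal sketch (not the repository's code) that converts between the two forms and shows how many comment indices each one yields. It uses the first 24 values from the ipdb output earlier in the thread. The helper split_indices and its eval_fraction parameter are hypothetical stand-ins for the logic around lines 67-68 and 126-127 of Create_New_Sets.

```python
from itertools import accumulate
from random import sample

# Per-month counts: the first 24 values of the list printed in the ipdb session.
monthly = [0, 0, 3, 11, 4, 17, 30, 8, 12, 14, 38, 15,
           14, 28, 26, 27, 23, 34, 57, 68, 99, 95, 69, 94]

# Running total, which is the form the downstream functions expect.
cumulative = list(accumulate(monthly))

# The per-month counts can be recovered by differencing adjacent totals.
recovered = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


def split_indices(count_list, eval_fraction=0.1):
    """Hypothetical helper mirroring the sampling logic quoted in the issue.

    eval_fraction is an assumption; the thread does not show how num_eval is set.
    """
    num_comm = count_list[-1]          # only meaningful if the list is cumulative
    indices = range(num_comm)
    num_eval = int(num_comm * eval_fraction)
    eval_set = set(sample(indices, num_eval))
    train_set = set(indices) - eval_set
    return train_set, eval_set


train_ok, eval_ok = split_indices(cumulative)   # covers every comment
train_bad, eval_bad = split_indices(monthly)    # covers only the last month's count
print(len(train_ok) + len(eval_ok), len(train_bad) + len(eval_bad))
```

With the cumulative list, the last element is the dataset size, so the split covers every comment. With the per-month list, the last element is only the final month's count, so most comments are silently excluded from both sets.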

sabjoslo commented 6 years ago

That's really helpful, thanks! I'll see if I can figure out where/how the change was made. What is the last commit that's been integrated into your local repository (you can see the commit tree by typing git log)?

BabakHemmatian commented 6 years ago

I'm ashamed to say it's from before Utils.py was broken up into different files lol. I need to catch up

sabjoslo commented 6 years ago

I'm glad you're running an older version of the code--it means you're able to diagnose bugs like this one which it sounds like I introduced while messing around with Utils. What's the first commit hash that git log outputs when you run it from your local repo?

BabakHemmatian commented 6 years ago

The last commit to my code is 2e373db. I must have made uncommitted changes to the code afterwards, though. I'm unfortunately a very disorganized coder. Working on that.

sabjoslo commented 6 years ago

Thanks! BTW you can see any changes in your working tree that haven't been staged yet by running git diff (git diff HEAD shows all uncommitted changes, staged or not).