allenai / vampire

Variational Methods for Pretraining in Resource-limited Environments
Apache License 2.0
174 stars 33 forks

Issue with preprocessing background term #40

Closed leo-liuzy closed 5 years ago

leo-liuzy commented 5 years ago

At line 90, the calculation of the background frequency seems wrong. Since the background frequency is essentially the probability of each word (a unigram language model probability), the correct expression should be master.toarray().sum(0) / master.toarray().sum().

Here is what I found with ipdb when running the model on the AG dataset:

 /data/zeyuliu2/vampire/scripts/preprocess_data.py(93)main()
     92     ipdb.set_trace()
---> 93     print(f"all_data shape: {all_data.shape}")
     94     # generate background frequency

ipdb> l                                                                                                                      
     88     vectorized_dev_examples = sparse.hstack((np.array([0] * len(tokenized_dev_examples))[:,None], vectorized_dev_examples))
     89     master = sparse.vstack([vectorized_train_examples, vectorized_dev_examples])
     90     all_data = master.toarray()
     91     import ipdb
     92     ipdb.set_trace()
---> 93     print(f"all_data shape: {all_data.shape}")
     94     # generate background frequency
     95     print("generating background frequency...")
     96     bgfreq = dict(zip(count_vectorizer.get_feature_names(), all_data.sum(1) / args.vocab_size))
     97 
     98     print("saving data...")

ipdb> all_data.shape                                                                                                         
(120000, 30001)
ipdb> all_data.sum(1).shape                                                                                                  
(120000,)
ipdb> all_data.sum(0).shape                                                                                                  
(30001,)

There is no error message when running on the AG dataset because num_doc > num_vocab, so zip silently truncates to the vocabulary size. But with my smaller dataset, where num_doc < 3000, there is an error (dimension mismatch during training).

I fixed the bug; the fix is exactly what I suggested --- master.toarray().sum(0) / master.toarray().sum()
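To make the axis issue concrete, here is a minimal sketch with a toy dense matrix standing in for master.toarray() (the matrix values and the vocab list are hypothetical, not from the repo):

```python
import numpy as np

# Toy dense term-count matrix standing in for master.toarray():
# 3 documents x 4 vocabulary words (hypothetical data).
all_data = np.array([
    [2, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
])
vocab = ["cat", "dog", "ran", "sat"]  # hypothetical feature names

# Buggy version: sum over axis 1 yields one value per *document*,
# and zip silently truncates to the shorter sequence, so the dict
# pairs vocabulary words with per-document sums.
buggy = dict(zip(vocab, all_data.sum(1)))
assert len(buggy) == 3  # truncated: one entry lost when num_doc < num_vocab

# Fixed version: sum over axis 0 yields one count per *word*,
# normalized by the total count -> a unigram probability per vocab word.
counts = all_data.sum(0)
bgfreq = dict(zip(vocab, counts / counts.sum()))

assert len(bgfreq) == len(vocab)
assert abs(sum(bgfreq.values()) - 1.0) < 1e-12  # probabilities sum to 1
```

The key point is that sum(1) has shape (num_doc,) while sum(0) has shape (num_vocab,), matching the ipdb output above; only the latter lines up with get_feature_names().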

dangitstam commented 5 years ago

Thanks for reporting this! The change has been made in our current experimental branch; I'll close this issue once it's in master.

kernelmachine commented 5 years ago

This has been merged! Thanks for the catch @leo-liuzy .