Bigram and trigram collection files are not as listed

LexPredict / lexpredict-lexnlp

LexNLP by LexPredict

GNU Affero General Public License v3.0


Bigram and trigram collection files are not as listed #5

Closed JonathanHourany closed 6 years ago

JonathanHourany commented 6 years ago

It appears that all collocation_bigrams_*.pickle files are the same: they are smaller than reported and all contain the exact same list.

In [1]: import pickle
In [2]: BIGRAM_COLLOCATIONS_100 = pickle.load(open("collocation_bigrams_100.pickle", 'rb'))
In [3]: BIGRAM_COLLOCATIONS_1000 = pickle.load(open("collocation_bigrams_1000.pickle", 'rb'))
In [4]: BIGRAM_COLLOCATIONS_10000 = pickle.load(open("collocation_bigrams_10000.pickle", 'rb'))
In [5]: len(BIGRAM_COLLOCATIONS_100)
Out[5]: 46

In [6]: len(BIGRAM_COLLOCATIONS_1000)
Out[6]: 46

In [7]: len(BIGRAM_COLLOCATIONS_10000)
Out[7]: 46

There's a similar issue with trigrams:

In [10]: TRIGRAM_COLLOCATIONS_1000 = pickle.load(open("collocation_trigrams_1000.pickle", 'rb'))
In [11]: TRIGRAM_COLLOCATIONS_10000 = pickle.load(open("collocation_trigrams_10000.pickle", 'rb'))
In [12]: len(TRIGRAM_COLLOCATIONS_100)
Out[12]: 100

In [13]: len(TRIGRAM_COLLOCATIONS_1000)
Out[13]: 431

In [14]: len(TRIGRAM_COLLOCATIONS_10000)
Out[14]: 431
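The comparison above can also be scripted so that it covers every collocation file at once. This is only a rough sketch: the function name `find_identical_files` and the glob pattern are made up for illustration (they are not part of LexNLP), and it assumes each pickle holds an iterable of collocations whose items have a stable `repr`.

```python
import pickle
from pathlib import Path


def find_identical_files(directory, pattern="collocation_*grams_*.pickle"):
    """Group pickle files whose contents are identical, ignoring item order.

    Hypothetical helper: returns a list of groups, each group being a list of
    (file name, item count) pairs for files that hold the same items.
    """
    groups = {}
    for path in sorted(Path(directory).glob(pattern)):
        with path.open("rb") as handle:
            items = pickle.load(handle)
        # Order-insensitive fingerprint of the contents (assumes stable reprs).
        fingerprint = tuple(sorted(repr(item) for item in items))
        groups.setdefault(fingerprint, []).append((path.name, len(items)))
    # Only groups with more than one file indicate duplicated contents.
    return [files for files in groups.values() if len(files) > 1]
```

Calling `find_identical_files(".")` in the directory that holds the pickles should return an empty list if every file is distinct; any group it returns names files that contain the same items.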

mjbommar commented 6 years ago

Thanks, @JonathanHourany. We'll investigate this in vendoring/distribution and add unit tests to cover it going forward.

mjbommar commented 6 years ago

Hello @JonathanHourany, we have identified the issue and have fixes pending for the next release. Would you like us to ship you updated copies in the meantime?

JonathanHourany commented 6 years ago

@mjbommar Yes, please, that would be great!

DomHudson commented 6 years ago

I am also interested in this if possible! Many thanks, Dom

ericlex commented 6 years ago

@JonathanHourany - Please email us at support@lexpredict.com so we can get you the files.

Thanks, Eric Detterman eric@lexpredict.com

ericlex commented 6 years ago

@DomHudson Please email us at support@lexpredict.com so we can get you the files.

Thanks, Eric Detterman eric@lexpredict.com

JonathanHourany commented 6 years ago

@ericlex - Will do. Thanks!

mjbommar commented 6 years ago

Hi @JonathanHourany and @DomHudson, we've created a hotfix branch and uploaded a tarball with the new files coming in 0.1.9: https://github.com/LexPredict/lexpredict-lexnlp/blob/0.1.8-hotfix-stopwords-collocations/lexnlp/nlp/en/stopwords_collocations_hotfix_0.1.8.tar.gz

You'll note that we've changed the collocation sizes, in addition to fixing the underlying issue with their automation.
You'll also note that we now provide multiple stopword files. The proportion in the file name corresponds to the cumulative token frequency represented by the stopwords; e.g., stopwords_0.5.pickle contains the top N most frequent words such that 50% of all token occurrences are represented by these stopwords.
These are still preliminary for public use, and we are working on releasing the underlying data generation/models in the near future. For example, you may or may not want numeric or hyphenated tokens in your stopwords. You may have seen our release of OpenEDGAR (https://github.com/LexPredict/openedgar) last week, which will be a dependency for customizing these.
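To make the file-name proportion concrete, here is a minimal sketch of one way to pick stopwords by cumulative token frequency. It is an illustration, not the LexNLP generation code: the function name `stopwords_by_cumulative_frequency` is hypothetical, and stopping as soon as the covered share reaches the proportion is an assumption about how the cutoff is applied.

```python
from collections import Counter


def stopwords_by_cumulative_frequency(tokens, proportion):
    """Return the most frequent tokens that together cover `proportion`
    of all token occurrences (hypothetical helper, assumed cutoff rule)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    stopwords = []
    covered = 0
    # Walk tokens from most to least frequent, accumulating their counts.
    for token, count in counts.most_common():
        if covered >= proportion * total:
            break
        stopwords.append(token)
        covered += count
    return stopwords
```

With the tokens of "the the the the of of a b c d" and a proportion of 0.5, this picks "the" and "of", which account for 6 of the 10 occurrences. Filtering out numeric or hyphenated tokens, as mentioned above, would be a separate step applied before counting.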

DomHudson commented 6 years ago

Many thanks!