JonathanHourany closed this issue 6 years ago
Thanks, @JonathanHourany . We'll investigate this in vendoring/distribution and add some unit tests to cover going forward.
Hello @JonathanHourany , we have identified the issue and have fixes pending for the next release. Would you like us to ship you updated copies in the meantime?
@mjbommar Yes, please, that would be great!
I am also interested in this if possible! Many thanks, Dom
@JonathanHourany - Please email us at support@lexpredict.com so we can get you the files.
Thanks, Eric Detterman eric@lexpredict.com
@DomHudson Please email us at support@lexpredict.com so we can get you the files.
Thanks, Eric Detterman eric@lexpredict.com
@ericlex - Will do. Thanks!
Hi @JonathanHourany and @DomHudson , we've created a hotfix branch and uploaded a tarball with the new files coming in 0.1.9: https://github.com/LexPredict/lexpredict-lexnlp/blob/0.1.8-hotfix-stopwords-collocations/lexnlp/nlp/en/stopwords_collocations_hotfix_0.1.8.tar.gz
You'll note that we've changed the collocation sizes, in addition to fixing the underlying issue with their automation.
You'll also note that we now provide multiple stopword files. The proportion in the file name corresponds to the cumulative token frequency represented by the stopwords; e.g., stopwords_0.5.pickle corresponds to the top N most frequent words such that 50% of all token occurrences are covered by these stopwords.
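To make the naming scheme concrete, here is a minimal sketch (not the LexNLP implementation, just an illustration of the idea) of selecting stopwords by cumulative token frequency:

```python
from collections import Counter

def stopwords_by_cumulative_freq(tokens, proportion=0.5):
    """Illustrative sketch: pick the most frequent tokens until they
    cover `proportion` of all token occurrences (e.g. proportion=0.5
    would correspond to a file like stopwords_0.5.pickle)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    stopwords = []
    for token, count in counts.most_common():
        if covered / total >= proportion:
            break
        stopwords.append(token)
        covered += count
    return stopwords
```

With a toy corpus where "the" makes up half of all occurrences, `stopwords_by_cumulative_freq(tokens, 0.5)` returns just `["the"]`.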
These are still preliminary for public use, and we are working on releasing the underlying data generation/models in the near future. For example, you may or may not want numeric or hyphenated tokens in your stopwords. You may have seen our release of OpenEDGAR (https://github.com/LexPredict/openedgar) last week, which will be a dependency for customizing these.
Many thanks!
It appears that all collocation_bigram_*.pickle files are the same; they are smaller than reported and all contain the exact same list. There's a similar issue with trigrams.