Open · organisciak opened this issue 9 years ago
This is already occurring -- wordlist.txt is limited, with a cutoff at the top 1 million tokens, so maybe it's just not being encountered in this set yet?
On Sat, Mar 7, 2015 at 10:15 PM, Peter Organisciak notifications@github.com wrote:
Working with the HTRC data, which is OCR'd from book scans, I find that a sizeable portion of the wordlist is simply OCR errors. While some errors are meaningful (e.g. estimating the usage of the medial s), most of that index is pointless. On my test data, 349k terms (91%) occur <10 times.
```
$ wc -l files/texts/wordlist/wordlist.txt
384163 files/texts/wordlist/wordlist.txt
$ grep -Pc "\t\d$" files/texts/wordlist/wordlist.txt
349418
```
1. Is it worth having a feature to truncate the far-afield long tail of the wordlist? What adverse effects would there be? I can imagine losing a unique word that only occurs in a particular volume (Golgafrinchan), but I'm not sure what the Bookworm use case would be where that word would be missed.
2. How difficult would it be to do so? Would this require any grepping, or is it simply a case of editing wordlist.txt (i.e. does indexing ignore anything not in wordlist.txt)?
Ah, I had no idea. So anything that's not in wordlist.txt is ignored then?
1 million is a huge cutoff; I might trim it manually based on corpus occurrences (anything that shows up only once overall). The problem, of course, is that those OCR errors are presumably still tokenized properly, so they contribute to more realistic total word counts. Not that I expect it would be problematic to count "unknown" words when they're not in the word list.
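A minimal sketch of that kind of trim, assuming the last tab-separated field of wordlist.txt is the occurrence count (as the grep above suggests) and picking 10 as an arbitrary threshold -- the real column layout may differ:

```
# Keep only wordlist entries whose last tab-separated field (assumed here to be
# the corpus count) meets a minimum threshold; layout and threshold are guesses.
MIN=10
awk -F'\t' -v min="$MIN" '($NF + 0) >= min' files/texts/wordlist/wordlist.txt \
    > files/texts/wordlist/wordlist.trimmed.txt
mv files/texts/wordlist/wordlist.trimmed.txt files/texts/wordlist/wordlist.txt
```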
Apologies if this posts twice.
On a phone, but briefly:
Closing--feel free to re-open if there are issues I'm missing here.
I'm reopening b/c although this happens for the standard ingest form, it doesn't necessarily happen for the new token ingest form. In the Makefile, that's now done after the fact on the results of the fast_featurecount script with a simple head. But (@organisciak) you may want to handle that differently.
I think that's a viable way to trim it. It might be cleaner to head -n the wordlist at the point of use, but that would need to be done in multiple places, so I prefer your way.
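For reference, the after-the-fact truncation described above has roughly this shape (a hedged sketch, not the actual rule from the linked commit; the intermediate file name is invented here):

```
# Hypothetical sketch: cap the sorted token counts coming out of
# fast_featurecount at the top 1,000,000 lines before they become the wordlist,
# so the trim happens once here rather than with head -n at every point of use.
# "sorted_counts.txt" is a made-up name for that intermediate output.
head -n 1000000 files/texts/wordlist/sorted_counts.txt > files/texts/wordlist/wordlist.txt
```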