Bookworm-project / BookwormDB

Tools for text tokenization and encoding

Ignoring low-occurrence terms #57

Open · organisciak opened this issue 9 years ago

organisciak commented 9 years ago

Working with the HTRC data, which is OCR'd from book scans, a sizeable portion of the wordlist is simply OCR errors. While some errors are meaningful (e.g. to estimate usage of the medial s), most of that index is pointless. On my test data, 349k terms (91%) occur <10 times.

$ wc -l files/texts/wordlist/wordlist.txt
384163 files/texts/wordlist/wordlist.txt
$ grep -Pc "\t\d$" files/texts/wordlist/wordlist.txt
349418
  1. Is it worth having a feature to truncate the far end of the long tail in the wordlist (a sketch of what I mean follows this list)? What adverse effects would there be? I can imagine losing a unique word that occurs in only a single volume (Golgafrinchan), but I'm not sure what the bookworm use case would be where that word would be missed.
  2. How difficult would it be to do so? Would this require any grepping, or is it simply a case of editing wordlist.txt (i.e. does indexing ignore anything not in wordlist.txt)?
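
For concreteness, a minimal sketch of the trim I have in mind, assuming the count is the last tab-separated field of each wordlist.txt line (as the grep above implies); the output filename is just for illustration:

# keep only terms that occur 10 or more times; assumes the count is
# the last tab-separated field on each line
awk -F'\t' '($NF + 0) >= 10' files/texts/wordlist/wordlist.txt > wordlist.trimmed.txt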
bmschmidt commented 9 years ago

This is already happening: wordlist.txt is capped, but with a cutoff of the top 1 million tokens, so maybe the cutoff just isn't being hit in this set yet?

organisciak commented 9 years ago

Ah, I had no idea. So anything that's not in wordlist.txt is ignored then?

1 million is a huge cutoff; I might trim it manually based on corpus occurrences (anything that shows up only once overall). The problem, of course, is that those OCR errors are presumably still tokenized properly, so they contribute to more realistic total word counts. Not that I expect it would be problematic to count "unknown" words when they're not in the word list.
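
To gauge how much of the total token mass those errors actually carry, something like this would do (same count-in-the-last-field assumption as above):

# sum the token mass of terms seen fewer than 10 times vs. the whole list
awk -F'\t' '{ total += $NF; if ($NF < 10) tail += $NF }
  END { printf "%d of %d tokens (%.2f%%) belong to terms seen <10 times\n", tail, total, 100 * tail / total }' files/texts/wordlist/wordlist.txt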

bmschmidt commented 9 years ago

Apologies if this posts twice.

On a phone, but briefly:

  1. Anything not in wordlist.txt is ignored.
  2. Wordlist.txt is ordinarily created from the corpus, but it is possible to use one from another corpus (or from a sample of the full corpus) for faster creation. Bookworm tokenization is not completely aligned with 2009 Google Ngrams tokenization, but for Hathi, part of the preparation might well be using an established wordlist.
  3. Google Ngrams contains about 9 million 1-grams, so 1 million is pretty extensive. For my own purposes I'd usually be willing to accept about the top 200,000, which I blogged about years ago; a sketch of that kind of cutoff follows this list. For a very heterogeneous corpus, though, it's nice to have a broad cutoff to catch technical terms and rare last names.
  4. On the movie bookworm, some of the most frequent search terms are one-time-only words: cromulent, darmok, embiggens. And on very small bookworms (the Federalist Papers) it can be important to be able to search for single occurrences.
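
The cutoff sketch for point 3, assuming a two-column word<TAB>count layout, with a sort first in case the list isn't already frequency-ordered:

# sort by the count column (numeric, descending) and keep the top 200,000
sort -t$'\t' -k2,2nr files/texts/wordlist/wordlist.txt | head -n 200000 > wordlist.200k.txt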
bmschmidt commented 9 years ago

Closing; feel free to re-open if there are issues I'm missing here.

bmschmidt commented 9 years ago

I'm reopening because although this happens for the standard ingest form, it doesn't necessarily happen for the new token ingest form. In the Makefile, that's now done after the fact on the results of the fast_featurecount script with a simple head (https://github.com/Bookworm-project/BookwormDB/commit/aec6c6748303c9a2163727695aa9671d3cdc853c). But (@organisciak) you may want to handle that differently.
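
In shell terms, that step amounts to something like the line below; the filenames here are illustrative, not the actual Makefile's:

# illustrative equivalent of the Makefile step: cap the (frequency-sorted)
# feature counts at the top 1,000,000 tokens with a plain head
head -n 1000000 featurecounts.txt > wordlist.txt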

organisciak commented 9 years ago

I think that's a viable way to trim it. It might be cleaner to apply head -n at the point where the wordlist is used, but that would need to be done in multiple places, so I prefer your way.
