Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Allow ingest from feature count files #48

Open bmschmidt opened 9 years ago

bmschmidt commented 9 years ago

In addition to ingesting from raw text, we should be able to ingest from feature count files. Some examples include:

  1. JSTOR Data for Research (DfR) files (I've created a repo to work with them here).
  2. Google Ngrams counts files.
  3. The internal Hathi feature count files.
  4. Things like Ted Underwood's page-level feature token counts.

My last commit started this process. One issue is that there is no standard syntax for a "feature count" file, so we need to create one. For the time being, I'm assuming:

  1. named pipes in the directory one above the bookworm installation, containing
  2. TSV-formatted rows with the columns "documentid", "feature", "count".

That's a little wonky, but it should be easy for any specific codebase to use awk or whatever to transform the original format into this.
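For concreteness, something along these lines would do the transform (a sketch only: the JSON input layout and file name are made up for illustration, and the output is just the three-column stream described above):

    # Sketch: turn a hypothetical JSON file shaped like
    #   {"doc1": {"the": 100, "whale": 3}, "doc2": {...}}
    # into "documentid<TAB>feature<TAB>count" lines on stdout, which could
    # then be directed at the named pipe. The input format is illustrative.
    import json
    import sys

    with open("feature_counts.json") as f:      # hypothetical input file
        counts = json.load(f)

    for documentid, features in counts.items():
        for feature, count in features.items():
            sys.stdout.write("%s\t%s\t%d\n" % (documentid, feature, count))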

One interesting side-effect is that features need not be "tokens," just arbitrary strings with any possible meaning. But that will require more granular control over whether you're building unigrams, bigrams, trigrams, etc.; most generic feature counts only make sense as unigrams.

organisciak commented 9 years ago

Cool. I started this process with the HTRC extracted feature counts. I processed the input, but haven't tried slurping it into the DB yet.

Looking through the code, it looks ready to toy with; is this correct? The one part I'm not clear about is line 242 of tokenizer.py: why does nothing happen if the documentid has been seen? Don't we want to have documentid+feature?

bmschmidt commented 9 years ago

Oh yeah, that's almost certainly wrong. Misguided analogy to the earlier format.

I think this could be safely removed altogether for the time being--the intent in the other code is to allow you to run make on a Bookworm and not waste time re-encoding files.

That would be useful here, too, and we could use a feature request to reintroduce it. But it's slightly more difficult, because the "unigrams" and "bigrams" files are read in separately.

bmschmidt commented 9 years ago

But yes, it's probably good to toy with. I'm waiting on the JSTOR DfR request I'm using as a sample data source in my grad course before playing with it much more.

bmschmidt commented 9 years ago

Just dropped the offending line in commit 6c6fa17c93136ac551472e61254e1fa59aac3f08.

Looking at my JSTOR code, I think it would probably make more sense to write an actual physical file rather than a pipe where disk space allows, at least for the unigrams.txt file; that one file needs to be read twice, not once, during the bookworm creation process.
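As a small illustration of the trade-off (a sketch with made-up paths, not project code): a regular file can be read as many times as the build needs, while a named pipe is consumed as it is read and would have to be refilled by the producer for a second pass.

    # Sketch: a regular file supports multiple read passes; a named pipe
    # does not, so anything read twice during the build is better off as a
    # physical file when disk space allows. Paths here are illustrative.
    path = "unigrams_example.txt"           # hypothetical stand-in
    with open(path, "w") as f:
        f.write("doc1\tthe\t100\n")

    for _ in range(2):                      # two passes, as in the build
        with open(path) as f:
            assert f.read() == "doc1\tthe\t100\n"

    # A FIFO (os.mkfifo(path)) would block the second pass until a writer
    # refilled it, so it only suits data that is consumed exactly once.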

organisciak commented 9 years ago

Regarding commit 6c6fa17c93136ac551472e61254e1fa59aac3f08, I assume that means that we don't even need to load the seen file list, correct? I'll push a change shortly.

However, I must apologize that I don't quite grasp the seen file list. What is being encoded specifically, and what is it meant to protect against? Could you give me some more details?

In the commit log, you mention that the ramification of removing the check is that you have to run make clean each time: is it simply meant to let you check for new files between ingests?

I ask for two reasons. The first is that I'm trying to make sense of the ramifications of running encodePreTokenizedStream() in parallel.

The second is that getting the seen list is a big bottleneck when multiple processes are running (when that's even necessary, which it currently doesn't seem to be). Here's a chart of the increase in the time for each call to getAlreadySeenList("files/texts/encoded/completed"): after half an hour it's taking 5 minutes each time.

[chart of per-call getAlreadySeenList timings omitted]

organisciak commented 9 years ago

So, if I want to run encodePreTokenizedStream() in parallel, do I have to make sure that each batch of piped-in lines (doc\tfeature\tcount) has all the features for a given doc, so that two batches aren't submitting different parts of the same doc separately? Or is this not a problem?

organisciak commented 9 years ago

As a note relevant to this issue, I pulled 5c2af2e, which has the awk-based fast_featurecounter.sh. This can be an alternative word counting option for big projects that are using feature-based input.

This uses GNU Parallel to split up the unigram list, run a binary sort, and then tally features. It actually does this twice: if the first split gives you 1,000 temp files of 300 MB each, it will first run a sorted merge and tally over smaller batches of those 1,000 files, then do the same again to produce the final wordlist.txt. Ideally this would be recursive; currently the two GNU Parallel routines are hard-coded.
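For illustration, here's a rough Python sketch of the "sorted merge and tally" step, assuming each chunk file is already sorted by feature (the file names and layout are stand-ins, not the actual awk/GNU Parallel implementation):

    # Sketch: merge several already-sorted "feature<TAB>count" files and
    # total the counts per feature, the same job the awk tally step does.
    import heapq
    import itertools

    def read_counts(path):
        with open(path) as f:
            for line in f:
                feature, count = line.rstrip("\n").split("\t")
                yield feature, int(count)

    def merge_and_tally(paths, out_path):
        streams = [read_counts(p) for p in paths]
        merged = heapq.merge(*streams)        # inputs must be sorted by feature
        with open(out_path, "w") as out:
            for feature, rows in itertools.groupby(merged, key=lambda r: r[0]):
                out.write("%s\t%d\n" % (feature, sum(c for _, c in rows)))

    # Two tiny pre-sorted chunk files standing in for the parallel sort output:
    with open("tmp_00.tsv", "w") as f:
        f.write("and\t10\nthe\t25\n")
    with open("tmp_01.tsv", "w") as f:
        f.write("of\t7\nthe\t30\n")
    merge_and_tally(["tmp_00.tsv", "tmp_01.tsv"], "wordlist.txt")
    # wordlist.txt now holds: and 10, of 7, the 55

Since Python integers don't overflow, a variant like this would also sidestep the counter-size question raised below, at a real speed cost compared to awk.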

bmschmidt commented 9 years ago

About the seen files list:

The purpose is to keep the encoded file lists from build to build, so that if you run the encode script on 100,000 articles and then add 1,000 more, you only have to re-encode the last 1,000 (although currently you will still have to reload all 101,000 into the database, which should be changed).
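A minimal sketch of what that mechanism amounts to, with made-up paths and helper names rather than the actual getAlreadySeenList code:

    # Sketch of the incremental-encoding idea: load the ids recorded in a
    # completion log once per run, skip anything already encoded, and append
    # the ids of newly encoded documents. Names here are illustrative, not
    # BookwormDB's own implementation.
    import os

    COMPLETED_LOG = "completed_files.log"   # hypothetical location

    def load_seen(path=COMPLETED_LOG):
        if not os.path.exists(path):
            return set()
        with open(path) as f:
            return {line.strip() for line in f}

    def encode_new_documents(doc_ids, encode_one):
        seen = load_seen()                  # read once per run, not per document
        with open(COMPLETED_LOG, "a") as log:
            for doc_id in doc_ids:
                if doc_id in seen:
                    continue                # encoded in an earlier build
                encode_one(doc_id)
                log.write(doc_id + "\n")
                seen.add(doc_id)

Loading the list once per run, rather than on every call, would presumably also avoid the growing getAlreadySeenList cost charted earlier.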

You are correct there's currently no need to load that list, and it can be dropped. We should also stop writing to the files.

It should not matter whether each process has all the counts for a text.

This is premature optimization--but is this code safe in the event that the number of features is extremely large (greater than 200 million unique features)? I've encountered two sorts of errors in the past:

  1. Trying to hold a token dictionary in memory that contains more keys than the machine has memory: a merged sort on flat files should be fine here.
  2. Overflowing awk's integer size. For importing Google Ngrams, each "document" is the total corpus of books published in a single language in a year; the largest numbers (how often "the" appears) would break the counter in either awk or perl.

bmschmidt commented 9 years ago

Yeah, 4-byte integers max out at 2 billion--the word count for "the" will probably exceed this for the Hathi corpus at some point, so it matters whether this awk code is safe against that. As I said, I can't remember which language this was breaking in.

organisciak commented 9 years ago

Re #1: this is the problem this script was meant to avoid, so memory shouldn't cause problems.

Re #2: good to know; that can definitely become a problem. Awk could be replaced with something else, at a performance cost.

organisciak commented 9 years ago

On my system with GNU Awk 2.1.7, it looks like you start losing precision after 10^15: the order of magnitude stays right, but the numbers are wrong, which is odd. I expect the total count of words in the HathiTrust is in the 10^11--10^12 range. Do you want to check whether it's different on Mac OS X?

test.txt

test1 1111111111111111
test1 2222222222222222
test2 11111111111111111
test2 22222222222222222
test3 111111111111111111
test3 222222222222222222

Tallying this file:

$ cat test.txt | awk -f mergecounted.awk 
test1 3333333333333333
test2 33333333333333336
test3 333333333333333312

$ cat test.txt test.txt | sort | awk -f mergecounted.awk
test1 6666666666666666
test2 66666666666666672
test3 666666666666666624

organisciak commented 9 years ago

Not sure if I'm overloading this issue, but here's another quirk related to feature input. If there's a discrepancy between the metadata and the text, the "file XX not found in jsoncatalog.txt" warning will occur thousands of times for each doc that is missing metadata: once for each word in that doc's vocabulary. Not high priority; I expect it's worth ignoring until the seen-document list is re-implemented for feature input.

bmschmidt commented 9 years ago

On the awk thing, it looks like it's probably just overflowing into floating point, which is a decent behavior. I get the same results on my laptop (OS X 10.9) and desktop (Ubuntu 14.04).

One problem is that after a certain point it is going to completely ignore any feature counts below the precision threshold. But in practice, I can't imagine a situation where this would be an issue. (You'd need more than 1,000,000 tokens each appearing more than 1e+15 times for it to really make a difference.)
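For reference, here's a quick Python check of where that threshold sits (awk's default numeric type is a C double, and Python floats are the same IEEE 754 doubles):

    # Integers are exact in a double only up to 2**53 (about 9e15); above
    # that they snap to the nearest representable value, which is what the
    # awk tallies above are showing.
    print("exact up to:", 2 ** 53)    # 9007199254740992

    small = 1111111111111111 + 2222222222222222     # below the limit: exact
    big = 111111111111111111 + 222222222222222222   # above the limit: rounded

    print(small, int(float(small)))   # 3333333333333333 3333333333333333
    print(big, int(float(big)))       # 333333333333333333 333333333333333312

The second value reproduces the 333333333333333312 reported for test3 above, and anything in the 10^11--10^12 range stays exact.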

It's possible that the problems I was experiencing were on 32-bit perl, or some other unlikely-to-be-encountered system. Or maybe it was an old version of Debian that doesn't use gawk.

So let's accept this, but keep an eye out for failures on different architectures.