Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Restoring fast feature counting #89

Closed bmschmidt closed 5 years ago

bmschmidt commented 8 years ago

In wrapping up all the various bookworm calls into a command-line executable over the summer, I removed the ability to ingest unigrams.

I've now restored that, but without the system calls to @organisciak's "fast_featurecounter.sh". Instead, it just calls a relocated version of his (older?) function write_word_ids_from_feature_counts.

For rebuilds of Hathi, I'm not so worried about this: what we really want to do is not rebuild the vocabulary list at all, but instead to just use the file that we've now created.

For JSTOR DFR, the Underwood corpus, or other potential feature-count bookworms, however, we may want the faster version. I don't really know how much slower the current approach is.

My preference would be to redefine that function so that the external wrappers keep working. But we could also just switch back to using the Makefile to dispatch if that is easier. (It may be, because the current version is configured to read from stdin.)
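For illustration, a minimal sketch of what a stream-based redefinition might look like. This is not the code in the repository: the input format (one `word<TAB>count` pair per line), the output format (`id<TAB>word<TAB>count`), and the `max_words` parameter are all assumptions made for the example. Only the function name comes from the discussion above.

```python
import io
import sys
from collections import Counter


def write_word_ids_from_feature_counts(infile=None, outfile=None, max_words=1_000_000):
    """Hypothetical sketch: sum per-word counts from a stream and assign word IDs.

    Assumes each input line is "word<TAB>count"; this format is an assumption,
    not the documented BookwormDB input. Returns the number of words written.
    """
    infile = sys.stdin if infile is None else infile
    outfile = sys.stdout if outfile is None else outfile

    totals = Counter()
    for line in infile:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip lines that are not word<TAB>count
        word, raw_count = parts
        try:
            count = int(raw_count)
        except ValueError:
            continue  # skip lines whose count is not an integer
        totals[word] += count

    # Most frequent words get the lowest IDs; ties are broken alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:max_words]
    for word_id, (word, count) in enumerate(ranked, start=1):
        outfile.write(f"{word_id}\t{word}\t{count}\n")
    return len(ranked)


if __name__ == "__main__":
    # Small in-memory demonstration instead of reading real stdin.
    demo_in = io.StringIO("the\t3\ncat\t1\nthe\t2\n")
    write_word_ids_from_feature_counts(demo_in, sys.stdout)
```

Because the stream arguments default to stdin and stdout, a wrapper that pipes feature counts into the process could keep working, while a Makefile-style dispatch could pass explicit file handles instead.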

bmschmidt commented 5 years ago

Folding this into https://github.com/Bookworm-project/BookwormDB/issues/134