YaleDHLab / intertext

Detect and visualize text reuse
https://duhaime.s3.amazonaws.com/yale-dh-lab/intertext/demo/index.html

Consider nonconsumptive for corpus ingest/management #76

Open · bmschmidt opened this issue 3 years ago

bmschmidt commented 3 years ago

Hi guys, hope you don't mind if I spam you with a product placement.

I've been working on a complete rewrite of the tokenization and corpus management parts of bookworm into a standalone package called nonconsumptive. https://github.com/bmschmidt/nonconsumptive.

The most important part is a tokenization rewrite that probably doesn't matter for this project. But it also has a reasonably decent corpus ingest function. I thought of it for this project while looking at the ingest files @pleonard212 put up for the Yale speeches. The goal there is to pull any of a variety of input formats (mallet-like, ndjson, keyed filenames) into a compressed, mem-mappable single file suitable for all sorts of downstream uses. I'm using it for scatterplots and bookworms, but I think it would also suit your use case here.
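
To make that concrete, here's a minimal sketch of the single-file idea using pyarrow directly; the actual nonconsumptive API and on-disk layout may differ, and the filenames/fields are made up:

```python
# Sketch only: the real nonconsumptive ingest may look quite different.
import pyarrow.feather as feather
import pyarrow.json as paj

# Read a newline-delimited JSON corpus: one document per line,
# e.g. {"id": "...", "text": "..."} (hypothetical fields).
table = paj.read_json("corpus.ndjson")

# Write a single compressed file...
feather.write_feather(table, "corpus.feather", compression="zstd")

# ...that can be re-opened memory-mapped, without pulling the
# whole corpus into RAM.
table = feather.read_table("corpus.feather", memory_map=True)
```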

Can't remember how much I've already waxed lyrical to @duhaime about the parquet/feather ecosystem, but it's a hugely useful paradigm.

In the immediate term, the prime benefit for you would be that you'd be able to support ingest from a CSV file as well as the JSON format you currently have defined. Great, I guess. But on your side and mine, the real benefit would be that corpora built this way would be transportable to other systems at no cost, and that I could easily drop some intertext hooks into any previously built bookworms. Since the parquet/feather formats are quite portable, at the scale you're working at it might even be possible to bundle it into some kind of upload service.
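
The CSV path would land in the same Arrow representation as the JSON one; a hedged sketch, again assuming pyarrow, with hypothetical filenames:

```python
# The CSV route produces the same pyarrow.Table as the ndjson route above.
import pandas as pd
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

table = pacsv.read_csv("corpus.csv")     # same Table type as the JSON path
pq.write_table(table, "corpus.parquet")  # parquet for cross-tool transport

# The identical file then opens anywhere in the ecosystem:
df = pd.read_parquet("corpus.parquet")   # Python
# arrow::read_parquet("corpus.parquet")  # R, via the arrow package
```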

Let me know if you want to talk more.

pleonard212 commented 3 years ago

I think this would be a great idea, as I'm often running several different tools (mallet, Intertext, philoline, etc.) on the same set of texts. I've resorted to keeping metadata in SQL tables (or nowadays, CSVs -> pandas dataframes) and writing export logic for slightly varying contexts and use cases...
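
For illustration, the pain point looks something like this (a hypothetical sketch; the column names and output formats per tool are invented, except MALLET's one-document-per-line import format):

```python
# One metadata table, hand-rolled export logic per downstream tool.
import pandas as pd

meta = pd.read_csv("metadata.csv")

# MALLET's import-file format: "id<TAB>label<TAB>text" per line...
with open("mallet_input.tsv", "w") as f:
    for row in meta.itertuples():
        f.write(f"{row.id}\t{row.label}\t{row.text}\n")

# ...while intertext wants its own JSON, philoline something else again,
# and each export script has to be kept in sync with the master table.
```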

bmschmidt commented 3 years ago

Yeah, this is what I'm trying to define, and what it would be great to do in concert. CSVs are too lossy on datatypes and don't support list-valued fields, both of which are indispensable. No one is doing this yet AFAICT, but parquet and feather have the infrastructure to support full linked-open-data (LOD) interop on metadata too, which means it might even be possible to get some librarians on board.
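
A quick sketch of what CSV can't carry: an explicit schema with a typed column and a list-valued column, round-tripped losslessly through parquet (assuming pyarrow; the fields are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("id", pa.string()),
    ("year", pa.int32()),                # stays an integer, not a string
    ("authors", pa.list_(pa.string())),  # a list column, impossible in CSV
])
table = pa.table(
    {"id": ["doc1"], "year": [1859], "authors": [["Mill", "Taylor"]]},
    schema=schema,
)
pq.write_table(table, "metadata.parquet")
assert pq.read_table("metadata.parquet").schema.equals(schema)  # types survive
```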

And parquet/feather are nice because they're not platform- or language-dependent, read into pandas or R far faster than CSVs, etc.
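
Rough illustration of the read-speed point; absolute numbers will vary with the corpus, but parquet is typically several times faster than parsing the equivalent CSV:

```python
import time
import pandas as pd

t0 = time.time(); pd.read_csv("metadata.csv"); csv_s = time.time() - t0
t0 = time.time(); pd.read_parquet("metadata.parquet"); pqt_s = time.time() - t0
print(f"csv: {csv_s:.2f}s   parquet: {pqt_s:.2f}s")

# The same parquet file reads in R with arrow::read_parquet("metadata.parquet"),
# no export step required.
```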