NaNoGenMo / 2022

National Novel Generation Month, 2022 edition.

Cross product? #40

Open bensteinberg opened 1 year ago

bensteinberg commented 1 year ago

This is a late entry from the half-bakery, a rough idea about squashing multiple texts together. It is, I think, similar to but not the same as an idea @rebeccacremona has mentioned. I have in mind some pseudo-mathematical ideas, along with phrases like "cross product" and "convolution", though I doubt the result will be any of those.

bensteinberg commented 1 year ago

The repo cross-product includes code and an example text of some 60,539 words, made by combining Moby Dick, The History of Tom Jones, a Foundling, and Middlemarch.

An early passage reads,

“In her sure south, could would Spermacetti thinking honor.” did an was any vice; the till she of of much with sea obstinacy had covered to at. “In obedience never element force him swam, will, girl dived, may he and an chace, pains, marriage Fishes when for colour, a bookworm kind; relation, except, friend, a paint, we of Had grumbling for from with of dislike.

The mechanism produces a text with as many sentences as the input that has the fewest. Each output sentence is as long as the shortest input sentence at that position, and takes its words from each input in turn.
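A minimal sketch of that mechanism, not the code in the repo. It assumes that "alternately" means word i of an output sentence comes from input i modulo the number of inputs, at the same word position, and it uses a deliberately naive sentence splitter. The names `split_sentences`, `cross_sentence`, and `cross` are made up for illustration.

```python
import re


def split_sentences(text):
    """Naively split text into sentences, each a list of words.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part.split() for part in parts if part]


def cross_sentence(sentences):
    """Combine one sentence from each input into a single sentence.

    The result is as long as the shortest input sentence; word i is taken
    from input number i % len(sentences), at position i.
    """
    length = min(len(sentence) for sentence in sentences)
    return [sentences[i % len(sentences)][i] for i in range(length)]


def cross(texts):
    """Combine several texts, each a list of sentences (lists of words).

    zip() stops at the shortest input, so the output has as many
    sentences as the input with the fewest.
    """
    return [cross_sentence(group) for group in zip(*texts)]


if __name__ == "__main__":
    a = split_sentences("One two three four. Five six.")
    b = split_sentences("Uno dos tres. Cuatro cinco seis. Siete.")
    for sentence in cross([a, b]):
        print(" ".join(sentence))
```

Because each word keeps its original punctuation, sentence-ending marks can land mid-sentence or go missing, which is consistent with the look of the samples in this thread.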

bensteinberg commented 1 year ago

(I imagine this or something like it has been done before.)

bensteinberg commented 1 year ago

The code takes local text files as inputs. It might be nice to retrieve texts from Project Gutenberg over the network, which would be a chance to get familiar with PG's machine-readable metadata.

bensteinberg commented 1 year ago

This change allows the use of Project Gutenberg text numbers as inputs, caching metadata and text files. The program is now somewhat more error-prone. There is no cache invalidation.
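A rough sketch of what caching by text number could look like, under stated assumptions: the function names, the cache layout (one `<number>.txt` file per work), and the download URL pattern are all illustrative guesses, not taken from the repo. As in the description above, nothing ever invalidates the cache.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_from_gutenberg(number):
    """Download the plain text of a work by its Project Gutenberg number.

    Assumption: the text is served at this URL pattern; check the
    Project Gutenberg documentation before relying on it.
    """
    url = f"https://www.gutenberg.org/cache/epub/{number}/pg{number}.txt"
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def cached_text(number, fetch=fetch_from_gutenberg, cache_dir="cache"):
    """Return the text for a number, calling fetch only on a cache miss.

    There is no cache invalidation: once a file exists it is always used.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{number}.txt"
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(number)
    path.write_text(text, encoding="utf-8")
    return text
```

Passing `fetch` as a parameter keeps the network out of the caching logic, so the cache can be exercised with a stand-in function.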

I went down the wrong path at first, beating my head against XPath and lxml until I realized that the catalog file hadn't been updated since 2014. The current catalog, a CSV file, is much easier to deal with (though I'm not using it at the moment), but the head-beating was useful, as I still had to handle the individual works' RDF files.
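For the per-work RDF files, the standard library's `xml.etree.ElementTree` may be enough. The sketch below pulls a title out of a hand-written, heavily simplified fragment. The fragment is not a real Project Gutenberg file, and the element layout (`pgterms:ebook` containing `dcterms:title`) is an assumption about the format. `title_of` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the path expression below.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Hand-written, simplified stand-in for a work's RDF file (an assumption,
# not a copy of real metadata).
SAMPLE_RDF = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/2600">
    <dcterms:title>War and Peace</dcterms:title>
  </pgterms:ebook>
</rdf:RDF>
"""


def title_of(rdf_xml):
    """Return the work's title, or None if the expected element is missing."""
    root = ET.fromstring(rdf_xml)
    element = root.find("pgterms:ebook/dcterms:title", NS)
    return element.text if element is not None else None


if __name__ == "__main__":
    print(title_of(SAMPLE_RDF))
```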

bensteinberg commented 1 year ago

Almost any result is fun:

$ poetry run cross 5678 9987 9101 | head
Let they pie why fairy.
What is.

"Vows!" COURTSHIP a nurse, for.
"Is these 100 we.
To-day the the all that the there Eochaid them one stronghold to you in table-spoonful but the and but Fremain of "Never," moreover them.
"Worse a another returned.
To-day the drain be that and as of in you, to water, it that bear another not.
What Eochaid Oysters going the own.
That at quart young, of oysters.

bensteinberg commented 1 year ago

Another sample output, of about 65,526 words, was produced by squashing War and Peace, Crime and Punishment, and Anna Karenina:

$ poetry run cross 2600 2554 1399 > war-crime-karenina.txt

from which

You I. They that’s you. They going jumped the going. How. In not said. In have am, years, be don’t be.

I've also added some input validation; I think I'll call this done.