Closed cpressey closed 11 months ago
What kind of submodules are you thinking of? I see there's cleaning functionality for source text (guten-gutter), but another source of usable text is the Internet Archive, and that also needs cleaning up. You've also done some Markov chains in the lab, so that's an obvious one. And then you mention post-processing, which is mainly spell-checking and things like correcting a/an? That could be extended with some verb correcting, or possibly into a sentence builder taking subject, object, indirect object, adjectives, verb(s)?
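To make the a/an idea concrete, here's a minimal sketch of what such a post-processing pass might look like. This is hypothetical illustration, not code from the lab; it naively approximates "vowel sound" by checking the first letter, so words like "hour" or "unicorn" would need an exception list.

```python
import re

def fix_articles(text):
    # Rewrite "a"/"an" based on the first letter of the following word.
    # (A real corrector would key on pronunciation, not spelling.)
    def fix(match):
        article, word = match.group(1), match.group(2)
        fixed = "an" if word[0].lower() in "aeiou" else "a"
        if article[0].isupper():
            fixed = fixed.capitalize()
        return fixed + " " + word
    return re.sub(r"\b([Aa]n?) (\w+)", fix, text)

print(fix_articles("He ate a apple and an banana."))
# He ate an apple and a banana.
```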
So then you have a pipeline: preprocess, remix, postprocess. What kinds of remixing would be interesting? Also things like poetic-inventory... You've done spoonerisms, which makes me think of homophones, for which there already is some code floating around.
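The preprocess/remix/postprocess pipeline could be sketched as a chain of text filters, i.e. plain str-to-str functions composed in order. The specific filters below are made-up placeholders, not modules from the lab:

```python
def pipeline(*filters):
    # Compose text filters left-to-right into a single str -> str function.
    def run(text):
        for f in filters:
            text = f(text)
        return text
    return run

# Hypothetical stand-ins for the three stages:
strip_whitespace = lambda t: " ".join(t.split())           # preprocess
shout = lambda t: t.upper()                                # "remix"
add_period = lambda t: t if t.endswith(".") else t + "."   # postprocess

remix = pipeline(strip_whitespace, shout, add_period)
print(remix("  hello   world "))
# HELLO WORLD.
```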
I would posit that much of what is happening here is text remixing, or procedural generation, which is different from what is happening in The Swallows. Is that something that might be included (if you'd like to make a package that's for text generation what NLTK is for NLP, you could)?
And a possible question is whether you'd want to include functionality based on syntactic parsing...
@christiaanw Funny story! I started thinking about pipelines, and then I started thinking probably too deeply about pipelines, and I pinball'ed around for a bit and came to the conclusion that I need to design and implement the next version of Treacle. Which doesn't make a lot of sense, because term rewriting doesn't have anything to do with pipelines, does it?
(You can see the thread on the generativetext forum for more information, although I stopped writing down my thoughts at some point because I'm not sure where I'm going. And feel free to chip in on any of those threads...)
But to try to give a better answer -- anything I would be motivated, myself, to extract would be of general applicability (so the spoonerizer, for example, is not in the cards -- although it's public domain so if someone else wants to do that, please, by all means.) Right now that basically encompasses:
The first and third of those are text filters, but thinking about generalizing into a pipeline for text remixing is what set me off on a zany trajectory. So it might be better to just think of them as two separate and unrelated text filters, maybe even put each of them in its own repository.
Well, I said I'd do something with this, so, with NaNoGenMo 2015 fast approaching, I did eventually do something with this.
I abandoned the idea of exposing the over-engineered pipeline framework as a library, but it's still in the first two.
They all require Python 2.x (tested on Python 2.7.6, but they will probably still work on some earlier versions), but only seedbank requires that you write your script in Python - the other two can be used as stand-alone tools.
They're all in the public domain. Issues for bug reports, and pull requests for reasonable enhancements, are welcome.
If you are watching this, please feel free to vote on what you'd like to see. Otherwise, this is just some notes to myself.
Basic idea
Find the bits that could be reused and repackage them for reusability.
Two general kinds of reuse
Almost everything here is Python, and that has the good fortune to be usable either way:

```
python package/foo.py blah blah
```

to use it as a tool, or

```
from package import foo
```

to use it in a Python script.

Thing is, the first way, the tools can inter-operate with generators/other tools in other languages.

Other thing is that, for the second, the directory that `package` is in has to be on your PYTHONPATH. (Then this degenerates into a discussion about how best to do that, and how I don't like almost every packaging system, etc.)

So the first way should generally be preferred, but ideally, support both.
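Supporting both uses from one file is the standard `if __name__ == "__main__":` pattern. A minimal sketch (the module name and the `transform` filter are illustrative, not actual lab code):

```python
import sys

def transform(text):
    """The actual text filter -- importable via `from package import foo`."""
    return text.replace("teh", "the")

if __name__ == "__main__":
    # As a stand-alone tool, e.g.: python package/foo.py teh text here
    # (a real filter would more likely read sys.stdin and write
    # sys.stdout, so it can sit in a pipeline with tools written in
    # other languages)
    print(transform(" ".join(sys.argv[1:])))
```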
Possible things to extract
What repo to put 'em in?
Could keep them in this repo, but it's a "lab". Probably better to put them in a different (and new) repo.
What to call `package`?