Closed cpressey closed 11 months ago
What kind of submodules are you thinking of? I see there's cleaning functionality for source text (guten-gutter), but another source of usable text is the Internet Archive, and that also needs cleaning up. You've also done some Markov chains in the lab, so that's an obvious one. And then you mention post-processing, which is mainly spell-checking and things like correcting a/an? That could be extended with some verb correcting, or possibly into a sentence builder taking subject, object, indirect object, adjectives, verb(s)?
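To make the a/an idea concrete, here's a minimal sketch of what such a post-processing pass might look like. This is hypothetical illustration, not code from the lab; it naively approximates "vowel sound" by checking the first letter, so words like "hour" or "unicorn" would need an exception list.

```python
import re

def fix_articles(text):
    # Rewrite "a"/"an" based on the first letter of the following word.
    # (A real corrector would key on pronunciation, not spelling.)
    def fix(match):
        article, word = match.group(1), match.group(2)
        fixed = "an" if word[0].lower() in "aeiou" else "a"
        if article[0].isupper():
            fixed = fixed.capitalize()
        return fixed + " " + word
    return re.sub(r"\b([Aa]n?) (\w+)", fix, text)

print(fix_articles("He ate a apple and an banana."))
# He ate an apple and a banana.
```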
So then you have a pipeline: preprocess, remix, postprocess. What kinds of remixing would be interesting? Also things like poetic-inventory... You've done spoonerisms, which makes me think of homophones, for which there already is some code floating around.
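The preprocess/remix/postprocess pipeline could be sketched as a chain of text filters, i.e. plain str-to-str functions composed in order. The specific filters below are made-up placeholders, not modules from the lab:

```python
def pipeline(*filters):
    # Compose text filters left-to-right into a single str -> str function.
    def run(text):
        for f in filters:
            text = f(text)
        return text
    return run

# Hypothetical stand-ins for the three stages:
strip_whitespace = lambda t: " ".join(t.split())           # preprocess
shout = lambda t: t.upper()                                # "remix"
add_period = lambda t: t if t.endswith(".") else t + "."   # postprocess

remix = pipeline(strip_whitespace, shout, add_period)
print(remix("  hello   world "))
# HELLO WORLD.
```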
I would posit that much of what is happening here is text remixing, or procedural generation, which is different from what is happening in The Swallows. Is that something that might be included (if you'd like to make a package that's for text generation what NLTK is for NLP, you could)?
And a possible question is whether you'd want to include functionality based on syntactic parsing...
@christiaanw Funny story! I started thinking about pipelines, and then I started thinking probably too deeply about pipelines, and I pinball'ed around for a bit and came to the conclusion that I need to design and implement the next version of Treacle. Which doesn't make a lot of sense, because term rewriting doesn't have anything to do with pipelines, does it?
(You can see the thread on the generativetext forum for more information, although I stopped writing down my thoughts at some point because I'm not sure where I'm going. And feel free to chip in on any of those threads...)
But to try to give a better answer -- anything I would be motivated, myself, to extract would be of general applicability (so the spoonerizer, for example, is not in the cards -- although it's public domain so if someone else wants to do that, please, by all means.) Right now that basically encompasses:
The first and third of those are text filters, but thinking about generalizing into a pipeline for text remixing is what set me off on a zany trajectory. So it might be better to just think of them as two separate and unrelated text filters, maybe even put each of them in its own repository.
Well, I said I'd do something with this, so, with NaNoGenMo 2015 fast approaching, I did eventually do something with this.
I abandoned the idea of exposing the over-engineered pipeline framework as a library, but it's still in the first two.
They all require Python 2.x (tested on Python 2.7.6, but they will probably still work on some earlier versions), but only seedbank requires that you write your script in Python - the other two can be used as stand-alone tools.
They're all in the public domain. Issues for bug reports, and pull requests for reasonable enhancements, are welcome.
If you are watching this, please feel free to vote on what you'd like to see. Otherwise, this is just some notes to myself.
Basic idea
Find the bits that could be reused and repackage them for reusability.
Two general kinds of reuse
Almost everything here is Python, and that has the good fortune to be usable either way:

```
python package/foo.py blah blah
```

to use it as a tool, or

```
from package import foo
```

to use it in a Python script.

Thing is, the first way, the tools can inter-operate with generators/other tools in other languages.

Other thing is that, for the second, the directory that `package` is in has to be on your PYTHONPATH. (Then this degenerates into a discussion about how best to do that, and how I don't like almost every packaging system, etc.)

So the first way should generally be preferred, but ideally, support both.
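Supporting both uses from one file is the standard `if __name__ == "__main__":` pattern. A minimal sketch (the module name and the `transform` filter are illustrative, not actual lab code):

```python
import sys

def transform(text):
    """The actual text filter -- importable via `from package import foo`."""
    return text.replace("teh", "the")

if __name__ == "__main__":
    # As a stand-alone tool, e.g.: python package/foo.py teh text here
    # (a real filter would more likely read sys.stdin and write
    # sys.stdout, so it can sit in a pipeline with tools written in
    # other languages)
    print(transform(" ".join(sys.argv[1:])))
```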
Possible things to extract
What repo to put 'em in?
Could keep them in this repo, but it's a "lab". Probably better to put them in a different (and new) repo.
What to call `package`?