catseye / NaNoGenLab

Experiments conducted for NaNoGenMo 2014
https://github.com/catseye/NaNoGenLab#nanogenlab
The Unlicense

Extracting the reusable bits #1

Closed (cpressey closed this 11 months ago)

cpressey commented 9 years ago

If you are watching this, please feel free to vote on what you'd like to see. Otherwise, these are just some notes to myself.

Basic idea

Find the bits that could be reused and repackage them for reusability.

Two general kinds of reuse

Almost everything here is Python, and that has the good fortune to be usable either way:

Thing is, with the first way, the tools can interoperate with generators and other tools written in other languages.

Other thing is that, for the second, the directory that the package is in has to be on your PYTHONPATH. (Then this degenerates into a discussion about how best to do that, and how I don't like almost every packaging system, etc.)

So the first way should generally be preferred, but ideally, support both.
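
A minimal sketch of how a single Python file can support both kinds of reuse, assuming the two ways are (1) running it as a stand-alone command-line tool and (2) importing it as a module. The file name `shout.py`, the function `shout` and the `-` convention for stdin are all hypothetical, made up for illustration; none of this is taken from the lab's actual code.

```python
import sys


def shout(text):
    """Hypothetical text filter: return the text in upper case."""
    return text.upper()


def main(args, stdin=sys.stdin, stdout=sys.stdout):
    """Command-line use: each argument is a filename; '-' means stdin."""
    if not args:
        sys.stderr.write("usage: shout.py FILE [FILE...]  ('-' reads stdin)\n")
        return
    for name in args:
        if name == '-':
            stdout.write(shout(stdin.read()))
        else:
            with open(name) as f:
                stdout.write(shout(f.read()))


# Runs only when the file is executed as a script, not when it is imported.
if __name__ == '__main__':
    main(sys.argv[1:])
```

Used the first way, a generator in any language can pipe text through it, e.g. `python shout.py - < in.txt`. Used the second way, with the file's directory on `PYTHONPATH`, a Python script would do `from shout import shout`.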

Possible things to extract

Could keep them in this repo, but it's a "lab". Probably better to put them in a different (and new) repo.

What to call package?

christiaanw commented 9 years ago

What kind of submodules are you thinking of? I see there's cleaning functionality for source text (guten-gutter), but another source of usable text is the Internet Archive, and that also needs cleaning up. You've also done some Markov chains in the lab, so that's an obvious one. And then you mention post-processing, which mainly is spell-checking and stuff like correcting a/an? That could be extended with some verb correction, or possibly into a sentence builder taking subject, object, indirect object, adjectives, verb(s)?

So then you have a pipeline: preprocess, remix, postprocess. What kinds of remixing would be interesting? Also things like poetic-inventory... You've done spoonerisms, which makes me think of homophones, for which there already is some code floating around.
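
A rough sketch of what such a preprocess, remix, postprocess pipeline might look like, with a naive a/an correction as the post-processing step. Every name here (`preprocess`, `remix`, `fix_articles`, `run_pipeline`) is hypothetical, and the word shuffle is only a stand-in for a real remixer such as a Markov chain; this is not how the lab's experiments are actually structured.

```python
import random
import re

# Matches a stand-alone "a"/"an" followed by whitespace and a letter.
_ARTICLE = re.compile(r'\b(an?)(\s+)(?=([A-Za-z]))', re.IGNORECASE)


def preprocess(text):
    """Normalise whitespace by splitting the text into words."""
    return text.split()


def remix(words, rng):
    """Stand-in remixer: shuffle the words (a Markov chain could go here)."""
    words = list(words)
    rng.shuffle(words)
    return words


def fix_articles(text):
    """Naive a/an fix: looks only at the first letter of the next word,
    so it gets cases like "an hour" or "a unicorn" wrong."""
    def repl(match):
        article, space, next_letter = match.group(1, 2, 3)
        fixed = 'an' if next_letter.lower() in 'aeiou' else 'a'
        if article[0].isupper():
            fixed = fixed.capitalize()
        return fixed + space
    return _ARTICLE.sub(repl, text)


def postprocess(words):
    """Join the words back into text and correct the articles."""
    return fix_articles(' '.join(words))


def run_pipeline(text, seed=0):
    """Chain the three stages; the seed makes the remix repeatable."""
    rng = random.Random(seed)
    return postprocess(remix(preprocess(text), rng))
```

Because each stage is a plain function from text (or words) to text (or words), any one of them could be swapped out, or wrapped as its own command-line filter, without touching the others.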

I would posit that much of what is happening here is text remixing, or procedural generation, which is different from what is happening in The Swallows. Is that something that might be included (if you'd like to make a package that's for text generation what NLTK is for NLP, you could)?

And a possible question is whether you'd want to include functionality based on syntactic parsing...

cpressey commented 9 years ago

@christiaanw Funny story! I started thinking about pipelines, and then I started thinking probably too deeply about pipelines, and I pinball'ed around for a bit and came to the conclusion that I need to design and implement the next version of Treacle. Which doesn't make a lot of sense, because term rewriting doesn't have anything to do with pipelines, does it?

(You can see the thread on the generativetext forum for more information, although I stopped writing down my thoughts at some point because I'm not sure where I'm going. And feel free to chip in on any of those threads...)

But to try to give a better answer -- anything I would be motivated, myself, to extract would be of general applicability (so the spoonerizer, for example, is not in the cards -- although it's public domain so if someone else wants to do that, please, by all means.) Right now that basically encompasses:

The first and third of those are text filters, but thinking about generalizing into a pipeline for text remixing is what set me off on a zany trajectory. So it might be better to just think of them as two separate and unrelated text filters, maybe even put each in its own repository.

cpressey commented 8 years ago

Well, I said I'd do something with this, so, with NaNoGenMo 2015 fast approaching, I did eventually do something with this.

I abandoned the idea of exposing the over-engineered pipeline framework as a library, but it's still in the first two.

They all require Python 2.x (tested on Python 2.7.6 but will probably still work on some earlier versions) but only seedbank requires that you write your script in Python - the other two can be used as stand-alone tools.

They're all in the public domain. Bug reports (as issues) and pull requests for reasonable enhancements are welcome.