alvisespano / Polygen

The famous random sentence generator.

polypolygen #11

Open gmarcon opened 6 years ago

gmarcon commented 6 years ago

Create a polypolygen program that, given a text corpus, infers a grammar file capable of generating sentences that appear to be extracted from that corpus.

tajmone commented 6 years ago

Hi @gmarcon, this would be really cool. My guess is that it might also be quite difficult to achieve — quite possible, but surely requiring a good grounding in Natural Language Processing (NLP) and a serious amount of time.

With the clear premise that this task would be beyond my personal abilities (but quite possibly within the skills of Polygen's author @alvisespano), I did peek out of curiosity into libraries like the Python Natural Language Toolkit (NLTK).

I think that NLTK could be used to extract linguistic patterns from a text corpus, but then you'd have to find a way to establish which patterns are of interest. My guess is that you couldn't come up with a completely automated tool, but rather a tool that simplifies the task.
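As a rough illustration of what I mean (a minimal sketch, assuming NLTK is installed and its tokenizer/tagger models downloaded; the tiny corpus and all variable names are just placeholders), one could collect part-of-speech patterns and the words that filled each slot:

```python
# Minimal sketch: collect POS-tag "patterns" and the words seen in each slot.
# One-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
from collections import Counter, defaultdict
import nltk

corpus = [
    "The cat sat on the mat.",
    "The dog slept under the table.",
]

patterns = Counter()            # POS-tag sequence -> how often it occurs
vocabulary = defaultdict(set)   # POS tag -> words observed with that tag

for sentence in corpus:
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    patterns[tuple(tag for _, tag in tagged)] += 1
    for word, tag in tagged:
        vocabulary[tag].add(word.lower())

print(patterns.most_common(5))   # the most frequent sentence shapes
print(dict(vocabulary))          # candidate fillers for each slot
```

Even this toy example shows the problem: the tagger happily keeps punctuation and function words, and nothing in the output tells you which of the resulting shapes are actually worth keeping.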

Even if you extracted patterns of speech, you'd still have to decide on vocabulary: if you take away the content from a phrase, you get a pattern, but then you'd have to select appropriate words to fill in the "slots" of that pattern. If you're trying to emulate Shakespeare, you might not want to throw in random vocabulary, especially not things like "television", which didn't exist at the time (or maybe you do want to, because it could be fun? A program can't decide that for you).

I think the beauty and fun of using Polygen lies in the fact that you do that work manually. For example, you see an advertising campaign that pushes a concept with a few key phrases and slogans; then you try to define its style, and create a "meme" out of it. You then choose a series of possible variations on its slogans, and a number of words that could produce funny random combinations; and that's basically it.

More than a science, it's an art.

But it's true, there are scientists who attempt to analyse and reproduce patterns of speech from text corpora. If you find a way to get these patterns, then you don't really need a program: libraries like NLTK already support lots of *BNF variants, and adding a script to convert your findings to Polygen's EBNF notation is quite easy; the difficult part is the analysis and processing of large text corpora.
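For instance, once you have a pattern and its per-slot vocabulary (as in the earlier sketch), emitting a Polygen-style grammar is little more than string formatting. This is only a hedged sketch: the rule names and input data are invented, and I'm assuming Polygen's usual `S ::= alternative | alternative ;` notation.

```python
# Hypothetical converter: one POS pattern plus per-tag vocabulary -> Polygen-like grammar text.
def to_polygen(pattern, vocabulary):
    rules = ["S ::= " + " ".join(pattern) + " ;"]
    for tag in sorted(set(pattern)):
        rules.append(tag + " ::= " + " | ".join(sorted(vocabulary[tag])) + " ;")
    return "\n".join(rules)

pattern = ("Det", "Noun", "Verb", "Prep", "Det", "Noun")
vocabulary = {
    "Det":  {"the", "a"},
    "Noun": {"cat", "dog", "mat", "table"},
    "Verb": {"sat", "slept"},
    "Prep": {"on", "under"},
}
print(to_polygen(pattern, vocabulary))
# S ::= Det Noun Verb Prep Det Noun ;
# Det ::= a | the ;
# Noun ::= cat | dog | mat | table ;
# ...
```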

More than a program, you need the scientific notions of how to do it. The rest would probably boil down to merely converting your findings to a Polygen grammar (probably achievable with settings and scripts within an NLP framework, without the need for a new program).

But don't take my word for it, I'm no expert (I might be wrong, except on the point that doing it manually is the fun part).

On the other hand, one could think of a "dumb" tool to aid in the task of isolating patterns: a program with some basic notions of how to isolate sentences from punctuation, taking user input to select, discard, catalog and break up samples of speech from the text into atoms, while translating all this into a Polygen grammar in real time (for testing purposes). This sounds achievable, and probably also easier on the user's side (for a corpus not too huge in size). Also, it wouldn't take away the fun I was speaking about; it would just make it less tedious.
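To make the idea a bit more concrete, here is a very rough sketch of such a helper (everything here is hypothetical: a naive punctuation-based sentence splitter, a console prompt to keep or skip each sample and file it under a category, and the growing Polygen-style grammar reprinted after every choice):

```python
# Hypothetical interactive helper: split a corpus into sentences, let the user
# catalog the ones worth keeping, and show the resulting Polygen-like grammar live.
import re
from collections import defaultdict

def split_sentences(text):
    # naive punctuation-based splitter, good enough for a small corpus
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def render_grammar(categories):
    rules = ["S ::= " + " | ".join(sorted(categories)) + " ;"]
    for name, samples in sorted(categories.items()):
        rules.append(name + " ::= " + " | ".join(samples) + " ;")
    return "\n".join(rules)

def interactive_session(text):
    categories = defaultdict(list)
    for sentence in split_sentences(text):
        answer = input("Keep '" + sentence + "'? [category name / ENTER to skip] ").strip()
        if answer:
            categories[answer].append(sentence)
            print("--- current grammar ---")
            print(render_grammar(categories))
    return categories

if __name__ == "__main__":
    interactive_session("The cat sat on the mat. The dog slept under the table.")
```

Breaking a kept sentence further into atoms would just mean letting the user split a sample and file the pieces under new non-terminals, which is the same bookkeeping one level down.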

Any ideas on what the user interface of such a program should look like?

alvisespano commented 6 years ago

Ciao Giulio! That is far beyond my personal intentions (and time) for polygen. I'm just going to revamp it a little, perhaps including library and separate-compilation support, though I am not going to apply sophisticated machine learning, because I am not an expert in the first place. Sorry for that :(