NaNoGenMo / 2020

National Novel Generation Month, 2020 edition.

Markov text with citations #18

serin-delaunay opened this issue 3 years ago

serin-delaunay commented 3 years ago

A common criticism of GPT language models is that they plagiarise text from the internet. As an experiment in smoothing over this issue, I will make a Markov chain language model that tags each n-gram observation with the location of the original in the source text.
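A minimal sketch of what that tagging might look like (my own assumption: word tokens indexed by position; the actual entry might key observations to line identifiers or byte offsets instead):

```python
from collections import defaultdict

def build_tagged_model(tokens, n=2):
    """Map each n-gram context to its observed continuations,
    tagging every observation with its position in the source."""
    model = defaultdict(list)
    for i in range(len(tokens) - n):
        context = tuple(tokens[i:i + n])
        next_token = tokens[i + n]
        # Record where this observation occurs in the source text,
        # so generated tokens can cite it later.
        model[context].append((next_token, i))
    return model
```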

This means that in the text generation stage, each output token can cite the n-gram it was drawn from in the source text. In the generated novel, I'll put this info in footnotes. This should make the resulting text much better sourced, and give the reader clarity about the true origin of any deep insights found in the novel.
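Roughly, generation could then look like this (again just a sketch; the footnote rendering here is a placeholder, and `build_tagged_model` is the helper sketched above):

```python
import random

def generate_with_citations(model, seed, length=50):
    """Generate tokens from the tagged model; each output token
    carries the source index of the observation it was drawn from."""
    context = tuple(seed)
    output, footnotes = [], []
    for _ in range(length):
        observations = model.get(context)
        if not observations:
            break  # dead end: this context was never continued in the source
        token, source_index = random.choice(observations)
        footnotes.append(source_index)
        output.append(f"{token}[{len(footnotes)}]")
        context = context[1:] + (token,)
    return " ".join(output), footnotes
```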

Haven't decided what source text to use. Maybe Shakespeare (all lines have a standard identifier), GPT research papers, Moby Dick...

Caveats:

serin-delaunay commented 3 years ago

If there's time I might also do a slightly more serious separate entry that doesn't boil down to "YAMC".

pjfpotter commented 3 years ago

Why not write an entire novel of footnotes? Each footnote is a citation of the n-gram that would have been in the novel but then wasn't, because it was replaced by its own citation. Let's see how deep this rabbit hole goes.

serin-delaunay commented 3 years ago

There's one like that at https://github.com/NaNoGenMo/2019/issues/68; I'd rather keep this one simple. The footnotes will have a pretty well-defined format, so they wouldn't need to be Markov-generated or nested.

greg-kennedy commented 3 years ago

This is the one that comes to mind when I think of obsessive footnotes: https://github.com/NaNoGenMo/2019/issues/127

serin-delaunay commented 3 years ago

Yeah, that's closer to what I'm going for here. Thanks for the link, I saw that one last year but it had slipped my mind.

verachell commented 3 years ago

What a cool idea!