dariusk / NaNoGenMo-2015

National Novel Generation Month, 2015 edition.

Who Lives in a Pineapple Under the Sea? MIS-TER DAR-CY! #133

Open toomuchpete opened 8 years ago

toomuchpete commented 8 years ago

My goal is to process a book from Project Gutenberg's Top 100 list, possibly Pride and Prejudice. The book will remain largely intact, but the quoted dialogue will be replaced with quotes generated from a corpus compiled from Spongebob Squarepants fanfic (collected from FanFiction.net).
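A minimal sketch of the quote-swapping step, assuming the source is plain text that marks dialogue with straight double quotes. The function name `replace_quotes` and the placeholder lines are hypothetical; the placeholders stand in for whatever the fanfic-trained generator produces.

```python
import re
from itertools import cycle

# Matches one span of straight double-quoted dialogue.
# Assumption: the e-text uses plain " characters rather than curly quotes.
QUOTE_RE = re.compile(r'"[^"]*"')

def replace_quotes(text, replacement_lines):
    """Swap each quoted span for the next line from replacement_lines.

    The lines are reused in order if the text has more quotes than lines.
    """
    lines = cycle(replacement_lines)
    return QUOTE_RE.sub(lambda m: '"' + next(lines) + '"', text)

sample = '"My dear Mr. Bennet," said his lady, "have you heard the news?"'
placeholders = ["GENERATED LINE ONE", "GENERATED LINE TWO"]  # stand-ins only
print(replace_quotes(sample, placeholders))
```

Punctuation inside the original quotes (the comma before "said his lady") is lost here, so a fuller version would probably want to carry trailing punctuation over.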

Probably the most jarring thing to solve is getting the names right. It would be disorienting to see Spongebob's name littered around Pride and Prejudice, but maybe that will be funny? Or there's probably some replacement that can be done, translating character names between the two.

toomuchpete commented 8 years ago

As requested, @kyfast.

dariusk commented 8 years ago

Love it.

KyFaSt commented 8 years ago

:pineapple:

MichaelPaulukonis commented 8 years ago

There's been dialogue swapping in the past, and I did character/noun swapping between two texts as well. But nobody has tackled the problem of getting references straight. I thought about it as one of my projects this year, but don't know if I'll get to it.

I won't be sad if you do the work for the rest of us!

enkiv2 commented 8 years ago

The word2vec-related projects have managed to translate references. If you make an explicit list of proper names in each source, you can probably make an explicit translation or use word2vec to produce correspondences for you.

MichaelPaulukonis commented 8 years ago

I would be intrigued to see this work; one problem is eponyms, nicknames, gender-references, and titles. "King" posed a particular problem for me, as the pos-tagger I was using always decided it was a verb. @enkiv2 - can you link to one or more projects that managed to translate references?

enkiv2 commented 8 years ago

Take a look at the translated titles and authors in https://github.com/dariusk/NaNoGenMo-2015/issues/72 ; this is what I mean. Word2vec correctly figured out that certain proper nouns were similar in the same way that it figured out that certain nouns are similar in general, from what I understand. If you whitelist proper nouns and have an explicit list of identical ways of referring to the same person which you normalize, you can do that with better reliability, but at that point you've done most of the work of creating a correspondence table between sets of characters and you might as well just do string replacement on them.
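For illustration, the correspondence-table approach could look roughly like this. The alias list and the name pairings are made up for the example, not a proposed mapping; `ALIASES`, `TRANSLATION`, and `translate_names` are hypothetical names.

```python
import re

# Hypothetical alias list: every way of referring to a character
# is normalized to one canonical key.
ALIASES = {
    "SpongeBob SquarePants": "spongebob",
    "SpongeBob": "spongebob",
    "Spongebob": "spongebob",
    "Patrick Star": "patrick",
    "Patrick": "patrick",
}

# Hypothetical correspondence table between the two casts.
TRANSLATION = {
    "spongebob": "Mr. Darcy",
    "patrick": "Mr. Bingley",
}

def build_pattern(aliases):
    # Longest aliases first, so "SpongeBob SquarePants" wins over "SpongeBob".
    ordered = sorted(aliases, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b")

def translate_names(text, aliases=ALIASES, translation=TRANSLATION):
    """Replace every known alias with its counterpart from the other cast."""
    pattern = build_pattern(aliases)
    return pattern.sub(lambda m: translation[aliases[m.group(0)]], text)

print(translate_names("SpongeBob SquarePants waved at Patrick."))
```

This does nothing about pronouns, titles, or gender agreement, which is where the harder part of the problem sits.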

ikarth commented 8 years ago

My Gutenberg Shuffle from 2013 attempted to respect references, but it turned out to be a bigger project than anticipated. It sort of got gender right, though I'd redo it if I went that way again.

Note that, at least for the libraries in gensim, pos-taggers work better on sentences rather than individual words.

enkiv2 commented 8 years ago

I was thinking you'd operate on the whole sentences, but then only pay attention to the whitelisted words.
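Roughly what I mean, as a sketch: the tagger sees the whole sentence, and the whitelist is only applied afterwards. `toy_tagger` is a placeholder so the example runs on its own; a real tagger that takes a token list and returns (token, tag) pairs would be passed in instead. All names here are hypothetical.

```python
def whitelisted_tags(sentence_tokens, tagger, whitelist):
    """Tag the full sentence, then keep (index, token, tag) for whitelisted tokens only."""
    tagged = tagger(sentence_tokens)  # the tagger gets full sentence context
    return [
        (i, token, tag)
        for i, (token, tag) in enumerate(tagged)
        if token in whitelist
    ]

def toy_tagger(tokens):
    # Placeholder, NOT a real POS tagger: capitalized tokens become "NNP",
    # everything else becomes "X".
    return [(t, "NNP" if t[:1].isupper() else "X") for t in tokens]

tokens = "The King spoke to Darcy".split()
print(whitelisted_tags(tokens, toy_tagger, {"King", "Darcy"}))
```

Keeping the token index makes it possible to substitute the translated name back into the same position in the sentence.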

michelleful commented 8 years ago

It sounds like what you'd need (if you did choose to somehow "translate" the names) is to have the names in the Spongebob corpus tagged for named entities, but in case it's useful to have a version of P&P that is name-tagged, the P&P e-text at Pemberley.com is conveniently so.

<P>``<A HREF="ppdrmtis.html#MrBennet">Mr.&#32;Bennet</A>, how can you abuse your own children in such way?  You take delight in vexing me.  You have no compassion on my poor nerves.''</P>
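A sketch of pulling the tagged names out of markup like the line above, using only the standard library. It assumes, based on this one sample, that character mentions are anchors whose `HREF` carries a `#fragment` identifying the character; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class NameTagParser(HTMLParser):
    """Collect (character_id, surface_text) pairs from anchors with a #fragment."""

    def __init__(self):
        super().__init__()  # character references such as &#32; are decoded by default
        self.names = []
        self._fragment = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> arrives as "a"/"href".
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "#" in href:
                self._fragment = href.split("#", 1)[1]
                self._text = []

    def handle_data(self, data):
        if self._fragment is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._fragment is not None:
            self.names.append((self._fragment, "".join(self._text)))
            self._fragment = None

def extract_names(html):
    parser = NameTagParser()
    parser.feed(html)
    parser.close()
    return parser.names

sample = ('<P><A HREF="ppdrmtis.html#MrBennet">Mr.&#32;Bennet</A>, '
          'how can you abuse your own children in such way?</P>')
print(extract_names(sample))
```

The fragment gives a stable identifier per character, so different surface forms of the same name can be grouped without building an alias list by hand, provided the assumption about the markup holds across the whole e-text.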