Open bobpoekert opened 11 years ago
Project Gutenberg is pretty clean if you strip off the license boilerplate and then eliminate single line breaks (and clean up whitespace).
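A minimal sketch of that cleanup, assuming the conventional `*** START OF ... ***` / `*** END OF ... ***` license markers (these vary between Gutenberg files, so treat the patterns as a heuristic):

```python
import re

def clean_gutenberg(text):
    """Strip Project Gutenberg license boilerplate, join hard-wrapped
    lines, and normalize whitespace. The START/END markers are a
    convention that varies across files; this is a heuristic, not a
    guarantee."""
    start = re.search(r"\*\*\*\s*START OF.*?\*\*\*", text)
    if start:
        text = text[start.end():]
    end = re.search(r"\*\*\*\s*END OF.*?\*\*\*", text)
    if end:
        text = text[:end.start()]
    # Protect paragraph breaks (blank lines) with a sentinel, then
    # turn the remaining single line breaks into spaces.
    text = re.sub(r"\n\s*\n", "\x00", text)
    text = re.sub(r"\s*\n\s*", " ", text)
    text = text.replace("\x00", "\n\n")
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```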
It's clean enough if you don't care about structure (paragraphs, chapters, etc.), but I do. It'd probably be more accurate to say I'm doing segmentation than cleaning.
:+1: sounds great; I'm working on a similar concept (#57).
You may be able to scrape fanfiction.net (someone else is, and it has chapter separation). Alternatively, you may be able to get subsets of Gutenberg in a more structured format (I think their EPUBs might embed XML), or you can look for TREC corpora, which are highly structured mainly for the benefit of people's toy search engines but are sometimes used in academic machine-learning projects for that reason. I don't think anybody's doing manual sentence tagging, though; you'll probably have to parse out sentence breaks yourself.
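Parsing out sentence breaks can be done with a rough rule-based splitter; the abbreviation list and helper name below are hypothetical, and a real corpus would want something more robust (e.g. a trained segmenter), but this sketches the idea:

```python
import re

# Common abbreviations that end with a period but don't end a sentence.
# This list is illustrative, not exhaustive.
ABBREVIATIONS = {"Mr", "Mrs", "Dr", "St", "vs", "etc"}

def split_sentences(paragraph):
    """Split on ., !, or ? followed by whitespace and a capital letter,
    then glue back splits that followed a known abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z\"'])", paragraph)
    sentences = []
    for piece in pieces:
        # If the previous piece ended in an abbreviation, the split was
        # spurious: rejoin the two pieces.
        if sentences and sentences[-1].rstrip(".").rsplit(" ", 1)[-1] in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences
```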
I already managed to get chapters/paragraphs out of the HTML versions of Gutenberg texts. I'm moving on to actual NLP now. :)
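The thread doesn't show how the extraction works, but assuming a Gutenberg-style HTML layout where chapters are headed by `<h2>` and body text lives in `<p>` tags (actual Gutenberg markup varies), a stdlib sketch might look like:

```python
from html.parser import HTMLParser

class GutenbergChapters(HTMLParser):
    """Collect chapters as lists of paragraph strings. Assumes <h2>
    starts a chapter and <p> holds body text; this is a guess at the
    markup, not the actual structure of every Gutenberg file."""
    def __init__(self):
        super().__init__()
        self.chapters = []   # list of chapters, each a list of paragraphs
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.chapters.append([])     # new chapter begins
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            text = " ".join("".join(self._buf).split())
            if text and self.chapters:
                self.chapters[-1].append(text)

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

parser = GutenbergChapters()
parser.feed("<h2>Chapter I</h2><p>First  paragraph.</p><p>Second.</p>"
            "<h2>Chapter II</h2><p>Third.</p>")
```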
I'm playing with doing multiple layers of Markov chains. For example: tokens get grouped into sentences, with sentence features being the top k keywords; sentences get grouped into paragraphs the same way, and paragraphs into chapters. Each layer (token, sentence, paragraph, chapter) gets its own Markov table composed of the features from the layer below. When generating, you pick a feature from the higher layer and use it to constrain the features you pick in the lower layer.
Though first I have to get some clean training data (which is what I'm working on now).
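A toy two-layer version of the idea (hypothetical, not the actual repo code): here a sentence's "feature" is just its longest word rather than real keywords, the sentence layer is a Markov chain over those features, and each feature constrains a word-level bigram table trained only on sentences that carried it.

```python
import random
from collections import defaultdict

def feature(sentence_words):
    # Crude stand-in for "top keyword": the longest word wins.
    return max(sentence_words, key=len)

def train(sentences):
    """Build the sentence-layer chain (feature -> next features) and,
    per feature, a word-layer bigram table from matching sentences."""
    sent_chain = defaultdict(list)
    word_chains = defaultdict(lambda: defaultdict(list))
    prev_feat = None
    for words in sentences:
        feat = feature(words)
        if prev_feat is not None:
            sent_chain[prev_feat].append(feat)
        prev_feat = feat
        table = word_chains[feat]
        for a, b in zip(words, words[1:]):
            table[a].append(b)
    return sent_chain, word_chains

def generate(sent_chain, word_chains, n_sentences, rng=random):
    """Walk the higher (sentence) layer; each chosen feature constrains
    which word-level table the lower layer samples from."""
    feat = rng.choice(list(word_chains))
    out = []
    for _ in range(n_sentences):
        table = word_chains[feat]
        word = rng.choice(list(table))
        sent = [word]
        while word in table and len(sent) < 20:
            word = rng.choice(table[word])
            sent.append(word)
        out.append(" ".join(sent))
        nxts = sent_chain.get(feat)
        feat = rng.choice(nxts) if nxts else rng.choice(list(word_chains))
    return out
```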
Here's my repo: https://github.com/rabidsnail/NaNoGenMo