dariusk / NaNoGenMo-2014

National Novel Generation Month, 2014 edition.

Resources #1

Open dariusk opened 10 years ago

dariusk commented 10 years ago

This is an open issue where you can comment and add resources that might come in handy for NaNoGenMo.

There are already a ton of resources on the old resources thread for the 2013 edition.

aparrish commented 10 years ago

Use ConceptNet to start writing your own Tale-Spin-like stories. "There was a kitten. The kitten was someone's pet. The kitten wanted to explore. Surfing the web is used for exploring. You need to connect to the internet in order to surf the web."
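
A minimal sketch of the idea, assuming you already have some ConceptNet-style assertions as (subject, relation, object) triples. The `FACTS` list, the sentence templates, and the `tell_story` helper are all made up for illustration; real ConceptNet data would have to be downloaded or queried separately.

```python
from collections import defaultdict

# Hypothetical hand-written assertions; not fetched from ConceptNet.
FACTS = [
    ("kitten", "IsA", "pet"),
    ("kitten", "Desires", "explore"),
    ("surfing the web", "UsedFor", "explore"),
    ("surfing the web", "HasPrerequisite", "connect to the internet"),
]


def index_facts(facts):
    """Index triples so they can be looked up by subject or by object."""
    by_subject = defaultdict(list)
    by_object = defaultdict(list)
    for subject, relation, obj in facts:
        by_subject[(subject, relation)].append(obj)
        by_object[(relation, obj)].append(subject)
    return by_subject, by_object


def tell_story(hero, facts):
    """Chain facts outward from the hero: what it is, what it wants, how to get it."""
    by_subject, by_object = index_facts(facts)
    lines = [f"There was a {hero}."]
    for kind in by_subject[(hero, "IsA")]:
        lines.append(f"The {hero} was a {kind}.")
    for goal in by_subject[(hero, "Desires")]:
        lines.append(f"The {hero} wanted to {goal}.")
        for means in by_object[("UsedFor", goal)]:
            lines.append(f"{means.capitalize()} is a way to {goal}.")
            for need in by_subject[(means, "HasPrerequisite")]:
                lines.append(f"To do that, you need to {need}.")
    return " ".join(lines)


if __name__ == "__main__":
    print(tell_story("kitten", FACTS))
```

With a bigger fact table the same walk can recurse on each prerequisite, which is roughly where the Tale-Spin flavour comes from.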

aparrish commented 10 years ago

I also feel like there may be relevant things in Michael Cook's list of procedural generation tutorials: http://procjam.tumblr.com/post/99689402659/procedural-generation-tutorials-getting-started

enkiv2 commented 10 years ago

I recommend, in addition to ConceptNet, CMU NELL. It's another ontology project, this one generated by parsing web pages. The data sets are here: http://rtw.ml.cmu.edu/rtw/resources

(There's also the potential for some of us to use IBM Watson, probably. I don't have that resource, but now that the APIs are open, for some definition of open, somebody could use them creatively, I suppose.)

cpressey commented 10 years ago

Project Gutenberg is now being mirrored on Github -- https://github.com/GITenberg -- although exactly what benefit that may or may not bring to you, the potential NaNoGenMo participant, I cannot say. I mention it mainly because I'm not sure if it was around last time -- I only noticed it recently (a month or two ago.)

tullyhansen commented 10 years ago

Generate a sufficiently complicated graph, and then explicate it for 50,000 words with wordgraph? https://wordgraph.readthedocs.org/en/latest/

ikarth commented 10 years ago

I came across this today: Tweet NLP (http://www.ark.cs.cmu.edu/TweetNLP/)

We provide a tokenizer, a part-of-speech tagger, hierarchical word clusters, and a dependency parser for tweets, along with annotated corpora and web-based annotation tools.

jeffThompson commented 10 years ago

Got a postcard from Blurb yesterday: if you want to print your finished novel, get 20% off using the code HOLIDAY14.

Expires November 30, so you'll have to finish a little early...

swizzard commented 10 years ago

2nding Tweet NLP--I'm doing some twitter-related stuff, and hacked around on it via Clojure for a bit last night. It's not terribly well-documented outside command-line stuff, but the source code is clear and well-commented.

I also can't recommend NLTK strongly enough. It's definitely intimidating if you're not familiar with NLP, but that's mostly because it's got so much stuffed into it--wordnet, several phenomenal corpora (including a bunch of Project Gutenberg stuff), markov generators, & more. NLTK3 is even Py3K-compatible, for all of you who really want to put some unicode in your novels/hate print statements.

christiaanw commented 10 years ago

2nding @swizzard about NLTK (though I just uninstalled NLTK3 in favor of 2.04 because Markov generators are not included in NLTK3), and adding Pattern to the mix. Pattern has linguistics (for parsing and information extraction based on chunk/word patterns), web-search (Google, Bing, Yahoo, Twitter, Wikipedia), web-crawling, language modelling (TF-IDF), classification, and commonsense reasoning. They're pretty approachable using the included examples. The website has some usage examples. patent-generator uses the search module from Pattern to generate patents from literary texts.

On the literary side I would like to mention two resources on Ubuweb: its Anthology of Conceptual Writing and /ubu editions. There's The first thousand numbers classified in alphabetical order by Claude Closky, Name, A Novel by Toadex Hobgrammathon and All the Numbers from Numbers by Kenneth Goldsmith, and possibly more stuff which could have been (or has been) done by computer.

swizzard commented 10 years ago

I had no idea they'd taken Markov generators out of NLTK3! That's such a bummer.

Pattern sounds great though. I'll definitely be looking into it.
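
For anyone who doesn't want to depend on a particular NLTK version for this, a word-level Markov generator fits in a few lines of plain Python. This is a sketch, not a port of NLTK's generator; `build_model` and `generate` are names invented here.

```python
import random
from collections import defaultdict


def build_model(text, order=2):
    """Map each run of `order` words to the list of words seen right after it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        model[key].append(words[i + order])
    return model


def generate(model, length=50, seed=None):
    """Walk the model, picking a random follower at each step."""
    if not model:
        raise ValueError("model is empty; the source text is too short for this order")
    rng = random.Random(seed)
    keys = list(model.keys())
    key = rng.choice(keys)
    output = list(key)
    while len(output) < length:
        followers = model.get(key)
        if not followers:
            # Dead end (this word run only occurs at the very end of the text):
            # jump to a random known key and carry on.
            key = rng.choice(keys)
            output.extend(key)
            continue
        output.append(rng.choice(followers))
        key = tuple(output[-len(key):])
    return " ".join(output[:length])


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat ran off with the hat"
    print(generate(build_model(sample, order=1), length=20))
```

Higher `order` values stay closer to the source text; 1 or 2 is usually where the output gets strange in an interesting way.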

hugovk commented 10 years ago

What's a good way to create a PDF book from plain text?

Are there any handy (Python) scripts to generate a half-decent PDF by throwing a bunch of text at it?

Or is it better to export to PDF from Libre/Open Office?

cpressey commented 10 years ago

@hugovk This is a biased opinion of course, but I like http://johnmacfarlane.net/pandoc/ for document format conversions. It's written in Haskell, which may or may not be your cup of tea, but you can shell out to it from Python. I'm sure there are Python-specific tools for generating PDFs too (generating a PDF is actually not that difficult, especially if it's just text, no images), but I don't know any offhand.
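
Shelling out to pandoc from Python could look roughly like this. The file names are placeholders and `pandoc_command` / `convert` are helper names made up for the sketch; it also assumes pandoc is on your PATH and, for PDF output, that a LaTeX engine is installed, since pandoc goes through LaTeX to make PDFs by default.

```python
import shutil
import subprocess


def pandoc_command(source, target):
    """Build the argument list; pandoc infers formats from the file extensions."""
    return ["pandoc", source, "-o", target]


def convert(source, target):
    """Run pandoc, failing loudly if it is missing or exits with an error."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(source, target), check=True)


if __name__ == "__main__":
    # Placeholder file names: point these at your generated novel.
    convert("novel.txt", "novel.pdf")
```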

jeffThompson commented 10 years ago

@hugovk – I've tried, with limited success, to use ReportLab. It's very confusing and hard to get nice formatting.

I would suggest generating a .txt file, then using Word/InDesign to format and export, or Ghostscript on the command line, or called from Python at the end of your script.

dariusk commented 10 years ago

@hugovk, Consider generating markup instead of text. I typically generate html, open it in a browser, and then use the print to PDF function.

This time around I might generate LaTeX markup instead, which renders very nicely to PDF.
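
A rough sketch of the generate-markup-then-print approach: the generator emits HTML instead of plain text, and the browser's print-to-PDF does the rest. The `novel_to_html` function and the (heading, paragraphs) chapter structure are assumptions made for the example, not anything from an existing generator.

```python
import html


def novel_to_html(title, chapters):
    """Render (heading, [paragraphs]) pairs as a single minimal HTML page."""
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<title>{html.escape(title)}</title></head><body>",
        f"<h1>{html.escape(title)}</h1>",
    ]
    for heading, paragraphs in chapters:
        parts.append(f"<h2>{html.escape(heading)}</h2>")
        for paragraph in paragraphs:
            # Escape generated text so stray < or & cannot break the page.
            parts.append(f"<p>{html.escape(paragraph)}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    page = novel_to_html("Meow", [("Chapter One", ["Meow meow.", "Meow & meow."])])
    with open("novel.html", "w", encoding="utf-8") as handle:
        handle.write(page)
```

Open the resulting novel.html in a browser and print to PDF. The same loop could emit LaTeX instead by swapping the templates.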

tullyhansen commented 10 years ago

@hugovk Seconding Darius on LaTeX - my preferred intermediary format from plain text is Fletcher Penney's MultiMarkdown, which is reasonably painless once up and running. Happy to help if I can!

hugovk commented 10 years ago

Thanks for all the suggestions! I think I'll keep it simple and just use plain text and PDF.

ReportLab looks like a powerful library (and yes, confusing) for making PDFs in Python.

Here's one script (https://openbookproject.googlecode.com/svn/tangle/html2pdf/reportlab_2_1/reportlab/tools/py2pdf/py2pdf.py) intended for turning Python source code into a PDF, but --mode mono gives a good enough quick book and has page numbers and a header. I'll probably use/modify this.

Here's another example: http://two.pairlist.net/pipermail/reportlab-users/attachments/20070213/80c61e82/attachment.obj (via http://two.pairlist.net/pipermail/reportlab-users/2007-February/005791.html).

Finally, here's another (http://code.activestate.com/recipes/532908/) that doesn't use ReportLab or any special libraries.

enkiv2 commented 10 years ago

In case anybody is particularly masochistic, here's a science fiction plot generator in Excel: http://www.j-paine.org/excelsior/repository/spin/

hugovk commented 10 years ago

Actually, MultiMarkdown looks good, but can someone point me in the right direction for going from there via LaTeX to PDF?

enkiv2 commented 10 years ago

From the man page for pdftex(1):

PDFTEX(1) Web2C 2009 PDFTEX(1)

NAME
   pdftex, pdfinitex, pdfvirtex - PDF output from TeX

SYNOPSIS
   pdftex [options] [&format] [file | \commands]

DESCRIPTION
   Run the pdfTeX typesetter on file, usually creating file.pdf. If the file argument has no extension, ".tex" will be appended to it. Instead of a filename, a set of pdfTeX commands can be given, the first of which must start with a backslash. With a &format argument pdfTeX uses a different set of precompiled commands, contained in format.fmt; it is usually better to use the -fmt format option instead.

   pdfTeX is a version of TeX, with the e-TeX extensions, that can create PDF files as well as DVI files.

   In DVI mode, pdfTeX can be used as a complete replacement for the TeX engine.

   The typical use of pdfTeX is with pregenerated formats for which PDF output has been enabled. The pdftex command uses the equivalent of the plain TeX format, and the pdflatex command uses the equivalent of the LaTeX format. To generate formats, use the -ini switch.

   The pdfinitex and pdfvirtex commands are pdfTeX's analogues to the initex and virtex commands. In this installation, if the links exist, they are symbolic links to the pdftex executable.

   In PDF mode, pdfTeX can natively handle the PDF, JPG, JBIG2, and PNG graphics formats. pdfTeX cannot include PostScript or Encapsulated PostScript (EPS) graphics files; first convert them to PDF using epstopdf(1). pdfTeX's handling of its command-line arguments is similar to that of the other TeX programs in the web2c implementation.


dariusk commented 10 years ago

Quoting man pages is, much like man itself, never helpful. Much better to link to prose tutorials.

Anyway I believe the question was how the whole mmd -> LaTeX -> PDF pipeline would look.

mmd to LaTeX

LaTeX to PDF
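
As a rough sketch of those two steps driven from Python: the `multimarkdown` flags below (`--to=latex`, `--output=`) are written from memory of its command-line help, so treat them as an assumption and check `multimarkdown --help` on your install. MultiMarkdown's LaTeX output may also expect its own LaTeX support files and document metadata in order to compile as a standalone document. `pipeline_commands` and `build_pdf` are names made up for the example.

```python
import subprocess
from pathlib import Path


def pipeline_commands(mmd_path):
    """Return the two commands: MultiMarkdown to LaTeX, then LaTeX to PDF."""
    tex_path = str(Path(mmd_path).with_suffix(".tex"))
    return [
        # Assumed flags; verify against `multimarkdown --help`.
        ["multimarkdown", "--to=latex", "--output=" + tex_path, mmd_path],
        # pdflatex writes the PDF next to where it is run.
        ["pdflatex", tex_path],
    ]


def build_pdf(mmd_path):
    """Run each step in order, stopping at the first failure."""
    for command in pipeline_commands(mmd_path):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    build_pdf("novel.md")  # placeholder file name
```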

cpressey commented 10 years ago

All this talk about generating PDFs got me remembering some old code I wrote to do just that, in Lua, which I decided to dig out of the attic and throw up on Github today. Of course, it doesn't do any of that layout kind of stuff with the spacing and the kerning and the orphans and the gutters and the suchlike, but if anyone is planning on using Lua to generate a 500-page long piece of concrete poetry ... well, it might be marginally more useful than the kinds of things my cat throws up, anyway.

lizadaly commented 10 years ago

For PDF support with potentially rich layout that's also developer-friendly, I recommend using CSS3 Paged Media: http://alistapart.com/article/building-books-with-css3. The best tools are commercial and expensive, but they typically have trials.

moonmilk commented 10 years ago

For "I got an alligator as a pet", I had Python generate really simple markdown, used a simple online converter to turn the markdown to HTML, and then printed the HTML to a PDF. It wasn't pretty but it was easy!

briansuda commented 10 years ago

If you want or need HTML to PDF creation, the easiest way I have found is to use PhantomJS. I've written HTML2PDF as a service: you can click the "Deploy to Heroku" button and have your own running in minutes. http://github.com/optional-is/html2pdf

ikarth commented 10 years ago

Since we've brought up interactive novels, Curveship is a system for interactive narrative simulation written in Python. I imagine you could conceivably use it as part of a non-interactive novel generation process. https://github.com/nickmontfort/curveship

hugovk commented 10 years ago

If you want OCR text/images from newspaper pages, I made a simple Python wrapper (https://github.com/hugovk/chroniclingamerica.py) around the Chronicling America API, which is full of scanned newspapers. From each search result you get an id (e.g. '/lccn/sn86063756/1911-03-23/ed-1/seq-3/'), from which you can easily get the image (e.g. http://chroniclingamerica.loc.gov//lccn/sn86063756/1911-03-23/ed-1/seq-3.jp2).

The Library of Congress is also on Flickr (https://secure.flickr.com/photos/library_of_congress/), along with over a million images from the British Library (https://secure.flickr.com/photos/britishlibrary), 2.6m Internet Archive Book Images (https://secure.flickr.com/photos/internetarchivebookimages/) and scores more in Flickr Commons (https://secure.flickr.com/commons).

enkiv2 commented 10 years ago

Another corpus that people might look at is the archive of all State of the Union speeches: http://millercenter.org/president/speeches

If nothing else, a Markov model fed with the speeches of very dissimilar politicians can be entertaining.

cpressey commented 10 years ago

http://www.qdl.qa/en has lots of scanned material (letters from the India Company, etc.) although you'd have to figure out how to best scrape it and OCR it if you wanted to use the actual words.

It was brought to my attention by this BBC News article. Incidentally, if you are looking for a name for your generator, you could probably do worse than naming it "Warris Ali":

hugovk commented 10 years ago

I found this handy script that takes MultiMarkdown and uses ebook-convert (part of Calibre) to create MOBI, EPUB and PDF output.

http://ianhocking.com/2013/06/23/writing-a-novel-using-markdown-part-two/

lcooke commented 10 years ago

A few months ago I made an API for the Aeneid, which people here might find a use for. It includes a few English translations, a Latin version, and keyword search.

http://aeneid.eu/api/

ikarth commented 10 years ago

The ProcGen Jam is going on right now, and might have a couple of useful resources in and around it: http://procjam.tumblr.com/

hugovk commented 10 years ago

twarc: "a command line tool for archiving JSON twitter search results"

moonmilk commented 10 years ago

Maybe or maybe not a useful resource: if you can't think of a title for your novel, my wrimo-titler.py will steal one for you, from people's #nanowrimo posts on twitter.

https://github.com/moonmilk/nanogenmo2014

gambolputty commented 10 years ago

Two other useful resources:

For those new to scraping: https://www.kimonolabs.com lets you build custom APIs and scrape any website on a schedule. Output can be JSON, XML or CSV.

If you don't want to deal with the messy structure of Wikipedia, use https://www.freebase.com – easy structure, up to 100,000 API read calls per DAY.

moonmilk commented 10 years ago

If you're generating steamy novels or writing a sextbot, and you need an industry standard mapping between hexadecimal numbers and body parts, you could do worse than the USB Device Class Definition for Human Interface Devices http://www.usb.org/developers/hidpage/

00 None 
01 Hand 
02 Eyeball 
03 Eyebrow 
04 Eyelid 
05 Ear 
06 Nose 
07 Mouth 
08 Upper lip 
09 Lower lip 
0A Jaw 
0B Neck 
0C Upper arm 
0D Elbow 
0E Forearm 
0F Wrist 
10 Palm 
11 Thumb 
12 Index finger 
13 Middle finger 
14 Ring finger 
15 Little finger 
16 Head 
17 Shoulder 
18 Hip 
19 Waist 
1A Thigh 
1B Knee 
1C Calf 
1D Ankle 
1E Foot 
1F Heel 
20 Ball of foot 
21 Big toe 
22 Second toe 
23 Third toe 
24 Fourth toe 
25 Little toe 
26 Brow 
27 Cheek 
28-FF Reserved 
* The Qualifier field indicates which hand (or half of the body) the designator is defining. This may not apply to some devices.

moonmilk commented 10 years ago

Are there any nice print css templates out there? There's tons of free template sites for web design, but none for novel design... for some reason.

ikarth commented 10 years ago

@moonmilk The Magic Book Project might help, if you're looking to format a whole book: https://github.com/runemadsen/Magic-Book-Project

moonmilk commented 10 years ago

I figured out enough print CSS to make my PDF look like a cheap paperback instead of a printed out web page. It's very satisfying! I learned what I needed to know from here:

http://www.tutorialspoint.com/css/css_paged_media.htm

I also added a slightly tacky google font, for more distance from the default web look.

Check out wrimo.css and novelette3.html for the details, and novelette3.pdf shows the result. https://github.com/moonmilk/nanogenmo2014

https://github.com/dariusk/NaNoGenMo-2014/issues/99#issuecomment-63325003
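
For a sense of how little print CSS is needed, here is a sketch that wraps generated HTML in a page with a few paged-media rules. The page size, margins, and font are illustrative guesses and not the values from wrimo.css; `PRINT_CSS` and `wrap_for_print` are names invented for the example.

```python
# Illustrative paperback-ish rules; tune the numbers to taste.
PRINT_CSS = """
@page { size: 5in 8in; margin: 0.75in; }
body { font-family: Georgia, serif; font-size: 11pt; line-height: 1.3; }
h2 { page-break-before: always; }
p { margin: 0; text-indent: 1.5em; text-align: justify; }
"""


def wrap_for_print(body_html, css=PRINT_CSS):
    """Return a full HTML page with the print stylesheet embedded."""
    return (
        "<!DOCTYPE html>\n<html><head><meta charset='utf-8'>\n"
        f"<style>{css}</style>\n</head><body>\n{body_html}\n</body></html>"
    )


if __name__ == "__main__":
    page = wrap_for_print("<h2>Chapter One</h2><p>It was a dark and stormy night.</p>")
    with open("novelette.html", "w", encoding="utf-8") as handle:
        handle.write(page)
```

How faithfully `@page` is honoured depends on the browser or converter doing the printing, so it is worth checking the PDF rather than trusting the stylesheet.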

enkiv2 commented 10 years ago

If your novel involves automatic translation, there's a Ruby gem called termit which automates calling out to Google Translate. It's useful to note that it will also produce synthspeech, although you'd need to edit the script to actually keep any of the audio -- I think it downloads the audio file that Google Translate creates as an audio preview and then plays it with mpg123. I used it to simulate translationparty.com in my tparty script: https://github.com/enkiv2/tparty

Another semantic noise generator is my synonym warper: https://github.com/enkiv2/synonym-warp

If you'd like to generate C code to generate music (rather than simply generating music), try this: https://github.com/enkiv2/musicfromsmallprograms


moonmilk commented 10 years ago

And here's a python recipe for scraping speech from google translate (almost certainly against the terms of service)

http://stackoverflow.com/a/7227232

dariusk commented 10 years ago

Emily Short just put up a blog post with a ton of text generation resources (geared at interactive fiction, but useful here as well).

MichaelPaulukonis commented 10 years ago

Fraktur converts your text to 𝔣𝔯𝔞𝔨𝔱𝔲𝔯 𝔲𝔫𝔦𝔠𝔬𝔡𝔢 𝔠𝔥𝔞𝔯𝔞𝔠𝔱𝔢𝔯𝔰.

That's.... different.
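
If you'd rather do the conversion in your own script, the mapping is mostly an offset into Unicode's Mathematical Fraktur block. This is a sketch of that mapping, not the code behind the Fraktur tool itself, and `to_fraktur` is a name made up here. Five capitals (C, H, I, R, Z) live in the older Letterlike Symbols block and need special-casing.

```python
# Capitals that were encoded earlier, outside the Mathematical Fraktur run.
EXCEPTIONS = {
    "C": "\u212D",
    "H": "\u210C",
    "I": "\u2111",
    "R": "\u211C",
    "Z": "\u2128",
}

FRAKTUR_CAPITAL_A = 0x1D504
FRAKTUR_SMALL_A = 0x1D51E


def to_fraktur(text):
    """Replace ASCII letters with Fraktur code points; leave everything else alone."""
    out = []
    for ch in text:
        if ch in EXCEPTIONS:
            out.append(EXCEPTIONS[ch])
        elif "A" <= ch <= "Z":
            out.append(chr(FRAKTUR_CAPITAL_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(FRAKTUR_SMALL_A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(to_fraktur("Call me Ishmael."))
```

Whether the result renders depends on the reader's fonts, and PDF converters in particular may need a font that covers these code points.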

MichaelPaulukonis commented 10 years ago

NAMES NAMES NAMES

This is a collection of names from around the world which was initially intended to help provide character names for live role-players. It includes short historical backgrounds, male and female first names or personal names, and surnames or family names, from many countries and periods. The author is not an expert in onomastics or history so would like to apologise if any mistakes have been made. All names included are from genuine sources to the best of her knowledge, but this is not an academic study and should not be relied upon by re-enactment societies which require specific dates and instances of occurrence for the names they use.

This collection of names was compiled by Kate Monk and is ©1997, Kate Monk. Copies may be made for personal use only.


More names links which lists the above as a "site to avoid". Actually, all of the links are "sites to avoid" since they are "compiled without reference to historical usage".

Cited sources include this one.

hugovk commented 10 years ago

GitHub repo tips

TXT / PDF

Append ?raw=true to URLs of things like text files to see the whole plain text file, and to PDFs to have them download immediately. For example, compare:

https://github.com/hugovk/meow.py/blob/master/meow-x2-pg2701.txt
https://github.com/hugovk/meow.py/blob/master/meow-x2-pg2701.txt?raw=true

https://github.com/hugovk/meow.py/blob/master/meow-pg2701.pdf
https://github.com/hugovk/meow.py/blob/master/meow-pg2701.pdf?raw=true

HTML / PDF

For HTML files, create a branch called gh-pages and then instead of:

https://github.com/hugovk/lexiconstruct/blob/gh-pages/a-dictionary-of-not-a-words.html
https://github.com/hugovk/lexiconstruct/blob/gh-pages/a-dictionary-of-not-a-words.html?raw=true

You can have it hosted:

https://hugovk.github.io/lexiconstruct/a-dictionary-of-not-a-words.html

And you can do the same with PDF. Compare:

https://github.com/hugovk/gutengrep/blob/gh-pages/output/gutenstory.pdf?raw=true
https://hugovk.github.io/gutengrep/output/gutenstory.pdf

And for simplicity's sake, if you go to the repo's settings, you can change the default branch from master to gh-pages, and then even delete master and just work on gh-pages.
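
The URL patterns above are regular enough to compute. A small sketch, assuming the standard github.com/USER/REPO/blob/BRANCH/PATH layout; `raw_url` and `pages_url` are helper names invented for the example.

```python
from urllib.parse import urlsplit


def raw_url(blob_url):
    """Append ?raw=true so GitHub serves the file itself, not the viewer page."""
    return blob_url + "?raw=true"


def pages_url(blob_url):
    """Turn a /blob/gh-pages/ URL into the matching USER.github.io address."""
    parts = urlsplit(blob_url).path.strip("/").split("/")
    if len(parts) < 5 or parts[2] != "blob" or parts[3] != "gh-pages":
        raise ValueError("expected a github.com/USER/REPO/blob/gh-pages/PATH URL")
    user, repo = parts[0], parts[1]
    return f"https://{user}.github.io/{repo}/" + "/".join(parts[4:])


if __name__ == "__main__":
    blob = "https://github.com/hugovk/gutengrep/blob/gh-pages/output/gutenstory.pdf"
    print(raw_url(blob))
    print(pages_url(blob))
```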

hugovk commented 10 years ago

There's also https://rawgit.com/

RawGit serves raw files directly from GitHub with proper Content-Type headers.

Compare:

https://github.com/moonmilk/nanogenmo2014/blob/master/novel.pdf?raw=true
https://cdn.rawgit.com/moonmilk/nanogenmo2014/master/novel.pdf

zachwhalen commented 10 years ago

Here's a tip, in case anyone is looking for last-minute text source ideas. Maybe this is common knowledge, but I only found out how to do this recently ...

You know how Twitter search results only return tweets from the last week or so? They're making improvements on the web and desktop interface, but last time I checked, the REST API was still loading the "one-week" index and not the full index you can access elsewhere.

So what to do? Well, topsy.com has a searchable full index of tweets that can be sorted by date. The web interface is limited in various ways, though, and a "pro" account is super expensive. BUT the website's search interface gets its results from Topsy's API, and the request URL (search.js) includes an API key. You can query that directly and get JSON. You can also tweak the parameters to get up to 100 results per chunk.

I thought maybe the API key was temporary since it's just exposed, but I've been using the same one for about a week.

Again, maybe this is common knowledge, but if not, hope it helps someone else!

ikarth commented 10 years ago

A non-text-generation resource: Noticed that we're getting near the end of the month and a couple people mentioned not knowing how to upload their source code. For those who want to upload their source code to Github, but have no idea how to use git, a couple of resources:

http://www.sourcetreeapp.com/ https://code.google.com/p/tortoisegit/

Both of these will give you a GUI that lets you interface with git without having to learn all of the commands. Makes it easy for a beginner to get started with source code control.

You can, of course, just put up a zip file with your source code, or post it as a gist (https://gist.github.com/), but I wanted everyone to be aware of some of the resources that make things a lot easier.

enkiv2 commented 10 years ago

This is a little late in the game, but maybe someone will do something with this next year. Stanford released the source for their image description project (for those who didn't hear about this, it hooks a neural net that's good at image classification up to a neural net that's good at text generation and then trains the pair on images and user-generated descriptions, with the result that it can produce a human-like description of an image). The source is here: https://github.com/karpathy/neuraltalk

Somebody could probably use this to generate stories out of individual frames of video.

ikarth commented 10 years ago

For those of you who want to convert your text output to another format (and don't already have a library to do it in code) some tools that I've found:

Pandoc
MultiMarkdown

ianmart1n commented 9 years ago

sorry this is very late but I was asked to bring these resources over here. maybe useful for next year? a list of nouns, and some word lists that are presumably hacker resource files... but i found the long list of names really exhaustive and useful.