dariusk / corpora

A collection of small corpora of interesting data for the creation of bots and similar stuff.

Corpora

This project is a collection of static corpora (plural of "corpus") that are potentially useful in the creation of weird internet stuff. I've found that, as a creator, sometimes I am making something that needs access to a lot of adjectives, but not necessarily every adjective in the English language. So for the last year I've been copy/pasting an adjs.json file from project to project. This is kind of awful, so I'm hoping that this project will at least help me keep everything in one place.

I would like this to help with rapid prototyping of projects. For example: you might use nouns.json to start with, just to see if an idea you had was any good. Once you've built the project quickly around the nouns collection, you can then rip it out and replace it with a more complex or exhaustive data source.
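As a rough sketch of that workflow, here is how a prototype might load a word list from a JSON file and build a little logic on top of it. The file layout (an object with a "nouns" key), the sample words, and the helper names `load_word_list` and `make_phrase` are assumptions for illustration, not something this project prescribes. Check the actual file you use for its real structure.

```python
import json
import random
import tempfile
from pathlib import Path


def load_word_list(path, key):
    """Read a JSON corpus file and return the list stored under `key`."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return list(data[key])


def make_phrase(nouns, rng=random):
    """Toy prototype logic: join two distinct nouns picked at random."""
    first, second = rng.sample(nouns, 2)
    return f"{first} {second}"


# Stand-in file so the sketch runs on its own. In a real prototype,
# point load_word_list at your copy of nouns.json instead.
# The "nouns" key and these four words are assumed, not taken from the project.
sample = {"nouns": ["lantern", "harbor", "violin", "glacier"]}
sample_path = Path(tempfile.mkdtemp()) / "nouns.json"
sample_path.write_text(json.dumps(sample), encoding="utf-8")

nouns = load_word_list(sample_path, "nouns")
print(make_phrase(nouns))
```

Because `make_phrase` only needs a plain list of strings, swapping in a bigger data source later should mostly mean replacing `load_word_list` with something that returns the same kind of list.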

I'm also hoping that this can be used as a teaching tool: maybe someone has three hours to teach how to make Twitter bots. That doesn't give the student much time to find/scrape/clean/parse interesting data. My hope is that students can be pointed to this project and they can pick and choose different interesting data sources to meld together for the creation of prototypes.

License

Since Corpora is more data than code, I have chosen to release it under a CC0 license (rather than an MIT license or similar).

To the extent possible under law, Darius Kazemi has waived all copyright and related or neighboring rights to Corpora. This work is published from: United States.

What is Corpora NOT?

This project is not meant to replace exhaustive APIs. If you want every noun in the English language, complete with metadata, consider Wordnik. If you want the title of every Wikipedia article, use the MediaWiki API.

What is Corpora?

List of Corpora-related tools

I have some data, how do I submit?

We accept pull requests to this repository. Some guidelines:

Contributors

By Darius Kazemi and Many Wonderful Contributors.