Pull in other corpora

RichardLitt commented 7 years ago

Ought we to pull in other corpora? For instance, https://github.com/joshuabragge/MarkovCookie.

kirkins commented 7 years ago

Wouldn't be too hard with a CSV-to-JSON converter. The only catch is that after you pull in one or two more corpora, there will be the problem of possible duplicates.

Any ideas of how to automate checking for duplicates?
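For the conversion step mentioned above, here is a minimal sketch in Python. It assumes the incoming CSV has a header row and that the fortune text lives in a column named `fortune`; that column name is hypothetical, since the layout of any particular outside corpus isn't given in this thread.

```python
import csv
import io
import json


def csv_to_json(csv_text, column="fortune"):
    """Convert CSV text with a header row into a JSON array of fortune strings.

    The column name "fortune" is an assumption; adjust it to the real header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fortunes = []
    for row in reader:
        # Missing cells come back as None, so fall back to an empty string.
        text = (row.get(column) or "").strip()
        if text:
            fortunes.append(text)
    return json.dumps(fortunes, ensure_ascii=False, indent=2)


sample = "fortune\nYou will find joy.\nA smile is your passport.\n"
print(csv_to_json(sample))
```

Rows with an empty fortune cell are skipped rather than written out as empty strings.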

RichardLitt commented 7 years ago

Pretty easy to build a cleaner to turn all of the data into an array, and then keep only the unique entries. My question: do we want to bring in possibly bad corpora?
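A rough sketch of what such a cleaner could look like, in Python. The normalization step (collapsing whitespace and ignoring case) is an assumption about what should count as a duplicate, and the function names are made up for illustration; a strict exact-match check would skip `normalize` entirely.

```python
def normalize(text):
    """Collapse whitespace and case so near-identical fortunes compare equal."""
    return " ".join(text.split()).casefold()


def unique_fortunes(fortunes):
    """Return fortunes in first-seen order, dropping normalized duplicates."""
    seen = set()
    unique = []
    for fortune in fortunes:
        key = normalize(fortune)
        # Skip blank entries and anything already seen in normalized form.
        if key and key not in seen:
            seen.add(key)
            unique.append(fortune)
    return unique


merged = ["You will find joy.", "you will  find joy.", "A smile is your passport."]
print(unique_fortunes(merged))
```

The first spelling of each fortune is the one that survives, so the order in which corpora are merged decides which variant is kept.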

kirkins commented 7 years ago

I'm interested in linguistics but basically a complete noob.

What makes good corpora? Is it just being able to verify that a fortune was actually acquired through a real cookie? As opposed to someone who may have just made up a bunch of 'fake' fortune cookies?

RichardLitt commented 7 years ago

I think the answer to your question has less to do with linguistics and more to do with motivations.

I started this because @alvations, an old friend of mine from Germany, and I have been making fortune cookie corpora for years, by hand. It's really silly, but it's something that has given us a bit of joy - opening up cookies, and putting their translations somewhere. For me, I've mostly just been storing the fortunes - I have hundreds hidden away around my apartment and in storage. I'm sure @alvations has the same.

The idea wasn't, necessarily, to make a giant list of all fortunes. If we wanted that, we could use the fortune command in UNIX. Or we could pull in one of the many - dozens, at least - corpora on the net. But I'm not sure that would help the original goal - having fun with translated corpora of fortune cookies we make ourselves.

I also don't know about the quality of other corpora. This is an easy excuse - made above - for not pulling in dozens of others. It's not like Chinese fortune cookies are particularly well done, anyway - on the whole, they tend to be rife with spelling mistakes and bad grammar (that's part of the fun). But I don't want to pull in fortunes that weren't originally from actual paper fortunes. Of course, you (or I) could hypothetically have already added those here, and I wouldn't know, so that's kind of a facile argument.

The real question is: Do we want to make this a giant corpus, or do we want to make it a fun corpus?

kirkins commented 7 years ago

Interesting, I had never heard of the concept before seeing your repo.

I was attracted by the fun aspect as well. I can assure you that all of my fortunes are from paper. I was feeling pretty sick after all those cookies.

alvations commented 7 years ago

Hohoho, I haven't even opened that year's worth of Gmail archive of horoscope.com emails for all 12 zodiac signs; that'll give enough "fortunes" for substantial work. But munging it might take 1-2 whole weekends.

> Do we want to make this a giant corpus, or do we want to make it a fun corpus?

Of course, the answer is GIANT FUN CORPUS!

The proper way to do this, if we were to turn actual fortune cookies into computerized text, is to turk them. If there were a website/app to take a photo of the fortune paper and upload it somewhere, then we could easily use OCR on the image and voilà, corpus data =)
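As a rough idea of the OCR step, here is a sketch in Python. It assumes the `pytesseract` and Pillow packages plus the Tesseract binary are installed; that is just one possible OCR stack, not something this thread settles on, and both function names are made up for illustration.

```python
def clean_ocr_text(raw):
    """Join the lines OCR returns for one paper slip into a single fortune."""
    return " ".join(raw.split())


def fortune_from_photo(path):
    """Run OCR on a photo of a fortune slip and return the cleaned text."""
    # Imported lazily so clean_ocr_text works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    return clean_ocr_text(pytesseract.image_to_string(Image.open(path)))
```

Real photos would likely need cropping and a human check before the text goes into the corpus, since OCR on small printed slips is error-prone.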

But knowing how to incentivize people to upload their fortunes might be hard. Possibly making it into a game with levels, like what "Google Local Guide" does, is the right line of thought.

Imagine if the app becomes the niche "instagram" for people to share their cookies while we collect the corpus and allow people to automatically generate things like https://github.com/nprapps/quotable or memes with the quote ;P

RichardLitt commented 7 years ago

> But munging it might take 1-2 whole weekends.

Probably worth it.

> Of course, the answer is GIANT FUN CORPUS!

Problem solved then. @kirkins - feel free to import whatever you like. Doing it in a PR would be great (even though you can merge), so that we can check how it's validated for now (let me know if that doesn't make sense). As for me, I'm not the biggest fan of shimming data, so I am going to keep putting things in by hand.

> If there's a website/app to take a photo of the fortune paper and upload it somewhere, then we can easily use OCR on the image and voilà, corpus data

This would not be hard to build, no? Would be pretty sweet. A game might be fun, actually. Do people use these fortunes in China? Their market is larger.

RichardLitt / fortune-cookie-corpus

Pull in other corpora #1