
Omniglot crawl/cleaner #3

Open alvations opened 10 years ago

alvations commented 10 years ago

I've uploaded the base Omniglot crawler/cleaner code. get_phrases() has already crawled and cleaned the multilingual phrases from http://www.omniglot.com/language/phrases/langs.htm#lang, and the output is in sugarlike/data/omniglot/omniglot-phrases.tar.
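If you want to sanity-check what was crawled, something like this works; it's only a sketch, and the member layout inside the tar is just whatever the crawler happened to write:

```python
import tarfile

# Quick sanity check of the crawled phrases tarball. The member names and
# layout inside the tar are illustrative, not a fixed format.
with tarfile.open('sugarlike/data/omniglot/omniglot-phrases.tar') as tar:
    for member in tar.getmembers():
        print member.name, member.size
        # To read one cleaned file back out:
        # text = tar.extractfile(member).read().decode('utf-8')
```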

@Ga: we'll look through it on Mon/Tues (25-26.11.13) to clean, or crawl+clean, the rest of the parsable single pages.

guyemerson commented 10 years ago

I think the phrases and the Babel stories are the most important things on Omniglot, since they cover a lot of languages.

Colours and time expressions are probably worth looking at, but we're mainly getting individual words here.

Idioms, tongue twisters, kinship terms, and signs might also be worth looking at, but these have lower coverage across languages.

UDHR can be ignored since we can get the full text elsewhere. The other ones are getting really sparse and I don't know whether there's much point looking at them.

alvations commented 10 years ago

The test module for the Omniglot phrases has been pushed to the repo, and I've added test capabilities to get_phrases().

@Ga, can you take a look at how to clean the Babel stories from the HTML source? I'll work on the ODIN parser now that the UDHR is settled. =)

guyemerson commented 10 years ago

We also need to sort out line breaks and slashes properly. Currently, slashes are being kept in, and some lines are being stuck together without an intervening space.

I think the simplest fixes would be: 1. Add a space when removing a line break. 2. Always split on slashes, and remove the slash.

Ideally, we should check if a line break is supposed to separate phrases or not, but I suspect this might be something that is done inconsistently - the site is supposed to be human readable, not machine readable, after all.
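Something along these lines would do it; clean_phrase is a hypothetical helper, not anything already in the repo:

```python
# -*- coding: utf-8 -*-

def clean_phrase(raw):
    """Hypothetical cleaner for one phrase cell scraped from Omniglot."""
    # 1. Replace line breaks with a space so words don't get stuck together.
    flattened = u' '.join(part.strip() for part in raw.split(u'\n'))
    # 2. Always split on slashes and drop the slash itself.
    return [variant.strip() for variant in flattened.split(u'/') if variant.strip()]

print clean_phrase(u'Good morning /\nGood day')
# -> [u'Good morning', u'Good day']
```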

alvations commented 10 years ago

Are you referring to the phrase pages or the Babel story pages?

guyemerson commented 10 years ago

I was referring to the phrase pages.

Have you crawled the babel story pages yet? (In other words, do I need to crawl and clean the pages, or just clean what's already been downloaded?)

alvations commented 10 years ago

Ah, for get_phrases(), the slashes can easily be removed before the data is sent to the feature extraction model during training. Data-wise, it should be okay to keep the tarfile as it is (closest to the data source, but still processable before model building).

For get_babel(), I suggest cleaning while crawling; that way it's easier to avoid passing over the data multiple times, and no encoding errors will creep in when files are opened or saved in non-UTF-8 formats. You can take a look at urllib2.urlopen() (http://docs.python.org/2/library/urllib2.html) or at the base code I've written for get_phrases().

Cleaning with BeautifulSoup while crawling does slow down the crawl rate a bit, but I went with that approach for get_phrases() anyway. More importantly, we don't explode the size of the repo by working on statically downloaded HTML files. =)
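Roughly, the clean-while-crawling idea looks like this; the URL and the text extraction are placeholders, not the actual get_phrases() code:

```python
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup

def crawl_and_clean(url):
    """Fetch one page and return cleaned text, without keeping the raw HTML."""
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Placeholder extraction: the real babel pages would need a more
    # targeted selector for the story paragraphs.
    return soup.get_text(separator=u'\n').strip()

# Placeholder URL; the real crawler would walk the babel index for links.
text = crawl_and_clean('http://www.omniglot.com/babel/index.htm')
print text[:200]
```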

guyemerson commented 10 years ago

Okay, but we still need to fix the line breaks.

alvations commented 10 years ago

No worries: when you have s = u'foo bar \n black sheep\n bar bar', s.replace('\n', ' ') would give you the flat text, and map(unicode.strip, s.split('\n')) would give you the individual phrases in a list. =)
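As a runnable snippet (Python 2, same unicode style as above):

```python
# -*- coding: utf-8 -*-
s = u'foo bar \n black sheep\n bar bar'

# Flatten to one line, keeping a space where each line break was.
print u' '.join(part.strip() for part in s.split(u'\n'))
# -> foo bar black sheep bar bar

# Or keep the line breaks as phrase boundaries.
print [part.strip() for part in s.split(u'\n')]
# -> [u'foo bar', u'black sheep', u'bar bar']
```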

guyemerson commented 10 years ago

Also, I wasn't suggesting passing the data multiple times, I just meant downloading everything once, and then pre-processing it once. This way we wouldn't have to crawl the site each time we tweak how we process the data.

If you think it's better to crawl each time, then we can. (Omniglot is probably small enough for that, anyway, right?)

alvations commented 10 years ago

It depends on the motive. I think it's better to work with on-the-fly or compressed data rather than with individual files, which cause problems when you want to walk through a directory to read them. If you have a combination of .tar and .htm files it wouldn't be much of a problem, but nevertheless I wouldn't recommend putting the crawled .htm files in the repo.

@Ga: Run crawl_babel_pages() in omniglot.py; that should crawl all the pages you need and put them into sugarlike/data/omniglot/babel/. But I still think that when you push the cleaned data up to the repo, it should be either a pickled file or a tarball, so that we keep one file per data type (easier to maintain when training models). =)
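For example, packing everything into one pickle per data type could look like this (the file name and the dict layout are only suggestions):

```python
# -*- coding: utf-8 -*-
import cPickle as pickle

# Hypothetical layout: one dict mapping language code -> cleaned babel text.
babel = {
    u'deu': u'... cleaned German babel story ...',
    u'fra': u'... cleaned French babel story ...',
}

# One pickled file per data type, so the repo holds a single artefact
# instead of hundreds of crawled .htm files.
with open('sugarlike/data/omniglot/omniglot-babel.pk', 'wb') as fout:
    pickle.dump(babel, fout)
```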

alvations commented 10 years ago

@alvations, please add data interface functions to access the Omniglot Babel data!!
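A possible shape for that interface, with the function name, path, and pickle layout all placeholders until the cleaned data is actually pushed:

```python
# -*- coding: utf-8 -*-
import cPickle as pickle

def get_omniglot_babel(path='sugarlike/data/omniglot/omniglot-babel.pk'):
    """Hypothetical accessor: yield (language, cleaned babel text) pairs."""
    with open(path, 'rb') as fin:
        babel = pickle.load(fin)
    for lang, text in sorted(babel.items()):
        yield lang, text

# Usage sketch:
# for lang, text in get_omniglot_babel():
#     print lang, len(text)
```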