ajinkyakulkarni14 / TED-Multilingual-Parallel-Corpus

TED Parallel Corpora is a growing collection of bilingual parallel corpora, multilingual parallel corpora, and monolingual corpora extracted from TED talks (www.ted.com) for 109 world languages.

Separate documents? #2

Open fsimonjetz opened 7 years ago

fsimonjetz commented 7 years ago

The readme says "All data have been processed automatically so that it is not possible to reconstruct the original source texts." I'm considering using the German-Korean data for my PhD project; however, for what I have in mind it would be helpful to have the documents separated. Is this information available? Even stand-off indices would be nice. I hope you can keep up this project; it looks like a promising resource!

ajinkyakulkarni14 commented 7 years ago

@fsimonjetz Thank you for your interest in this project.

I would suggest using the following script to generate your own pairs of parallel data: https://github.com/ajinkyakulkarni14/How-I-Extracted-TED-talks-for-parallel-Corpus-

If you are still unable to extract it, let me know.

ngovinhtn commented 6 years ago

@ajinkyakulkarni14, I used https://github.com/ajinkyakulkarni14/How-I-Extracted-TED-talks-for-parallel-Corpus- to extract data for en-ja, but I get this error:

    Traceback (most recent call last):
      File "extractTEDtalk.py", line 25, in <module>
        all_talk_names=enlist_talk_names(path,all_talk_names)
      File "extractTEDtalk.py", line 13, in enlist_talk_names
        r = urllib.request.urlopen(path).read()
      File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.5/urllib/request.py", line 472, in open
        response = meth(req, response)
      File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python3.5/urllib/request.py", line 510, in error
        return self._call_chain(*args)
      File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 429: Rate Limited too many requests.

Could you help me resolve this error, please? Thank you!
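
HTTP 429 means the server is rate-limiting the client, so one possible workaround is to wait and retry instead of failing on the first refusal. The following is only a minimal sketch, not the script author's method: `fetch_with_retry` and its parameters (`max_attempts`, `base_delay`, `opener`, `sleep`) are hypothetical names introduced here for illustration, and the assumption that the server sends a numeric `Retry-After` header is unverified for www.ted.com.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, max_attempts=5, base_delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the response body for url, retrying when the server answers 429.

    Hypothetical helper: the name and parameters are illustrative and are
    not part of extractTEDtalk.py.
    """
    for attempt in range(max_attempts):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Retry only on rate limiting; re-raise any other HTTP error,
            # and give up once the last attempt has failed.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            # Assumption: if a numeric Retry-After header is present, honor it.
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if retry_after and retry_after.isdigit():
                delay = float(retry_after)
            else:
                # Otherwise back off exponentially: 1 s, 2 s, 4 s, ...
                delay = base_delay * (2 ** attempt)
            sleep(delay)
```

Under these assumptions, the call shown in the traceback, `urllib.request.urlopen(path).read()`, could be replaced with `fetch_with_retry(path)`. Adding a short fixed pause between successive talk pages may also reduce how often the limit is hit in the first place.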

ajinkyakulkarni14 commented 6 years ago

I am reopening the project and will update the corpus soon.