Separate documents? - Githubissues

fsimonjetz commented 7 years ago

The readme says "All data have been processed automatically so that it is not possible to reconstruct the original source texts." I'm considering using the German-Korean data for my PhD project; however, for what I have in mind it would be helpful to have the documents separated. Is this information available? Even stand-off indices would be nice. I hope you can keep up this project, it looks like a promising resource!

ajinkyakulkarni14 commented 7 years ago

@fsimonjetz Thank you for your interest in this project.

I would suggest using the following script to generate your own parallel data pairs: https://github.com/ajinkyakulkarni14/How-I-Extracted-TED-talks-for-parallel-Corpus-

If you are still unable to extract the data, let me know.

ngovinhtn commented 6 years ago

@ajinkyakulkarni14, I used https://github.com/ajinkyakulkarni14/How-I-Extracted-TED-talks-for-parallel-Corpus- to extract data for en-ja, but I get this error:

    Traceback (most recent call last):
      File "extractTEDtalk.py", line 25, in <module>
        all_talk_names=enlist_talk_names(path,all_talk_names)
      File "extractTEDtalk.py", line 13, in enlist_talk_names
        r = urllib.request.urlopen(path).read()
      File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.5/urllib/request.py", line 472, in open
        response = meth(req, response)
      File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python3.5/urllib/request.py", line 510, in error
        return self._call_chain(*args)
      File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 429: Rate Limited too many requests.

Could you help me resolve this error, please? Thank you!

ajinkyakulkarni14 commented 6 years ago

I am reopening the project and will update the corpus soon.
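
In the meantime, an HTTP 429 generally means the server is throttling the client, so a common workaround is to wait and retry with increasing delays. Below is a minimal sketch of that idea, not the extraction script's actual code: the function name `fetch_with_backoff`, its parameters, and the default retry count and delay are all assumptions chosen for illustration, and the right values for the site being scraped are unknown.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, max_retries=5, base_delay=1.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying on HTTP 429 with exponential backoff.

    Hypothetical helper: opener and sleep are injectable so the
    function can be exercised offline without real network calls.
    """
    for attempt in range(max_retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Only rate-limit responses are retried; other errors propagate,
            # as does a 429 once the retry budget is exhausted.
            if err.code != 429 or attempt == max_retries:
                raise
            # Prefer the server's Retry-After hint (in seconds) when present.
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if retry_after and retry_after.isdigit():
                delay = float(retry_after)
            else:
                # Assumed schedule: base_delay, 2x, 4x, ...
                delay = base_delay * (2 ** attempt)
            sleep(delay)
```

If this approach suits your case, the call shown in the traceback, `urllib.request.urlopen(path).read()`, could be swapped for `fetch_with_backoff(path)`. Adding a short pause between successive talk pages may also help avoid hitting the limit in the first place.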

ajinkyakulkarni14 / TED-Multilingual-Parallel-Corpus

Separate documents? #2