ajinkyakulkarni14 / TED-Multilingual-Parallel-Corpus

TED Parallel Corpora is a growing collection of bilingual parallel corpora, multilingual parallel corpora, and monolingual corpora extracted from TED talks (www.ted.com) for 109 world languages.
243 stars 80 forks

why not english corpus #1

Open ttjslbz opened 7 years ago

ttjslbz commented 7 years ago

Dear Sir, why don't you prepare an English-foreign language corpus? I think this is the most common kind of corpus for developers.

Regards

ghost commented 6 years ago

i also missed the english corpus. however, it is easy to download any talk transcript in json format (the url is quoted so the shell does not interpret the "?"): wget --no-clobber --no-check-certificate "https://www.ted.com/talks/2695/transcript.json?language=en"
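For anyone who prefers a script to the wget one-liner, here is a minimal Python sketch of the same download. The URL pattern is taken from the command above; the function names `transcript_url` and `download_transcript` and the output file naming are hypothetical, and whether the endpoint still responds this way is an assumption. Unlike `--no-check-certificate`, this sketch leaves certificate verification on.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlencode


def transcript_url(talk_id: int, language: str) -> str:
    """Build a transcript URL following the pattern quoted in the comment above."""
    query = urlencode({"language": language})
    return f"https://www.ted.com/talks/{talk_id}/transcript.json?{query}"


def download_transcript(talk_id: int, language: str, out_dir: str = ".") -> Path:
    """Save the raw JSON response to disk, skipping files that already exist.

    The skip mirrors wget's --no-clobber. The file name layout
    (<talk_id>.<language>.json) is an assumption made for this sketch.
    """
    target = Path(out_dir) / f"{talk_id}.{language}.json"
    if target.exists():
        return target  # already downloaded, do not overwrite
    with urllib.request.urlopen(transcript_url(talk_id, language)) as response:
        target.write_bytes(response.read())
    return target
```

Calling `download_transcript(2695, "en")` and then `download_transcript(2695, "hu")` would fetch the same talk in two languages, which is the input the timestamp-based alignment discussed later in the thread needs.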

AmitMY commented 6 years ago

I agree with @ttjslbz. @405cddd83a828cec, how can I know if the transcript is aligned to other transcripts?

ajinkyakulkarni14 commented 6 years ago

I am reopening the project and going to update the corpus soon.

Bachstelze commented 5 years ago

How did you align the current corpus? Most of the alignment tools are based on dictionaries or translations:

ghost commented 5 years ago

although the timestamps do not 100% match, you can use the timestamp to align the texts:

i did that for the english-hungarian pair to reconstruct the aligned sentences, and it works pretty well for any language. there is no need for dictionaries or other tools.

here is an example: {"time":676000,"text":"A tuberkulózis előfordulási aránya Pine Ridge-ben"}, {"time":676814,"text":"The tuberculosis rate on Pine Ridge"}
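The timestamp matching described here could be sketched roughly as follows. This is not the commenter's actual code. It assumes each cue is a dict with a `time` field (taken to be milliseconds) and a `text` field, as in the example above. The function name `align_by_time` and the 1500 ms tolerance are hypothetical choices for illustration.

```python
from bisect import bisect_left


def align_by_time(source_cues, target_cues, tolerance_ms=1500):
    """Pair each source cue with the target cue closest to it in time.

    Cues are dicts with "time" and "text" keys, as in the example above.
    A pair is kept only if the timestamps differ by at most tolerance_ms
    (an assumed threshold, not a value from the thread).
    """
    targets = sorted(target_cues, key=lambda cue: cue["time"])
    times = [cue["time"] for cue in targets]
    pairs = []
    for cue in source_cues:
        # Find the insertion point, then compare the neighbours on both sides.
        i = bisect_left(times, cue["time"])
        candidates = [j for j in (i - 1, i) if 0 <= j < len(targets)]
        if not candidates:
            continue  # no target cues at all
        best = min(candidates, key=lambda j: abs(times[j] - cue["time"]))
        if abs(times[best] - cue["time"]) <= tolerance_ms:
            pairs.append((cue["text"], targets[best]["text"]))
    return pairs


hungarian = [{"time": 676000, "text": "A tuberkulózis előfordulási aránya Pine Ridge-ben"}]
english = [{"time": 676814, "text": "The tuberculosis rate on Pine Ridge"}]
pairs = align_by_time(hungarian, english)
```

With the two cues from the example, the timestamps differ by 814 ms, so they end up paired. Note that this sketch lets several source cues match the same target cue; a stricter one-to-one alignment would need extra bookkeeping.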

and here are some of the search results of 'tuberculosis' from my index:

TEXTS:

ghost commented 5 years ago

> I agree with @ttjslbz. @405cddd83a828cec, how can I know if the transcript is aligned to other transcripts?

the transcript is meant to be aligned to the speaker's voice. my experience is that the english-hungarian pair is pretty much aligned. i suppose the other ones are too; it is easy to verify.

stanrunge commented 8 months ago

> I am reopening the project and going to update the corpus soon.

Is there any update to this that I am failing to find?