Open ttjslbz opened 7 years ago
I also missed the English corpus. However, it is easy to download any talk transcript in JSON format: wget --no-clobber --no-check-certificate https://www.ted.com/talks/2695/transcript.json?language=en
I agree with @ttjslbz. @405cddd83a828cec, how can I know if the transcript is aligned to other transcripts?
I am reopening the project and am going to update the corpus soon.
How did you align the current corpus? Most of the alignment tools are based on dictionaries or translations:
Although the timestamps do not match 100%, you can use them to align the texts.
I did that for English-Hungarian to reconstruct the aligned sentences, and it works pretty well for any language. There is no need for dictionaries or other tools.
Here is an example: {"time":676000,"text":"A tuberkulózis előfordulási aránya Pine Ridge-ben"}, {"time":676814,"text":"The tuberculosis rate on Pine Ridge"}
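The timestamp idea can be sketched roughly as follows. This is only an illustration, not the code used for the corpus: it assumes each transcript has already been flattened into a list of {"time", "text"} entries like the two in the example, and the function name align_by_time and the 2000 ms tolerance are made up for this sketch.

```python
from bisect import bisect_left


def align_by_time(source, target, tolerance_ms=2000):
    """Pair each source cue with the target cue closest in time.

    Assumption: both inputs are flat lists of {"time": ms, "text": str}
    dicts. The tolerance is an arbitrary illustrative value.
    """
    target = sorted(target, key=lambda cue: cue["time"])
    times = [cue["time"] for cue in target]
    pairs = []
    for cue in source:
        i = bisect_left(times, cue["time"])
        # Only the neighbours around the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(target)]
        if not candidates:
            continue
        best = min(candidates, key=lambda j: abs(times[j] - cue["time"]))
        # Skip cues whose nearest counterpart is too far away in time.
        if abs(times[best] - cue["time"]) <= tolerance_ms:
            pairs.append((cue["text"], target[best]["text"]))
    return pairs


hu = [{"time": 676000, "text": "A tuberkulózis előfordulási aránya Pine Ridge-ben"}]
en = [{"time": 676814, "text": "The tuberculosis rate on Pine Ridge"}]
pairs = align_by_time(hu, en)
```

With the two cues above, the 814 ms gap is inside the tolerance, so they end up paired. A tighter tolerance would drop them instead of guessing.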
And here are some of the search results for 'tuberculosis' from my index:
TEXTS:
The transcript is meant to be aligned to the speaker's voice. My experience is that the English-Hungarian pair is pretty much aligned. I suppose the other ones are too... It is easy to verify.
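One possible way to verify this, as a rough sketch: count how many cues in one language have a cue in the other language within some time window. The function name match_rate, the 2000 ms window, and the flat {"time", "text"} list layout are assumptions for illustration, and the cue lists below are invented toy data, not real transcript content.

```python
from bisect import bisect_left


def match_rate(source, target, tolerance_ms=2000):
    """Fraction of source cues with a target cue within tolerance_ms.

    Assumption: both inputs are flat lists of {"time": ms, "text": str}
    dicts; the tolerance is an arbitrary illustrative value.
    """
    times = sorted(cue["time"] for cue in target)
    if not source or not times:
        return 0.0
    matched = 0
    for cue in source:
        i = bisect_left(times, cue["time"])
        # Distance to the nearest target timestamp on either side.
        nearest = min(
            abs(times[j] - cue["time"]) for j in (i - 1, i) if 0 <= j < len(times)
        )
        if nearest <= tolerance_ms:
            matched += 1
    return matched / len(source)


# Invented toy cues: two of the three source cues have a close counterpart.
a = [{"time": 0, "text": "a0"}, {"time": 5000, "text": "a1"}, {"time": 10000, "text": "a2"}]
b = [{"time": 300, "text": "b0"}, {"time": 5200, "text": "b1"}, {"time": 20000, "text": "b2"}]
rate = match_rate(a, b)
```

A rate close to 1.0 would suggest the two transcripts share the same timing. A low rate would suggest that timestamps alone are not enough for that language pair.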
Is there any update to this that I am failing to find?
Dear Sir, why don't you prepare an English-to-foreign-language corpus? I think this is the most common kind of corpus for developers.
Regards