why not english corpus - Githubissues

ttjslbz commented 7 years ago

Dear Sir, why don't you prepare an English-foreign language corpus? I think this is the most common kind of corpus for developers.

Regards

ghost commented 6 years ago

I also missed the English corpus. However, it is easy to download any talk transcript in JSON format: wget --no-clobber --no-check-certificate 'https://www.ted.com/talks/2695/transcript.json?language=en'
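
For anyone who prefers Python over wget, here is a rough sketch of the same download. The URL pattern is copied from the command above; the helper names `transcript_url` and `fetch_transcript` are made up for illustration, the `hu` language code is an assumption, and the endpoint may no longer respond the way it did when this was posted. Unlike the wget call, this keeps certificate verification on.

```python
import json
import urllib.request


def transcript_url(talk_id, language="en"):
    # URL pattern taken from the wget command in this thread.
    return f"https://www.ted.com/talks/{talk_id}/transcript.json?language={language}"


def fetch_transcript(talk_id, language="en"):
    # Hypothetical helper: downloads and parses one transcript.
    # Raises urllib.error.URLError / HTTPError if the endpoint is unavailable.
    with urllib.request.urlopen(transcript_url(talk_id, language)) as resp:
        return json.load(resp)


# Building the URL needs no network access:
print(transcript_url(2695, "hu"))
```

Calling `fetch_transcript(2695, "en")` and `fetch_transcript(2695, "hu")` would then give the two files to compare, assuming both languages exist for that talk.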

AmitMY commented 6 years ago

I agree with @ttjslbz. @405cddd83a828cec, how can I know whether the transcript is aligned to the other transcripts?

ajinkyakulkarni14 commented 6 years ago

I am reopening the project and going to update the corpus soon.

Bachstelze commented 5 years ago

How did you align the current corpus? Most of the alignment tools are based on dictionaries or translations:

https://github.com/danielvarga/hunalign
https://github.com/rsennrich/bleualign

or are complex, like https://github.com/anoidgit/yasa

More promising is a multilingual embedding, but this seems to be very hardware intensive: https://github.com/facebookresearch/LASER/tree/master/tasks/bucc

ghost commented 5 years ago

Although the timestamps do not match 100%, you can use them to align the texts:

I did that for English-Hungarian to reconstruct the aligned sentences, and it works pretty well for any language. There is no need for dictionaries or other tools.

here is an example: {"time":676000,"text":"A tuberkulózis előfordulási aránya Pine Ridge-ben"}, {"time":676814,"text":"The tuberculosis rate on Pine Ridge"}
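
A minimal sketch of the timestamp idea, not the exact script used for the index in this comment. It assumes each transcript has already been flattened into a list of `{"time": ..., "text": ...}` cues like the two in the example (the real transcript.json layout may differ), and it pairs each cue with the nearest cue in the other language if the gap is within a tolerance. The name `align_by_time` and the 1500 ms default are assumptions. It pairs subtitle cues only; merging cues into full sentences is a separate step not shown here.

```python
from bisect import bisect_left


def align_by_time(src, tgt, tolerance_ms=1500):
    """Pair each cue in src with the nearest-in-time cue in tgt."""
    tgt = sorted(tgt, key=lambda c: c["time"])
    times = [c["time"] for c in tgt]
    pairs = []
    for cue in sorted(src, key=lambda c: c["time"]):
        i = bisect_left(times, cue["time"])
        # The nearest target cue is one of the two neighbours of position i.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(tgt)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(times[k] - cue["time"]))
        # Keep the pair only if the timestamps are close enough.
        if abs(times[j] - cue["time"]) <= tolerance_ms:
            pairs.append((cue["text"], tgt[j]["text"]))
    return pairs


hu = [{"time": 676000, "text": "A tuberkulózis előfordulási aránya Pine Ridge-ben"}]
en = [{"time": 676814, "text": "The tuberculosis rate on Pine Ridge"}]
print(align_by_time(en, hu))
```

With the two cues above the gap is 814 ms, so they are paired under the default tolerance and dropped under a stricter one such as `tolerance_ms=100`.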

and here are some of the search results of 'tuberculosis' from my index:

TEXTS:

But let's stick first to TUBERCULOSIS. -> De maradjunk a tbc-nél. (bart weetjens how i taught rats to sniff out land mines.txt)
Let's consider the big three: HIV, malaria, TUBERCULOSIS. -> Nézzük csak meg a nagy hármast: HIV, malária, tuberkulózis. (mark kendall demo a needle free vaccine patch that s safer and way cheaper.txt)
This is more than HIV/AIDS, malaria and TUBERCULOSIS combined. -> Ez több, mint a HIV/AIDS, malária és tuberkolózis együtt. (josette sheeran ending hunger now.txt)
She herself was suffering from HIV; she was suffering from TUBERCULOSIS. -> A lány szenvedett a HIV-től, szenvedett a tuberkulózistól. (gordon brown.txt)
I began documenting the close connection between HIV/AIDS and TUBERCULOSIS. -> Elkezdtem dokumentálni a szoros kapcsolatot HIV/AIDS és tüdőbaj fertőzés között. (james nachtwey s searing pictures of war.txt)
It would give us an unfair advantage against battling HIV/AIDS, TUBERCULOSIS and other epidemics. -> Ez hallatlanul nagy előnyhöz juttatna minket a HIV/AIDS, a tuberkulózis és más járványok elleni harcban. (andreas raptopoulos no roads there s a drone for that.txt)
So it was the spread of TUBERCULOSIS and the spread of cholera that I was responsible for inhibiting. -> Így a tuberkulózis és a kolera terjedésének megállításáért voltam felelős. (gary slutkin let s treat violence like a contagious disease.txt)
The TUBERCULOSIS rate on Pine Ridge is approximately eight times higher than the US national average. -> A tuberkulózis előfordulási aránya Pine Ridge-ben nagyjából nyolcszor magasabb, mint az amerikai nemzeti átlag. (aaron huey.txt)
He was haunted by the loss of his mother and his wife, who both died of TUBERCULOSIS at the age of 24. -> Az édesanyja és a felesége halála kísértette, akik mindketten tuberkulózisban haltak meg, 24 évesen. (scott peeples why should you read edgar allan poe.txt)

ghost commented 5 years ago

I agree with @ttjslbz. @405cddd83a828cec, how can I know whether the transcript is aligned to the other transcripts?

The transcript is meant to be aligned to the speaker's voice. In my experience the English-Hungarian pair is pretty well aligned. I suppose the other ones are too... it is easy to verify.
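
One rough way to do that check, as a sketch: take the cue timestamps of two language versions of the same talk and measure what fraction of one side has a counterpart within some tolerance on the other. The function name `coverage` and the 1500 ms default are assumptions, and the second timestamp in the sample call is invented to show a miss.

```python
def coverage(src_times, tgt_times, tolerance_ms=1500):
    """Fraction of src timestamps that have a tgt timestamp within tolerance."""
    if not src_times:
        return 0.0
    hits = sum(
        1
        for t in src_times
        if any(abs(t - u) <= tolerance_ms for u in tgt_times)
    )
    return hits / len(src_times)


# 676814 is within 1500 ms of 676000; the made-up 700000 has no counterpart.
print(coverage([676814, 700000], [676000]))  # 0.5
```

A value close to 1.0 for a language pair would suggest the timestamps line up well enough for the approach described earlier in this thread; a low value would suggest they do not.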

stanrunge commented 8 months ago

I am reopening the project and going to update the corpus soon.

Is there any update to this that I am failing to find?

ajinkyakulkarni14 / TED-Multilingual-Parallel-Corpus

why not english corpus #1