TED-Parallel-Corpus

The TED Parallel Corpora are a growing collection of bilingual parallel corpora, multilingual parallel corpora, and monolingual corpora extracted from TED talks (www.ted.com) for 109 world languages. The collection includes a monolingual corpus, a bilingual parallel corpus covering 12 languages with over 120 million aligned sentences, and a multilingual parallel corpus covering 13 languages with more than 600k sentences. The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. All pre-processing was done automatically; no manual corrections have been carried out.

Author

Mr. Ajinkya Kulkarni, Contact: ajinkyakulkarni14@gmail.com

Multilingual Parallel Corpus :

12-language aligned parallel corpus data: contains parallel aligned sentences for 12 languages: Arabic (ar); Chinese, Simplified (zh-cn); Chinese, Traditional (zh-tw); Dutch (nl); French (fr); German (de); Hebrew (he); Italian (it); Japanese (ja); Korean (ko); Russian (ru); Spanish (es).

Sentences : 349049

4-language aligned parallel corpus data: contains parallel aligned sentences for 4 East Asian languages: Chinese, Simplified (zh-cn); Chinese, Traditional (zh-tw); Japanese (ja); Korean (ko).

Sentences : 389764
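The on-disk layout of the aligned data is not described here, so the following is only a minimal sketch under a stated assumption: each language is stored as its own sequence of lines, one sentence per line, with line i in every language being a translation of the same sentence. The function name `iter_aligned` and the sample sentences are hypothetical and are not part of the corpus.

```python
from typing import Dict, Iterable, Iterator


def iter_aligned(sources: Dict[str, Iterable[str]]) -> Iterator[Dict[str, str]]:
    """Yield one dict per aligned sentence, keyed by language code.

    ASSUMPTION: every source yields one sentence per line and the lines
    are aligned by position. Iteration stops at the shortest source.
    """
    langs = list(sources)
    for lines in zip(*(sources[lang] for lang in langs)):
        # Strip only the trailing line terminator, keep the sentence text.
        yield {lang: line.rstrip("\r\n") for lang, line in zip(langs, lines)}


if __name__ == "__main__":
    # In-memory stand-ins for open text files (hypothetical content).
    sample = {
        "fr": ["Bonjour.\n", "Merci.\n"],
        "de": ["Hallo.\n", "Danke.\n"],
    }
    for row in iter_aligned(sample):
        print(row)
```

Because the function accepts any iterable of lines, open file handles (opened with `encoding="utf-8"`) can be passed in place of the lists, which avoids loading a whole corpus into memory.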

Bilingual Parallel Corpus :

Language 1	Language 2	Sentences	Language 1	Language 2	Sentences
Russian	Spanish	523485	Korean	French	462616
Arabic	Hebrew	512358	Korean	Hebrew	485919
Dutch	Russian	442167	Spanish	Hebrew	486466
Arabic	Russian	555618	Dutch	Chinese, Traditional	406528
Hebrew	Spanish	486466	Hebrew	German	449485
Spanish	Chinese, Simplified	479771	Hebrew	French	464923
Spanish	Russian	523485	Hebrew	Italian	480730
Russian	Chinese, Simplified	533541	Russian	Italian	523015
German	Chinese, Traditional	438420	Dutch	Spanish	415347
Italian	Spanish	477021	Chinese, Simplified	Chinese, Traditional	464982
Spanish	French	463476	Chinese, Simplified	German	442415
Arabic	Dutch	411929	Korean	Spanish	486162
Chinese, Traditional	Arabic	473423	Hebrew	Dutch	415768
French	Italian	458939	German	Hebrew	449485
Russian	Dutch	442167	Chinese, Traditional	Italian	455363
Dutch	Italian	407669	Arabic	Italian	486628
Russian	Arabic	555618	Arabic	Chinese, Traditional	473423
Chinese, Traditional	Spanish	465481	Chinese, Traditional	Russian	506240
German	Chinese, Simplified	442415	Spanish	Dutch	415347
French	German	442292	Dutch	Hebrew	415768
Chinese, Simplified	French	458083	Spanish	German	452661
Arabic	Spanish	491987	Russian	Chinese, Traditional	506240
Chinese, Simplified	Dutch	406971	Hebrew	Chinese, Traditional	473169
German	Arabic	445899	Arabic	German	445899
German	Dutch	411134	Chinese, Simplified	Italian	473247
Italian	Chinese, Simplified	473247	Arabic	French	469558
Chinese, Traditional	Dutch	406528	Hebrew	Russian	541540
French	Hebrew	464923	Italian	Hebrew	480730
Hebrew	Arabic	512358	French	Arabic	469558
Chinese, Simplified	Hebrew	496348	Russian	Hebrew	541540
Hebrew	Chinese, Simplified	496348	German	Russian	479543
Chinese, Simplified	Arabic	502194	Spanish	Italian	477021
French	Chinese, Traditional	448751	Dutch	Arabic	411929
Italian	German	444088	Chinese, Traditional	German	438420
Dutch	Chinese, Simplified	406971	Spanish	Arabic	491987
Chinese, Traditional	Hebrew	473169	Russian	German	479543
German	French	442292	Chinese, Traditional	French	448751
Spanish	Chinese, Traditional	465481	Spanish	Korean	486162
Dutch	German	411134	French	Dutch	409715
Italian	Chinese, Traditional	455363	Italian	Dutch	407669
French	Russian	500195	French	Spanish	463476
German	Spanish	452661	Russian	French	500195
Chinese, Traditional	Chinese, Simplified	464982	Italian	Russian	523015
Arabic	Chinese, Simplified	502194	German	Italian	444088
French	Chinese, Simplified	458083	Italian	French	458939
Chinese, Simplified	Spanish	479771	Chinese, Simplified	Russian	533541
Hebrew	Korean	485919	Dutch	French	409715
French	Korean	462616	Italian	Arabic	486628
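Several pairs in the table above appear in both orders with the same sentence count (for example Russian/Spanish and Spanish/Russian, both 523485). As a rough sketch, assuming the rows are tab-separated with two (Language 1, Language 2, Sentences) groups per row as shown, the table can be collapsed into unordered pairs. The function name `unique_pairs` is hypothetical.

```python
from typing import Dict, FrozenSet, Iterable


def unique_pairs(rows: Iterable[str]) -> Dict[FrozenSet[str], int]:
    """Map each unordered language pair to its sentence count.

    ASSUMPTION: rows are tab-separated and hold one or two
    (language 1, language 2, sentences) groups side by side.
    """
    pairs: Dict[FrozenSet[str], int] = {}
    for row in rows:
        cells = row.rstrip("\r\n").split("\t")
        for i in range(0, len(cells) - 2, 3):
            lang1, lang2, count = cells[i:i + 3]
            if not count.isdigit():
                continue  # skip the header row or malformed cells
            # frozenset ignores order, so A/B and B/A share one key.
            pairs[frozenset((lang1, lang2))] = int(count)
    return pairs


if __name__ == "__main__":
    sample_rows = [
        "Language 1\tLanguage 2\tSentences\tLanguage 1\tLanguage 2\tSentences",
        "Russian\tSpanish\t523485\tKorean\tFrench\t462616",
        "Spanish\tRussian\t523485\tHebrew\tItalian\t480730",
    ]
    for pair, count in unique_pairs(sample_rows).items():
        print(sorted(pair), count)
```

If the two directions of a pair ever carried different counts, the later row would silently overwrite the earlier one, so a stricter version might compare the values before storing them.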

Monolingual Corpus :

Language	Sentences	Language	Sentences	Language	Sentences
Azerbaijani	20852	Swahili	7204	Assamese	57
Chinese, Yue	20940	Czech	272464	Khmer	614
Latgalian	9	Silesian	91	Norwegian Nynorsk	3012
Chinese, Simplified	507085	Basque	12303	Occitan	54
Algerian Arabic	1716	Macedonian	69086	Hupa	3
Belarusian	10965	Montenegrin	4181	Danish	128916
Macedo	3068	Finnish	61604	Igbo	68
Croatian	326967	Hungarian	398138	Asturian	232
Malayalam	7218	Punjabi	48	Serbian	359791
Turkish	433023	Russian	609744	Irish	256
Bulgarian	475860	Bislama	49	Kazakh	6993
Tagalog	2397	Afrikaans	2903	Filipino	7513
Nepali	4350	French	493026	Icelandic	4957
Vietnamese	349731	German	471902	Mongolian	19737
Albanian	148541	Esperanto	18966	French (Canada)	68316
Slovak	175052	Georgian	37013	Telugu	4104
Maltese	343	Latin	46	Serbo-Croatian	5239
Swedish Chef	375	Cebuano	203	Tamil	20805
Somali	4545	Uyghur	1410	Bosnian	20522
Hindi	48513	Galician	22368	Slovenian	63981
Tibetan	2085	Romanian	454412	Indonesian	236543
Catalan	89358	Lao	854	Tatar	277
Ingush	377	Ukrainian	282163	Kyrgyz	1480
Tajik	1147	Kannada	3716	Hausa	51
Arabic	553483	Gujarati	5636	Klingon	131
Amharic	1596	Italian	501685	Dutch	433318
Latvian	60171	Marathi	22345	Swedish	121479
Estonian	33236	Lithuanian	116956	Sinhala	1602
Creole, Haitian	417	Malagasy	729	Persian	362411
Uzbek	6201	Bengali	17107	Hebrew	535665
Pashto	491	Armenian	69923
Spanish	521162	Luxembourgish	217
Thai	237086	Portuguese, Brazilian	476576
Burmese	41266	Urdu	19861
Portuguese	250967	Chinese, Traditional	483199
Norwegian Bokmal	47441	Malay	23502

Conditions of use

The TED-Multilingual-Parallel-Corpus contains text from the publicly accessible source www.ted.com. All data have been processed automatically in such a way that the original source texts cannot be reconstructed. The data are made available on the condition that they are used for scientific purposes only and are not passed on to third parties. Any use of the data must be duly documented and referenced.

Disclaimer

The TED-Multilingual-Parallel-Corpus has been processed automatically from the publicly accessible source www.ted.com, based on the outlined methodology, without considering in detail the content of the contained text. No responsibility is taken for the content of the data. In particular, the views and opinions expressed in specific parts of the data remain exclusively with the authors. For each word, the list of words that significantly co-occur with that word is computed on the basis of the available text and expresses neither a general fact of language nor the particular view of the author. Please let us know if you find problems with the data or if you want the data for other language pairs.
