How to get entire parallel text corpus after the titles.txt

clab / wikipedia-parallel-titles

Tools for extracting parallel corpora from article titles across languages in Wikipedia


How to get entire parallel text corpus after the titles.txt #7

Open StephennFernandes opened 4 years ago

StephennFernandes commented 4 years ago

This is not an issue, just a request for help. Do you have any reference scripts or resources for building a full parallel text corpus for machine translation? Ideally something that takes the parallel titles as arguments and passes them to a text extractor, to pull the article text in both languages from Wikipedia.

I am building a machine translation system, so any help would be much appreciated.

Thanks

AriNubar commented 1 year ago

Hi, I also am interested in this. Have you found a script or written one, perhaps?

StephennFernandes commented 1 year ago

Hey, actually I abandoned those plans and ended up building generic machine translation systems for cases where generic parallel corpora were available.

Still, an implementation of this would be great. I'll add something here once I come up with a script.

Btw, try asking ChatGPT, just in case.
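
For anyone landing here with the same question, below is a minimal sketch of one possible approach, not a script from this repository. It assumes each line of titles.txt has the form `source title ||| target title`, and it pulls plain article text through the MediaWiki API's `extracts` property (TextExtracts) rather than from a dump. The function names, the User-Agent string, and the language codes in the usage note are hypothetical. Keep in mind that articles with the same title in two languages are usually comparable rather than sentence-aligned, so a sentence alignment step would still be needed before training.

```python
import json
import urllib.parse
import urllib.request

SEPARATOR = " ||| "  # assumed delimiter between the two titles in titles.txt


def parse_title_pairs(lines):
    """Yield (source_title, target_title) from titles.txt-style lines."""
    for line in lines:
        line = line.strip()
        if not line or SEPARATOR not in line:
            continue  # skip blank or malformed lines
        src, tgt = line.split(SEPARATOR, 1)
        yield src.strip(), tgt.strip()


def extract_url(lang, title):
    """Build a MediaWiki API URL that requests the plain-text extract of a page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # plain text instead of HTML
        "redirects": 1,    # follow redirects to the real article
        "format": "json",
        "titles": title,
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)


def fetch_plain_text(lang, title):
    """Download one article's plain text; returns '' if the page has no extract."""
    request = urllib.request.Request(
        extract_url(lang, title),
        headers={"User-Agent": "parallel-titles-sketch/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        data = json.load(response)
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("extract", "")
    return ""


def build_comparable_corpus(titles_path, src_lang, tgt_lang):
    """Yield (source_text, target_text) for each title pair with text on both sides."""
    with open(titles_path, encoding="utf-8") as handle:
        for src_title, tgt_title in parse_title_pairs(handle):
            src_text = fetch_plain_text(src_lang, src_title)
            tgt_text = fetch_plain_text(tgt_lang, tgt_title)
            if src_text and tgt_text:
                yield src_text, tgt_text
```

Usage would look like `for src, tgt in build_comparable_corpus("titles.txt", "de", "en"): ...`, with the language codes swapped for your pair. For more than a few thousand articles, working from the Wikipedia dumps is likely a better fit than one API request per page.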