clab / wikipedia-parallel-titles

Tools for extracting parallel corpora from article titles across languages in Wikipedia

how to parallel articles #3

Open Tellyang7 opened 4 years ago

Tellyang7 commented 4 years ago

There is no doubt that this work is very powerful, and I successfully ran the Chinese-to-English title extraction. My problem is that the titles contain too little text. Is there any way to extract parallel text from the article content as well? How should I go about it?

VP007-py commented 4 years ago

@wammar any updates on this?

Since the titles are very short in most cases, they wouldn't suffice to build a well-crafted MT system.

bittlingmayer commented 4 years ago

Automatically aligning sentence pairs is a non-trivial task, perhaps a few orders of magnitude larger than this repo.

For getting sentence pairs automatically aligned from within articles, I recommend WikiMatrix.

See https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix and https://ai.facebook.com/blog/wikimatrix/

The main downside in my view is that it doesn't cover low-resource languages/pairs.

(There is also the newer and larger CCMatrix in the same repo, but its extraction script is not ready yet, and it covers even fewer language pairs.)
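
For anyone trying the WikiMatrix route, here is a minimal sketch of filtering one of its bitext files by alignment score. It assumes the three-column TSV layout (margin score, source sentence, target sentence) described in the WikiMatrix README. The function names `filter_pairs` and `read_pairs`, the file path passed in, and the default threshold of 1.04 are illustrative assumptions to adjust for your language pair. It is not part of this repo or of LASER's own extraction script.

```python
import gzip


def filter_pairs(lines, threshold=1.04):
    """Yield (source, target) pairs whose margin score is >= threshold.

    Assumes each line is "score<TAB>source<TAB>target"; lines that do not
    match that shape are skipped rather than raising an error.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # malformed line
        try:
            score = float(fields[0])
        except ValueError:
            continue  # first column is not a number
        if score >= threshold:
            yield fields[1], fields[2]


def read_pairs(path, threshold=1.04):
    """Read a gzipped TSV (path is a hypothetical local file) into a list."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(filter_pairs(handle, threshold))


# Made-up sample lines, only to show the expected shape of the input.
sample = [
    "1.20\tHello world.\t你好，世界。\n",
    "1.01\tUnrelated sentence.\t另一句话。\n",
    "not-a-score\tbroken\tline\n",
]
pairs = list(filter_pairs(sample))
```

A higher threshold keeps fewer but cleaner pairs, and a lower one trades precision for volume, so it is worth inspecting a sample of the output before training on it.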

wammar commented 4 years ago

Sorry for the slow reply and thanks for the suggestion @Tellyang7! Expanding this to cover article content will definitely give richer text, at the expense of higher complexity in deciding which parts are parallel. I'm not actively working on this but feel free to contribute a new script and I'd be happy to merge.