bencabrera / grawitas

Grawitas is a lightweight, fast parser for Wikipedia talk pages that takes the raw Wikipedia-syntax and outputs the structured content in various formats.
MIT License

Multilingual support #9

Open matanox opened 6 years ago

matanox commented 6 years ago

This is half a note to myself about getting this codebase, or at least the CLI portion, to work for at least one non-English Wikipedia (Hebrew). Currently the CLI does not crash, but it extracts zero conversations from the entire Hebrew Wikipedia XML dump.

For good reasons beyond my current knowledge, people have decided to use language-specific title prefixes to indicate that a page is a "Talk page", rather than keeping one uniform word for the same semantic role across all Wikipedia sites. Alas, this degree of localization is probably natural to Wikipedia editors.

A simple Linux grep shows that the current codebase is in fact English-specific, whereas Wikipedia is highly localized into a plethora of languages, each language having its own Wikipedia site prefixed by its two-letter language code and maintained by its own community of translators and editors. In fact, Wikipedia pushes hard to be a multilingual project.

After cloning the repo, the English-specificity of the code is easy to see: grep -r "Talk:" src yields many results ―

src/core/output/formats.cpp: if(title.substr(0, 5) == "Talk:")
src/core/parsing/xmlDumpParserWrapper.cpp: return title.substr(0,5) == "Talk:";
src/crawler/talkpageFetcher.cpp: if(title.substr(0,5) == "Talk:")
src/crawler/talkpageFetcher.cpp: parameters << "Talk:" << QUrl::toPercentEncoding(QString::fromStdString(title)).toStdString();
src/crawler/crawling.cpp: if(title.substr(0,5) == "Talk:")

In contrast, judging from at least one language, namely Hebrew, it is clear that at least the Wikipedia dump files (and likely the Wikipedia API and the source data) use a localized translation of the word "Talk" to signal a Talk page: for the Hebrew Wikipedia this happens to be the word "שיחה".

This explains why no conversations get extracted from the Hebrew Wikipedia dump file.

Further, it appears that these names are defined in language-specific localized namespaces, such as the namespace page that applies to the Hebrew Wikipedia; the mapping between the English and Hebrew namespaces can be observed clearly on that last page. With quick help from Google Translate, it seems that Spanish has its own namespaces too, although no mapping to the English namespaces is present in that particular table.

I'm unaware whether there's a data source (DBpedia? Wikibase?) holding the canonical per-language mapping of the namespaces; someone working for Wikipedia might know better. But as a first step I'll replace the English word with the Hebrew one in my fork of the repo and see whether this makes the CLI parser "catch" conversations from a recent dump of the Hebrew Wikipedia.

I think this is a great project, and it would become even more valuable and useful if it worked for additional languages. I hope to post an update on my continued experimentation.

matanox commented 6 years ago

Update: merely changing the CLI-relevant lines to use the localized prefix "שיחה:" instead of "Talk:" ―

src/core/output/formats.cpp: if(title.substr(0, 5) == "Talk:")
src/core/parsing/xmlDumpParserWrapper.cpp: return title.substr(0,5) == "Talk:";

has no effect on a recent Hebrew Wikipedia dump... still nothing gets extracted. More investigation is due.
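A guess worth checking, looking at the grep lines above: the hard-coded 5 is the byte length of "Talk:", but "שיחה:" is 9 bytes in UTF-8, so if the substr(0, 5) part was left unchanged the comparison can never match. A minimal, illustrative sketch of a prefix check that derives the length from the prefix itself (not the actual grawitas code):

    #include <string>

    // Derive the length from the prefix itself, so the check also works for a
    // multi-byte UTF-8 prefix such as "שיחה:" (9 bytes) and not only the
    // 5-byte "Talk:".
    bool has_prefix(const std::string& title, const std::string& prefix)
    {
        return title.compare(0, prefix.size(), prefix) == 0;
    }

    // e.g. in formats.cpp / xmlDumpParserWrapper.cpp one could then write:
    //   if (has_prefix(title, "שיחה:")) { ... }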

bencabrera commented 6 years ago

I agree that it would certainly be great to add other languages to the parser. However, as always, the problems are in the details. You already found the parts where we filter articles to only include talk pages. Also, we are currently only contacting the English Wikipedia API in the crawler, so we would have to change that. However, the biggest problem is certainly the parser itself. Since it is based on a grammar, we would have to rewrite (part of) it for every new language we include. This should not be too hard for languages that are structurally similar to English (German, all the Romance languages, etc.) but could be harder for a language like Hebrew.
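For the crawler part, roughly, the language code and the localized talk prefix would have to become parameters instead of hard-coded English values, something along these lines (a sketch only, not the current talkpageFetcher interface):

    #include <string>

    // Sketch of a per-language request URL; "he" and "שיחה" would target the
    // Hebrew Wikipedia. Percent-encoding of the title is omitted for brevity.
    std::string talk_page_raw_url(const std::string& lang_code,
                                  const std::string& talk_prefix,
                                  const std::string& title)
    {
        return "https://" + lang_code + ".wikipedia.org/w/index.php?title="
             + talk_prefix + ":" + title + "&action=raw";
    }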

However, this is not to say that it is impossible, and I would be more than happy to help. The first place to start would be the grammar currently residing in src/core/parsing/grammars/.

matanox commented 6 years ago

Thanks a lot, Benjamin. What concerns me most is that the research article mentions that Talk pages do not share a uniform structure, as they somehow depend both on the person setting up the Talk page (?!) and on how consistently the people commenting on it follow convention (?). This leads me to think either that the existing grammars should in a way be considered "heuristic", since they would only work for some cases given that Talk page conversations don't strictly follow a formal language, or that re-crafting them for a new language might be nearly as laborious as the initial grammar development itself. I guess the actual truth is somewhere in between...

Could you perhaps comment on these tentative understandings? I think it would be greatly helpful.

That aside, I agree on all counts with your comment, and I suggest that localizing only the CLI portions might be enough for many use cases.

Thanks again!

amire80 commented 6 years ago

Hi,

@matanster asked for my input here. I've been a Wikipedian for thirteen years, and a MediaWiki developer for some of that time.

First, about identifying talk pages in other languages. This should be fairly simple. Namespaces in MediaWiki have not only names, but also numbers. The standard namespaces that are present on all installations have standard numbers. You can see them here: https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/Defines.php (search for NS_MAIN and examine the lines after it).

I haven't looked at the dumps lately, but as far as I can recall, the namespace number is saved as a separate field with every page element in the XML dump. Instead of looking at the namespace name in the page title, you should look at the number. If the namespace number is not included in the dump... then this sucks, and it's a facepalm for us MediaWiki developers. Perhaps the dump-processing code in Pywikibot can give more clues as to how to filter pages by namespace.
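To illustrate, assuming the page elements in the dump do carry an <ns> child (e.g. <ns>1</ns>, since NS_TALK is 1 in Defines.php), the filter could read that number instead of the title prefix. This is only a sketch, not grawitas's actual parsing code:

    #include <optional>
    #include <string>

    constexpr int NS_TALK = 1; // article talk pages, per Defines.php

    // Extract the namespace number from one <page>...</page> fragment of the dump.
    std::optional<int> namespace_of(const std::string& page_xml)
    {
        const std::string open = "<ns>", close = "</ns>";
        auto start = page_xml.find(open);
        if (start == std::string::npos) return std::nullopt; // old dump without <ns>
        start += open.size();
        auto end = page_xml.find(close, start);
        if (end == std::string::npos) return std::nullopt;
        return std::stoi(page_xml.substr(start, end - start));
    }

    bool is_talk_page(const std::string& page_xml)
    {
        auto ns = namespace_of(page_xml);
        return ns.has_value() && *ns == NS_TALK;
    }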

amire80 commented 6 years ago

Now, about talk page structure.

Wikipedia started in 2001, that is, before modern social networks. Even forum sites were not as common back then: for example, phpBB started just a bit earlier, in 2000. Perhaps if phpBB had been more mature when Wikipedia started, Wikipedia's developers would have just integrated it into the Wikipedia code, but for whatever reason they didn't feel the need to do something like that at the time, and instead made talk pages' structure super-flexible, as if having almost no structure at all.

This flexibility has advantages, and lots of experienced Wikipedians love it, but in the long run it has major disadvantages as well. The biggest one is that Wikipedia's talk pages are very different from discussion spaces on all other websites, so people who join Wikipedia as editors cannot reuse their skills from other sites, and have to learn something new and weird. And the other disadvantage is what you are encountering here: it's almost impossible for software to process talk pages properly. Software can make some educated guesses, but there is no complete solution for truly structured parsing of talk pages.

There have been two projects to make talk pages more structured: LiquidThreads, which is definitely discontinued, and Structured Discussions (a.k.a. Flow), which is still in development, but not expected to replace talk pages in a massive way any time soon.

amire80 commented 6 years ago

Oh, another comment about namespace names and numbers: it may seem like even namespace numbers (0, 2, 4, etc.) are for content namespaces and odd numbers (1, 3, 5, etc.) are for talk namespaces. This is true for the namespaces created by default, but don't rely on it: there are wikis in which talk namespaces have even numbers. So always use explicit numbers and don't assume anything about even and odd. Some programmers do, and are then surprised that their code fails on other wikis.
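In code terms, the point is roughly this (hypothetical helpers, not anything in grawitas):

    // Don't: relies on the even/odd convention, which custom wikis may break.
    bool looks_like_talk_ns(int ns) { return ns % 2 == 1; }

    // Do: compare against the explicit namespace number(s) you actually want,
    // e.g. 1 (NS_TALK) for article talk pages.
    bool is_article_talk_ns(int ns) { return ns == 1; }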

bencabrera commented 6 years ago

Hi amire80,

thanks for your input. As you already mentioned, the main problem I see is the loose talk page structure. Our main contribution in grawitas is a grammar (https://en.wikipedia.org/wiki/Parsing_expression_grammar) that is tailored to extract structured comments from English talk pages. I guess it should not be too hard to change the grammar to also accept other languages of a similar nature (e.g. Romance languages like Spanish, Italian, ..., or something like German). However, languages that have a very different syntax, like Hebrew, would be harder because the whole grammar would have to be rewritten. It's not out of the question, but it would take some time.
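To give an idea of the language-specific pieces involved, here is a rough sketch of a per-language token table that such a grammar could consume. This is purely illustrative, not the actual structure of the grammars in src/core/parsing/grammars/, and the Hebrew entries are guesses that would need checking against real Hebrew talk pages:

    #include <string>
    #include <vector>

    // Language-dependent tokens a talk page grammar needs, collected in one
    // place so that adding a language means filling a table rather than
    // rewriting every rule.
    struct TalkPageLanguage {
        std::string talk_prefix;              // page-title prefix, e.g. "Talk:" / "שיחה:"
        std::string user_prefix;              // user-link prefix, e.g. "User:" / "משתמש:"
        std::vector<std::string> month_names; // as they appear in signature timestamps
    };

    const TalkPageLanguage english{
        "Talk:", "User:",
        {"January", "February", "March", "April", "May", "June",
         "July", "August", "September", "October", "November", "December"}};

    const TalkPageLanguage hebrew{
        "שיחה:", "משתמש:",
        { /* Hebrew month names as used in signatures on he.wikipedia.org */ }};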

I cannot currently focus on adding more languages to grawitas. However, if someone wants to start, I would be happy to give some hints, provide feedback, etc.

Ben