CXuesong / MwParserFromScratch

A basic .NET Library for parsing wikitext into AST.
Apache License 2.0

WikiText to HTML parser? #11

Closed: rraallvv closed 6 years ago

rraallvv commented 6 years ago

Can MwParserFromScratch be used in a WikiText to HTML parser? Thanks.

CXuesong commented 6 years ago

Nope. It's beyond the scope of this project. Due to the inherent mess in the syntax of wikitext caused by its history, the AST generated by this parser can only roughly represent the input wikitext. You would need completely different parsing logic to convert it into HTML. That is, only by replacing the various markups with HTML over and over again can you exactly simulate what the MediaWiki parser does.

My little parser is intended to be used by MediaWiki bots to analyze the structure of wikitext, so something like an AST is handy for that, and a natural way to parse an AST out of wikitext is to write a recursive descent parser by hand. However, recursive descent parsers work best for context-free grammars, and wikitext is obviously not context-free. So I did some heavy customization on it. Still, it cannot handle all input in a bullet-proof fashion (e.g. #1, #8).
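Roughly, that bot-style usage looks like this (a minimal sketch based on the library's README; exact member names may vary between versions):

```csharp
using System;
using System.Linq;
using MwParserFromScratch;
using MwParserFromScratch.Nodes;

class Demo
{
    static void Main()
    {
        // Parse a snippet of wikitext into an AST.
        var parser = new WikitextParser();
        Wikitext ast = parser.Parse("Hello, [[world]]! {{Stub}}");

        // Walk the tree, e.g. to list all template invocations.
        foreach (var template in ast.EnumDescendants().OfType<Template>())
            Console.WriteLine(template.Name);
    }
}
```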

Another source of trouble is template expansion (a.k.a. transclusion): the meaning of the very same token can vary dramatically with context. For example, the Test in

{{L}} Test {{R}}

is rendered as a link if Template:L is [[ and Template:R is ]]. But if Template:R is def, then the whole line is rendered as the plain text [[ Test def. If Template:L is {{ and Template:R is }}, then we get yet another template to expand ({{ Test }}). The point is, we cannot pass over the wikitext just once or twice and generate the final HTML; we need to parse it over and over again until all the templates have been expanded. Yes, we can do that, but even with all that parsing pain, we still cannot simulate exactly what MediaWiki outputs, because of the first problem I mentioned above.
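To make the "parse over and over" point concrete, here is a deliberately naive sketch; the template store and the regex are hypothetical simplifications (real transclusion also handles parameters, parser functions, and recursion limits):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class Expander
{
    // Hypothetical template store: template name -> replacement wikitext.
    static readonly Dictionary<string, string> Templates = new()
    {
        ["L"] = "[[",
        ["R"] = "]]",
    };

    // Naively matches {{Name}} with no parameters.
    static readonly Regex TemplateCall = new(@"\{\{\s*([^{}|]+?)\s*\}\}");

    public static string Expand(string wikitext)
    {
        // Keep substituting until no known template call remains.
        // Only after this fixpoint can link/table/etc. markup be interpreted.
        string previous;
        do
        {
            previous = wikitext;
            wikitext = TemplateCall.Replace(wikitext, m =>
                Templates.TryGetValue(m.Groups[1].Value, out var body)
                    ? body
                    : m.Value);
        } while (wikitext != previous);
        return wikitext;
    }
}
```

With the store above, Expander.Expand("{{L}} Test {{R}}") yields "[[ Test ]]", which only a later pass could then interpret as a link; change the stored bodies and the very same input turns into plain text or into yet another template call.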

If you, or whoever might be concerned, were to write a wikitext-to-HTML parser, I would suggest you throw recursive descent parsing away and just do what MediaWiki does, i.e. apply regex substitutions over and over again. If you just want to show a preview of a MediaWiki code snippet, use the MediaWiki render API and let it do the parsing job for you. As an alternative, you may look for HTML dumps of the WMF projects (though the dumps are rather dated now).
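For the preview route, the relevant endpoint is the action=parse module of the MediaWiki API; a minimal sketch (the wiki URL here is just an example):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class PreviewDemo
{
    static async Task Main()
    {
        using var http = new HttpClient();
        // Ask the wiki itself to render the snippet into HTML.
        var url = "https://en.wikipedia.org/w/api.php"
                + "?action=parse&format=json&contentmodel=wikitext"
                + "&text=" + Uri.EscapeDataString("{{L}} Test {{R}}");
        var json = await http.GetStringAsync(url);
        Console.WriteLine(json); // the rendered HTML is under parse.text in the response
    }
}
```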

rraallvv commented 6 years ago

@CXuesong , I've been trying to use the MediaWiki API to render a small definition page from WikiText, but just when I thought I had something working, I decided to try a different language XD

...well, for all the reasons you explained, it didn't turn out too well.

Thanks for the explanations, it really helped me clarify why everybody seems to agree on that one.

CXuesong commented 6 years ago

Well, in that case, I definitely suggest you use something like Markdown instead of wikitext…

rraallvv commented 6 years ago

@CXuesong, The problem with trying to render pages server-side with the MediaWiki API is that they get cluttered with useless stuff, for instance the table of contents and the links to edit each section (the ones that look like Some section [Edit]). Retrieving the content as raw WikiText appeared to be more manageable, as I said, until I tried to use the WikiText-to-HTML converter I had created for English on other languages. Since you suggested Markdown could be used instead, I tried looking for something in the API to get Markdown from the server side, but couldn't find anything related.

CXuesong commented 6 years ago

Okay, I got this wrong. Just ignore my last post. You are going to parse from wikitext anyway. In this case, you may just need to take a look at the disableeditsection and disabletoc parameters of the parse action.
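For example, something along these lines (wiki URL and page name are illustrative):

```
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Example&disableeditsection=1&disabletoc=1
```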

rraallvv commented 6 years ago

@CXuesong , Adding those parameters really does help strip out the content I don't want, thanks.