FlominatorTM / wikiblame

http://wikipedia.ramselehof.de/wikiblame.php
GNU General Public License v3.0

HTML start_token is outdated #50

Closed: kidhanis closed this 3 months ago

kidhanis commented 5 months ago

I'm currently getting matches on JavaScript code inside HTML script tags on English Wikipedia. The cause is that $start_token inside chop_content() no longer matches the page markup: https://github.com/FlominatorTM/wikiblame/blob/64a254548d06d844ce435b58d039039e49abaeab/shared_inc/wiki_functions.inc.php#L318

The article content now starts with <div class="mw-content-ltr mw-parser-output", and pages in RTL scripts use <div class="mw-content-rtl mw-parser-output" instead.
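To illustrate the idea, here is a minimal sketch in Python (not the project's PHP) of a start token that accepts either text direction. The names START_TOKEN and chop_before_content are hypothetical and are not taken from wikiblame; only the two class strings come from the markup quoted above.

```python
import re

# Hypothetical pattern: matches the opening of the parser-output div
# for either text direction (ltr or rtl), as quoted in this issue.
START_TOKEN = re.compile(r'<div class="mw-content-(?:ltr|rtl) mw-parser-output"')


def chop_before_content(html: str) -> str:
    """Drop everything before the article content div.

    Returns the input unchanged when the token is not found, so a
    markup change is easy to notice instead of yielding an empty page.
    """
    match = START_TOKEN.search(html)
    return html[match.start():] if match else html
```

With this, anything preceding the content div, such as inline script blocks, is no longer searched. A fixed string per direction would also work; the regular expression just keeps both cases in one place.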

tacsipacsi commented 5 months ago

The most future-proof solution would be to use the API: https://en.wikipedia.org/w/api.php?action=parse&page=API&prop=text&disableeditsection=&formatversion=2 returns approximately the same result, including the removal of the [bearbeiten] (edit) links, and does so in all languages. Its output is also generally stable.
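A rough sketch of that request in Python, for illustration only (wikiblame itself is PHP). The query parameters are the ones from the URL above; format=json is an added assumption so the response is machine-readable, and the function names and the User-Agent string are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_parse_url(host: str, page: str) -> str:
    """Build an action=parse URL with the parameters suggested above."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "text",
        "disableeditsection": "",  # empty value, as in the suggested URL
        "formatversion": "2",
        "format": "json",  # assumption: not part of the URL in this thread
    }
    return f"https://{host}/w/api.php?" + urllib.parse.urlencode(params)


def extract_html(response: dict) -> str:
    """With formatversion=2, the rendered HTML is a plain string at parse.text."""
    return response["parse"]["text"]


def fetch_article_html(host: str, page: str) -> str:
    """Fetch the rendered article HTML (performs a network request)."""
    request = urllib.request.Request(
        build_parse_url(host, page),
        headers={"User-Agent": "example-sketch/0.1"},  # hypothetical value
    )
    with urllib.request.urlopen(request) as reply:
        return extract_html(json.load(reply))
```

Calling fetch_article_html("en.wikipedia.org", "API") would return only the parsed article body, so no start or end token is needed to cut it out of the full page.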

FlominatorTM commented 3 months ago

Thanks for the issue, @kidhanis, and for the suggestion, @tacsipacsi, which I implemented.