Open dnk8n opened 3 years ago
I was about to help out with a fix until I saw that the parsing is done with regex.
Why not parse with an XML parser? Since it is XML...
Clearly the wrong group is being captured.
If regex is really required, then this library could do with some simple tests that check the values it derives against the values obtained by parsing the same input with an XML parser.
I am unsure whether I caught an anomaly or whether it is consistently wrong.
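A minimal sketch of the kind of check I mean, using the standard library's `xml.etree.ElementTree`. The sample page is made up for illustration (only the page id and revision id are taken from this report; the title, contributor and text are placeholders), and it assumes a single `<page>` fragment without the XML namespace that a full dump declares on its root element. The regex is a deliberately naive one, not the library's actual pattern:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: a <page> element holds several <id> tags
# (page id, revision id, contributor id), which is the trap.
SAMPLE_PAGE = """<page>
  <title>Example</title>
  <ns>0</ns>
  <id>20460173</id>
  <revision>
    <id>987507844</id>
    <contributor><username>someone</username><id>42</id></contributor>
    <text>placeholder</text>
  </revision>
</page>"""


def page_and_revision_ids(page_xml):
    """Return (page id, revision id) as strings, using structural lookups."""
    page = ET.fromstring(page_xml)
    page_id = page.findtext("id")          # direct child of <page> only
    rev_id = page.findtext("revision/id")  # the <id> inside <revision>
    return page_id, rev_id


def naive_regex_ids(page_xml):
    """Every <id> value in document order; the nesting is lost."""
    return re.findall(r"<id>(\d+)</id>", page_xml)


if __name__ == "__main__":
    print(page_and_revision_ids(SAMPLE_PAGE))
    print(naive_regex_ids(SAMPLE_PAGE))
```

A test could then assert that whatever the regex-based code reports as `id` and `revid` equals the pair returned by `page_and_revision_ids` for the same input.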
Funnily enough, I fell into exactly the same trap while writing my own parsing of the wiki files... about to work it out. I don't use regex though, I use xml.sax... so my solution will not be applicable.
When parsing the following document, incorrect metadata is found.
revid should read 987507844 (link to the correct article: https://en.wikipedia.org/wiki?curid=20460173&oldid=987507844, which is current at the time of writing).
Once fixed, I would propose an option to include the revid in the generated links, so that in future the extracted data does not drift away from the article the link points to.
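Roughly what I have in mind for the link, as a sketch only. The function name `permalink` and its parameters are hypothetical and not part of the library; the URL shape is the one from the link above:

```python
from urllib.parse import urlencode

BASE_URL = "https://en.wikipedia.org/wiki"


def permalink(curid, revid=None):
    """Build an article link; with revid it pins the exact revision."""
    params = {"curid": curid}
    if revid is not None:
        params["oldid"] = revid  # oldid selects a specific revision
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(permalink(20460173))
    print(permalink(20460173, 987507844))
```

Without `revid` the link follows the latest revision, as it does today; with it, the link keeps pointing at the text that was actually extracted.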
Erroneous result (using the --json flag, not tested without):
Original XML: