appledora / mwparserfromhtml

An unofficial mirror of our repo of the `mwparserfromhtml` package, a Python library for working with the HTML dumps. Since this is only a mirror, DO NOT PR.
https://pypi.org/project/mwparserfromhtml/
MIT License

Add logging to indicate mismatch between HTML spec version and HTML dumps version #44

Open appledora opened 2 years ago

appledora commented 2 years ago

In GitLab by @geohci on Sep 20, 2022, 16:34

Our extraction logic is generally only correct for a given HTML spec version. For example, HTML 2.5 changed how different filetypes are identified in the DOM. Most if not all things should be stable from version to version (breaking changes should be rare), but it would probably be good for our code to have a hard-coded parameter for the HTML spec it was built for. That parameter would be compared against the HTML spec number in the article HTML to make sure they match, and the code could emit a warning on a mismatch so folks know there may be errors.
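
A rough sketch of what such a check could look like, using only the standard library. Everything named here is hypothetical and not part of `mwparserfromhtml`: the constant `SUPPORTED_HTML_SPEC`, the helpers `find_spec_version` and `check_spec_version`, and the logger name. It also assumes the article HTML carries its spec version in a `<meta property="mw:htmlVersion" content="...">` tag, and that comparing only the major and minor parts is enough; both would need confirming against the actual dumps.

```python
import logging
from html.parser import HTMLParser
from typing import Optional

logger = logging.getLogger("html_spec_check")  # hypothetical logger name

# Hypothetical hard-coded parameter: the HTML spec the extraction logic was built for.
SUPPORTED_HTML_SPEC = "2.5.0"


class _SpecVersionParser(HTMLParser):
    """Records the content of the first <meta property="mw:htmlVersion"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.version: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Assumption: the spec version is stored in this meta tag.
        if tag == "meta" and self.version is None:
            attr_map = dict(attrs)
            if attr_map.get("property") == "mw:htmlVersion":
                self.version = attr_map.get("content")


def find_spec_version(html: str) -> Optional[str]:
    """Return the spec version string found in the article HTML, or None."""
    parser = _SpecVersionParser()
    parser.feed(html)
    return parser.version


def check_spec_version(html: str, expected: str = SUPPORTED_HTML_SPEC) -> bool:
    """Return True if major.minor match; otherwise log a warning and return False."""
    found = find_spec_version(html)
    if found is None:
        logger.warning("No HTML spec version found; extraction may be unreliable.")
        return False
    if found.split(".")[:2] != expected.split(".")[:2]:
        logger.warning(
            "HTML spec mismatch: code built for %s, article HTML is %s; "
            "there may be extraction errors.",
            expected,
            found,
        )
        return False
    return True
```

Returning a boolean as well as logging lets a caller decide whether to continue, skip the article, or count mismatches across a dump, while the warning itself stays informational as the issue suggests.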