An unofficial mirror of our repo for the `mwparserfromhtml` package, a Python library for working with the HTML dumps. Since this is only a mirror, DO NOT open pull requests here.
Our extraction logic is generally only correct for a given HTML spec version. For example, HTML 2.5 changed how different filetypes are identified in the DOM. Most if not all things should be stable from version to version (breaking changes should be rare), but it would probably be good for our code to have a hard-coded parameter recording the HTML spec version it was built for. That parameter would be compared against the HTML spec version in the article HTML to make sure they match, perhaps emitting a warning on a mismatch so folks know there may be errors.
In GitLab by @geohci on Sep 20, 2022, 16:34
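The version check proposed above could look roughly like the sketch below. It is only an illustration, not code from `mwparserfromhtml`: the names `EXPECTED_HTML_VERSION`, `find_html_version`, and `check_html_version` are hypothetical, and it assumes the article HTML declares its spec version in a `<meta property="mw:htmlVersion" content="...">` tag. Comparing only the major and minor parts is also an assumption, made on the guess that patch-level bumps are unlikely to break extraction.

```python
import warnings
from html.parser import HTMLParser

# Hypothetical constant: the HTML spec version the extraction logic targets.
EXPECTED_HTML_VERSION = "2.5.0"


class _VersionFinder(HTMLParser):
    """Records the content of the first <meta property="mw:htmlVersion"> tag."""

    def __init__(self):
        super().__init__()
        self.version = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta .../> tags are also routed here by HTMLParser.
        if tag != "meta" or self.version is not None:
            return
        attr_map = dict(attrs)
        # Assumption: the spec version lives in this meta property.
        if attr_map.get("property") == "mw:htmlVersion":
            self.version = attr_map.get("content")


def find_html_version(article_html):
    """Return the declared HTML spec version string, or None if absent."""
    finder = _VersionFinder()
    finder.feed(article_html)
    return finder.version


def check_html_version(article_html, expected=EXPECTED_HTML_VERSION):
    """Return True if major.minor versions match; otherwise warn and return False."""
    found = find_html_version(article_html)
    if found is None:
        warnings.warn("No HTML spec version found in article HTML; "
                      "extraction results may be unreliable.")
        return False
    if found.split(".")[:2] != expected.split(".")[:2]:
        warnings.warn(f"Article HTML spec version {found} does not match "
                      f"expected version {expected}; there may be errors.")
        return False
    return True
```

A caller might run `check_html_version(article_html)` once per article before extraction and either continue with the warning or skip the article, depending on how strict the pipeline needs to be.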