Closed: sneumann closed this issue 2 years ago
Thanks for spotting this. It is impressive how leniently popular browsers handle broken HTML syntax. I fixed it in dev and rolled it out on the IPB MassBank, and I checked the result with https://validator.w3.org. Please note: xmllint is not really made for HTML. If you really want to scrape HTML with xmllint, use -html. Even in HTML mode xmllint complains a bit about some HTML5 tags and heavily about inline SVG, but I know of no better solution. My suggestion: use -html and redirect stderr to /dev/null.
wget -q -O- 'https://msbi.ipb-halle.de/MassBank/RecordDisplay?id=PB000123' | xmllint -html --xpath '//html/body/header' - 2> /dev/null
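To see the difference between the two parser modes without contacting the server, here is a minimal sketch. The sample page in `page` is made up for illustration and is not copied from MassBank; the variable names `xml_mode` and `html_mode` are likewise only placeholders. The flag is written `--html` here, which xmllint treats the same as the `-html` spelling used above.

```shell
# Hypothetical sample page (an assumption, not real MassBank output):
# HTML5 with a <meta> void element that has no closing slash.
page='<!DOCTYPE html><html><head><meta charset="UTF-8"><title>t</title></head><body><header>MassBank</header></body></html>'

# XML mode: the unclosed <meta> makes the input ill-formed XML,
# so xmllint exits with a non-zero status.
if printf '%s' "$page" | xmllint --xpath '//header' - >/dev/null 2>&1; then
  xml_mode=ok
else
  xml_mode=failed
fi

# HTML mode: the lenient HTML parser accepts the same input.
# Warnings about HTML5 tags go to stderr, which is discarded.
html_mode=$(printf '%s' "$page" | xmllint --html --xpath 'string(//header)' - 2>/dev/null || true)

echo "XML mode:  $xml_mode"
echo "HTML mode: $html_mode"
```

Using `string(//header)` returns only the text of the element; drop the `string()` wrapper if you want the serialized node, as in the command above.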
We have some broken HTML structure, which prevents some clients from scraping content.
I think the <meta charset="UTF-8"> should be <meta charset="UTF-8"/> (with a closing /).
Yours, Steffen