Open ashleysommer opened 1 month ago
I would say yes! It is difficult to understand what that library is doing and for what reason though.
At the moment I get for every HTML element that I am processing within RDFlib an error message that resembles something like this (example for Doctype node):
ile "C:\
This seriously delays processing of any HTML document as every element has to undergo this treatment. I am trying to finish my work on the HTML vocabulary (see https://www.w3.org/community/htmlvoc/) and a proper open source based implementation of the HTML vocabulary using RDFlib/PyShacl has been number one on my list for more than a year. Would be awesome if you could fix this permanently. From your post I gather that you also do not think there are any undesired effects of removing html5lib. I trust we can keep using the datatype rdf:HTML for html literals in our RDF/SPARQL?
@floresbakker Thanks for your input on this.
After a quick meeting with some other rdflib maintainers yesterday, this is the plan we came up with:
1) Take my six-less fork of html5lib (that's called html5lib-modern) that's causing these new packaging errors, rename it to html5rdf, change its module name from html5lib to html5rdf to avoid aliasing, and bring it under the rdflib org umbrella.
2) Change usages of html5lib in rdflib to use html5rdf.
3) Make it an optional dependency again, gated behind the [html] extra.
4) Find the code paths that throw errors when a HTML Literal value is not a DOMFragment, and fix those so they work when html5rdf is not installed.
I trust we can keep using the datatype rdf:HTML for html literals in our RDF/SPARQL?
Yes, when html5rdf support is disabled, or even if we remove the feature entirely, then rdf:HTML literals will simply be treated as a typed string literal, like any other typed string literal.
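Steps 3 and 4 of the plan, together with the fallback behaviour described above, could look roughly like the sketch below. This is not RDFLib's actual code: `html_support_available` and `html_literal_value` are hypothetical names, and the call to `parseFragment(..., treebuilder="dom")` assumes that html5rdf keeps the same entry point as html5lib, which the thread does not confirm.

```python
import importlib
import importlib.util


def html_support_available(module_name: str = "html5rdf") -> bool:
    """Return True when the optional HTML parser module can be found."""
    return importlib.util.find_spec(module_name) is not None


def html_literal_value(lexical: str, module_name: str = "html5rdf"):
    """Hypothetical helper: map an rdf:HTML lexical form to a Python value."""
    if not html_support_available(module_name):
        # Parser not installed: keep the literal as an ordinary typed string.
        return lexical
    parser = importlib.import_module(module_name)
    # Assumption: the fork keeps html5lib's parseFragment() signature.
    return parser.parseFragment(lexical, treebuilder="dom")
```

With this shape, code that consumes the value has to accept either a plain string or a parsed fragment, which is the kind of code path step 4 is about.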
@floresbakker Thanks for your input on this.
After a quick meeting with some other rdflib maintainers yesterday, this is the plan we came up with:
- Take my six-less fork of html5lib (that's called html5lib-modern) that's causing these new packaging errors, rename it to html5rdf, change its module name from html5lib to html5rdf to avoid aliasing, and bring it under the rdflib org umbrella.
- Change usages of html5lib in rdflib to use html5rdf.
- Make it an optional dependency again, gated behind the [html] extra.
- Find the code paths that throw errors when a HTML Literal value is not a DOMFragment, and fix those so they work when html4rdf is not installed.
It seems that in this plan, html5 is not really deprecated but adopted into rdflib and fixed? For extra information: the error message that I reported above was already present in the original html5lib before you made the html5lib-modern. Perhaps this helps in understanding the cause. I trust the html4rdf reference is a typo and should be html5rdf?
It seems that in this plan, html5 is not really deprecated but adopted into rdflib and fixed?
Not necessarily. You can think of html5rdf as a new project, forked from html5lib specifically for use in the lexical-to-value mapping of rdf:HTML Literals as described in https://www.w3.org/TR/rdf11-concepts/#h3_section-html (it converts strings into domnodes, aka DocumentFragment objects in Python).
It will be maintained by the RDFLib team for that purpose, for use in RDFLib only.
As for the issue you described in your original post, I'm not seeing those errors in my testing. Are you able to send an example RDF file that reproduces them?
It seems that in this plan, html5 is not really deprecated but adopted into rdflib and fixed?
Not necessarily. You can think of html5rdf as a new project, forked from html5lib specifically for use in the lexical-to-value mapping of rdf:HTML Literals as described in https://www.w3.org/TR/rdf11-concepts/#h3_section-html (it converts strings into domnodes, aka DocumentFragment objects in Python).
It will be maintained by the RDFLib team for that purpose, for use in RDFLib only.
As for the issue you described in your original post, I'm not seeing those errors in my testing. Are you able to send an example RDF file that reproduces them?
I tried reproducing the errors on the newest release 7.1.1 from yesterday, but I was to my surprise unable to do so. That is good news for the htmlvoc project. I think I have only one remaining (unrelated to this discussion) issue, being unable to process trig files in RDFlib/PyShacl, for which I will work out a minimal working example. Thanks Ashley! There is a lot of movement within RDFlib/PyShacl, which is greatly appreciated.
Possible easy solution for #2935 and #2945
The reason we forked html5lib to make html5lib-modern was because there is no new replacement for html5lib that provides the same XML-based HTML-tokenizing functionality that html5lib does. There's no alternative to move to.
Beautifulsoup4 is the logical replacement, but it includes html5lib in its dependency tree, so defeats the whole point.
But what if we just dropped that feature entirely? Why does RDFLib even want to be able to tokenize HTML Literals? The feature was added for a reason, but do we need to keep it?
Can we simply drop that feature, and treat HTML the same as any other string literal, and remove html5lib from our dependencies entirely?