RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Removal of specialized HTML literal handling? #2946

Open ashleysommer opened 1 month ago

ashleysommer commented 1 month ago

Possible easy solution for #2935 and #2945

The reason we forked html5lib to make html5lib-modern was because there is no new replacement for html5lib that provides the same XML-based HTML-tokenizing functionality that html5lib does. There's no alternative to move to.

Beautifulsoup4 is the logical replacement, but it includes html5lib in its dependency tree, which defeats the whole point.

But what if we just dropped that feature entirely? Why does RDFLib even want to be able to tokenize HTML Literals? The feature was added for a reason, but do we need to keep it?

Can we simply drop that feature, and treat HTML the same as any other string literal, and remove html5lib from our dependencies entirely?

floresbakker commented 1 month ago

I would say yes! It is difficult to understand what that library is doing and for what reason though.

At the moment, for every HTML element that I process within RDFlib, I get an error message resembling this (example for a Doctype node):

  File "C:\\Python\Python312\Lib\site-packages\html5lib\html5parser.py", line 247, in mainLoop
    new_token = phase.processDoctype(new_token)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\\Python\Python312\Lib\site-packages\html5lib\html5parser.py", line 417, in processDoctype
    self.parser.parseError("unexpected-doctype")
  File "C:\\Python\Python312\Lib\site-packages\html5lib\html5parser.py", line 322, in parseError
    raise ParseError(E[errorcode] % datavars)
html5lib.html5parser.ParseError: Unexpected DOCTYPE. Ignored.

This seriously delays processing of any HTML document, as every element has to undergo this treatment. I am trying to finish my work on the HTML vocabulary (see https://www.w3.org/community/htmlvoc/), and a proper open-source implementation of the HTML vocabulary using RDFlib/PyShacl has been number one on my list for more than a year. It would be awesome if you could fix this permanently. From your post I gather that you also do not expect any undesired effects from removing html5lib. I trust we can keep using the datatype rdf:HTML for html literals in our RDF/SPARQL?

ashleysommer commented 1 month ago

@floresbakker Thanks for your input on this.

After a quick meeting with some other rdflib maintainers yesterday, this is the plan we came up with:

1) Take my six-less fork of html5lib (called html5lib-modern) that's causing these new packaging errors, rename it to html5rdf, change its module name from html5lib to html5rdf to avoid aliasing, and bring it under the rdflib org umbrella.
2) Change usages of html5lib in rdflib to use html5rdf.
3) Make it an optional dependency again, gated behind the [html] extra.
4) Find the code paths that throw errors when an HTML Literal value is not a DOMFragment, and fix those so they work when html5rdf is not installed.
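Steps 3 and 4 of the plan describe the usual optional-dependency pattern. A minimal sketch of it follows, using only the standard library. It assumes that html5rdf keeps html5lib's top-level `parseFragment(text, treebuilder=...)` function under the new module name, which the thread does not confirm. The helper names `load_fragment_parser` and `html_literal_value` are hypothetical and are not part of RDFLib.

```python
import importlib
from typing import Any, Callable, Optional

# A callable that turns an HTML string into some parsed value.
FragmentParser = Callable[..., Any]


def load_fragment_parser(module_name: str = "html5rdf") -> Optional[FragmentParser]:
    """Return the module's parseFragment function, or None if it is unavailable.

    ASSUMPTION: html5rdf exposes parseFragment the way html5lib does.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The optional extra is not installed; callers fall back to plain strings.
        return None
    return getattr(module, "parseFragment", None)


def html_literal_value(lexical: str, parser: Optional[FragmentParser]) -> Any:
    """Map an rdf:HTML lexical form to a value, degrading to the raw string."""
    if parser is None:
        return lexical
    try:
        # "dom" asks html5lib-style parsers for an xml.dom tree.
        return parser(lexical, treebuilder="dom")
    except Exception:
        # A parse failure should not make the literal unusable.
        return lexical


parser = load_fragment_parser()
value = html_literal_value("<p>Hello</p>", parser)
print(type(value).__name__)
```

With this shape, code that handles the literal only ever sees either a parsed fragment or the original string, so the absence of the parser does not need its own error path.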

ashleysommer commented 1 month ago

I trust we can keep using the datatype rdf:HTML for html literals in our RDF/SPARQL?

Yes. When html5rdf support is disabled, or even if we remove the feature entirely, rdf:HTML literals will simply be treated as typed string literals, like any other typed string literal.
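To illustrate what "typed string literal" means at the serialization level, here is a small standard-library sketch that writes an rdf:HTML literal in N-Triples syntax. It does not use RDFLib; the function name `ntriples_literal` is hypothetical, and only the four characters that N-Triples requires to be escaped inside a quoted literal are handled.

```python
# IRI of the rdf:HTML datatype from the RDF namespace.
RDF_HTML = "http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML"

# Characters that must be escaped inside an N-Triples quoted literal.
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r"}


def ntriples_literal(lexical: str, datatype: str) -> str:
    """Serialize a lexical form plus datatype IRI as an N-Triples literal."""
    escaped = "".join(_ESCAPES.get(ch, ch) for ch in lexical)
    return f'"{escaped}"^^<{datatype}>'


# The markup is carried as an opaque string; no HTML parsing is involved.
print(ntriples_literal('<p class="note">Hello</p>', RDF_HTML))
```

The datatype IRI travels with the string, so the literal can still be written and matched by datatype even when nothing parses the markup.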

floresbakker commented 4 weeks ago

@floresbakker Thanks for your input on this.

After a quick meeting with some other rdflib maintainers yesterday, this is the plan we came up with:

  1. Take my six-less fork of html5lib (that's called html5lib-modern) that's causing these new packaging errors, rename it to html5rdf, change its module name from html5lib to html5rdf to avoid aliasing, bring it under the rdflib org umbrella.
  2. Change usages of html5lib in rdflib to use html5rdf.
  3. Make it an optional dependency again, gated behind the [html] extra.
  4. Find the code paths that throw errors when an HTML Literal value is not a DOMFragment, and fix those so they work when html4rdf is not installed.

It seems that in this plan, html5 is not really deprecated but adopted into rdflib and fixed? For extra information: the error message that I reported above was already present in the original html5lib before you made the html5lib-modern. Perhaps this helps in understanding the cause. I trust the html4rdf reference is a typo and should be html5rdf?

ashleysommer commented 3 weeks ago

It seems that in this plan, html5 is not really deprecated but adopted into rdflib and fixed?

Not necessarily. You can think of html5rdf as a new project, forked from html5lib specifically for use in the lexical-to-value mapping of rdf:HTML Literals as described in https://www.w3.org/TR/rdf11-concepts/#h3_section-html (it converts strings into domnodes, aka DocumentFragment objects in Python).
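As a rough illustration of that lexical-to-value mapping (string in, normalized DocumentFragment out), here is a sketch built only on the standard library. It is a deliberate simplification: `html.parser.HTMLParser` is not the HTML5 fragment parsing algorithm that html5lib implements, comments are ignored, and the names `FragmentBuilder` and `html_to_fragment` are hypothetical.

```python
from html.parser import HTMLParser
from xml.dom.minidom import Document, DocumentFragment

# Elements that never have content or an end tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class FragmentBuilder(HTMLParser):
    """Build a minidom DocumentFragment from HTML markup (simplified)."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.doc = Document()
        self.fragment = self.doc.createDocumentFragment()
        self.stack = [self.fragment]  # the fragment is always the root

    def handle_starttag(self, tag, attrs):
        element = self.doc.createElement(tag)
        for name, value in attrs:
            element.setAttribute(name, value or "")
        self.stack[-1].appendChild(element)
        if tag not in VOID_ELEMENTS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tagName == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        self.stack[-1].appendChild(self.doc.createTextNode(data))


def html_to_fragment(lexical: str) -> DocumentFragment:
    """Parse a lexical form and return the normalized fragment."""
    builder = FragmentBuilder()
    builder.feed(lexical)
    builder.close()
    builder.fragment.normalize()  # merge adjacent text nodes
    return builder.fragment


fragment = html_to_fragment("<p>Hello <b>world</b></p>")
print([node.nodeName for node in fragment.childNodes])
```

A spec-conformant parser handles far more cases (implied tags, misnested markup, foreign content), which is the kind of work html5lib does and that this sketch leaves out.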

It will be maintained by the RDFLib team for that purpose, for the use in RDFLib only.

As for the issue you described in your original post, I'm not seeing those errors in my testing. Are you able to send an example RDF file that reproduces them?

floresbakker commented 3 weeks ago

It seems that in this plan, html5 is not really deprecated but adopted into rdflib and fixed?

Not necessarily. You can think of html5rdf as a new project, forked from html5lib specifically for use in the lexical-to-value mapping of rdf:HTML Literals as described in https://www.w3.org/TR/rdf11-concepts/#h3_section-html (it converts strings into domnodes, aka DocumentFragment objects in Python).

It will be maintained by the RDFLib team for that purpose, for the use in RDFLib only.

As for the issue you described in your original post, I'm not seeing those errors in my testing. Are you able to send an example RDF file that reproduces them?

I tried reproducing the errors on the newest release, 7.1.1, from yesterday, but to my surprise I was unable to do so. That is good news for the htmlvoc project. I think I have only one remaining issue (unrelated to this discussion): being unable to process TriG files in RDFlib/PyShacl, for which I will work out a minimal working example. Thanks Ashley! There is a lot of movement within RDFlib/PyShacl, which is greatly appreciated.