RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

`html5lib-modern` dependency introduces silent dependency conflict with packages requiring `html5lib` #2935

Open mgorny opened 4 weeks ago

mgorny commented 4 weeks ago

The html5lib-modern fork installs an html5lib package whose metadata is named html5lib-modern. As a result, pip does not recognize it as satisfying an html5lib dependency. If one installs rdflib and then another package requiring html5lib, pip will overwrite the html5lib Python package with html5lib == 1.1, while preserving the metadata claiming that html5lib-modern == 1.2 is installed.

To reproduce, you can create a fresh venv and try e.g.:

$ pip install -q rdflib pyspelling
$ pip list | grep html5lib
html5lib        1.1
html5lib-modern 1.2
$ python -c 'import html5lib; print(html5lib.__version__)'
1.1

ashleysommer commented 3 weeks ago

@mgorny Thanks for raising this. The old html5lib module is deprecated, and should not be used in new releases, hence the shift to html5lib-modern.

Indeed it is unfortunate that pip will overwrite it when you install an older version of html5lib. What else in your project needs the old version of html5lib?

Edit: I just noticed it's pyspelling.

The good news is it shouldn't make any difference to the operation of rdflib. The main difference between html5lib and html5lib-modern is the removal of the six dependency. But pyspelling reintroduces that module to your dependency tree anyway. So html5lib v1.1 should still work without issue, even with the RDFLib v7.1.0 release.

mgorny commented 3 weeks ago

I was just giving an example. I'm packaging for Gentoo, so html5lib is required by cppman, sigil, xml2rfc, beautifulsoup4, bleach, mechanize, pandas, pyspelling, weasyprint, hydrus, gpodder, buku, soupsieve, sphinxcontrib-htmlhelp, sphinxygen, textX, qtwebengine.

Unlike plain pip, Gentoo's package manager does not allow for conflicting files, so it is entirely impossible to install both packages. If you install html5lib, rdflib will fail because of a missing dependency. If you install html5lib-modern, everything else will fail.

Are you planning to maintain html5lib-modern going forward? If so, please request the package name transfer on PyPI (they've recently started processing them) and update the fork's metadata to clearly indicate it is a fork and where it is located.

ashleysommer commented 3 weeks ago

The reason for moving away from html5lib to html5lib-modern was actually for the benefit of distro packagers who were trying to package RDFLib. We had two different packagers reach out to ask us to remove the dependency on html5lib because of its subsequent dependency on six. The six library is deprecated and is no longer being included in a variety of distros. The main goal here is to get six out of the dependency tree of as many Python packages as possible. Thus, html5lib-modern was designed to be a completely drop-in replacement for html5lib, with no code changes needed in the user's Python code (that's why the module name aliases html5lib).

"Are you planning to maintain html5lib-modern going forward?"

I'm unsure of that at this stage. I created html5lib-modern specifically for the purpose of supporting rdflib, so packagers can continue to ship rdflib without the dependency on the deprecated html5lib and six. However, after I mentioned my fork over on the html5lib issue tracker, two distros (Nix, and one other) have already replaced html5lib with html5lib-modern in their package tree. It won't be seeing any enhancements or new features, but I will accept security patches if any are identified.

mgorny commented 3 weeks ago

I'm afraid it can't be a drop-in replacement for as long as it uses a different package name in its metadata. For distributions, this means that we either have to patch it to change the package name, thereby making it truly compatible, and start patching packages that specify html5lib-modern in install_requires, or go the other way around: leave the new name in, and patch all the old packages to expect html5lib-modern. Either way is suboptimal.

ashleysommer commented 3 weeks ago

I believe the other distros have removed old html5lib entirely, and done something like Install-Provides: html5lib when packaging html5lib-modern.

mgorny commented 3 weeks ago

Sure, distro package-level dependencies are not the problem. However, the Python package metadata is — and pip does not support any kind of name aliasing, so patching individual packages is the only possible solution.
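
To illustrate why the name matters: installers compare normalized project names, not the modules a project happens to install. The sketch below applies the PEP 503 normalization rule; `satisfies_by_name` is a hypothetical helper for illustration and ignores version specifiers entirely.

```python
import re


def normalize(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def satisfies_by_name(installed_name, required_name):
    """True only when both names refer to the same project after normalization."""
    return normalize(installed_name) == normalize(required_name)


# Spelling variants of one project collapse to the same name...
print(satisfies_by_name("html5lib_modern", "html5lib-modern"))  # True
# ...but a differently named fork never matches the original project.
print(satisfies_by_name("html5lib-modern", "html5lib"))  # False
```

Nothing in this comparison looks at the import name, which is why a fork published under a new project name cannot satisfy a requirement on the old one.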

ashleysommer commented 3 weeks ago

@mgorny See below for my solution to resolve this, but first I admit I'm confused about some aspects of the conversation above. I need you to explain the exact issue more clearly for me.

At first I thought you were describing a pip installation problem (you demonstrated that the package is overwritten with the old html5lib when both are in the dependency tree). That is understandable and expected; the package name was designed to alias.

Then you said it's actually a distro packaging issue, because Gentoo cannot have two different packages that install the same Python module (that's fair enough).

When I explained that other distros have completely replaced html5lib with html5lib-modern to provide the same dependency, you replied that it is not actually a distro packaging problem, but a pip problem.

"so patching individual packages is the only possible solution."

^ That is the part I don't understand. ^ If html5lib-modern provides the html5lib module, and the other packages rely on any package that provides html5lib, and the other packages use import html5lib in their code, what parts of which packages need to be rewritten?

The solution: I can easily update html5lib-modern codebase to use a different module name. I can publish it as html5lib-modern with version v2.0 to indicate a breaking change. (This plan is actually already mentioned in the html5lib-modern README.rst file.) I'll have to communicate that change to packagers who have already picked up html5lib-modern as a replacement for old html5lib, so they know how to deal with that.

mgorny commented 3 weeks ago

Ah, I'm sorry for the confusion. The problem is roughly that there are two layers to this.

First, there's the Python packaging layer, represented by .dist-info files. If you install html5lib, you get html5lib*.dist-info. Then Python software can find out whether html5lib is installed by doing e.g.:

>>> import importlib.metadata
>>> importlib.metadata.version("html5lib")
'1.1'

When you install html5lib-modern, you get html5lib_modern*.dist-info instead:

>>> import importlib.metadata
>>> importlib.metadata.version("html5lib-modern")
'1.2'
>>> importlib.metadata.version("html5lib")
Traceback (most recent call last):
  File "/usr/lib/python3.12/importlib/metadata/__init__.py", line 397, in from_name
    return next(cls.discover(name=name))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/importlib/metadata/__init__.py", line 889, in version
    return distribution(distribution_name).version
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/importlib/metadata/__init__.py", line 862, in distribution
    return Distribution.from_name(distribution_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/importlib/metadata/__init__.py", line 399, in from_name
    raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: No package metadata was found for html5lib

So note that programs written to expect html5lib won't work here. This will also trigger pip check integrity test failures, and it may cause entry points made using pkg_resources not to work (they actually verify installed dependencies, and refuse to run when they aren't found installed via metadata check).
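
A program that wants to tolerate either distribution would have to query both names itself. The following is a minimal sketch of that workaround, not something any of the affected packages is known to do; `first_installed_version` and its `lookup` parameter are hypothetical names.

```python
from importlib.metadata import PackageNotFoundError, version


def first_installed_version(candidates, lookup=version):
    """Return (name, version) for the first candidate that has metadata.

    Returns None when none of the candidate distributions is installed.
    `lookup` defaults to importlib.metadata.version and is injectable
    so the function can be exercised without a real environment.
    """
    for name in candidates:
        try:
            return name, lookup(name)
        except PackageNotFoundError:
            continue
    return None


# Result depends on which metadata is present in the current environment.
print(first_installed_version(["html5lib", "html5lib-modern"]))
```

Every consumer would need a change like this, which is the per-package patching burden described above.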

Now, plain pip installs aren't really affected at this point because pip doesn't currently check for file conflicts, i.e. it lets you install both packages simultaneously, with one overwriting the other, and keeps metadata for both. But as you can imagine, that's ugly and not something you should rely on (it may be fixed in the future, and if not in pip, then at least in uv, which is fast replacing pip).
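
The file conflict itself can be made visible by comparing what each distribution's RECORD lists. This is a rough sketch under the assumption that both distributions ship a RECORD file; `recorded_files` and `overlap` are hypothetical helper names.

```python
from importlib.metadata import PackageNotFoundError, files


def recorded_files(dist_name):
    """Return the set of paths listed for a distribution (empty if unknown)."""
    try:
        entries = files(dist_name)
    except PackageNotFoundError:
        return set()
    # files() returns None when the distribution has no file listing.
    return {str(path) for path in (entries or [])}


def overlap(files_a, files_b):
    """Return the sorted paths that appear in both collections."""
    return sorted(set(files_a) & set(files_b))


if __name__ == "__main__":
    shared = overlap(recorded_files("html5lib"), recorded_files("html5lib-modern"))
    print(len(shared), "files are claimed by both distributions")
```

A package manager that checks for file conflicts performs essentially this comparison before installing, and refuses when the overlap is non-empty.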


The second layer is distribution packages. The vast majority of distribution packaging solutions actually do check for file conflicts, and therefore don't permit installing html5lib and html5lib-modern packages simultaneously. Naturally, this also implies that the Python-level metadata for only one of these packages is installed.

So if we package html5lib plain, there's no html5lib_modern*.dist-info and packages having it in install_requires have unsatisfied dependencies. If we package html5lib-modern, the opposite applies. The only way here for us is to actually hack the package around to install fake metadata for both packages, and hope it works out somehow. But that's ugly.
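
For illustration only, this is roughly what such fake metadata amounts to: a second .dist-info directory next to the real one. The sketch writes into a temporary directory rather than a real site-packages; the version numbers and the `write_alias_metadata` and `visible_names` helpers are assumptions made for the example, not how any distribution actually does it.

```python
import tempfile
from importlib.metadata import distributions
from pathlib import Path


def write_alias_metadata(site_dir, name, version):
    """Create a minimal <name>-<version>.dist-info/METADATA under site_dir."""
    dir_name = name.replace("-", "_")  # dist-info directories use underscores
    info = Path(site_dir) / f"{dir_name}-{version}.dist-info"
    info.mkdir(parents=True)
    (info / "METADATA").write_text(
        f"Metadata-Version: 2.1\nName: {name}\nVersion: {version}\n",
        encoding="utf-8",
    )
    return info


def visible_names(site_dir):
    """List the project names importlib.metadata discovers in site_dir."""
    return sorted(d.metadata["Name"] for d in distributions(path=[str(site_dir)]))


with tempfile.TemporaryDirectory() as tmp:
    write_alias_metadata(tmp, "html5lib-modern", "1.2")
    write_alias_metadata(tmp, "html5lib", "1.1")  # hypothetical alias entry
    print(visible_names(tmp))
```

Both names then resolve through importlib.metadata, but only one set of files is really installed, so the metadata no longer describes what is on disk.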


"I can easily update html5lib-modern codebase to use a different module name. I can publish it as html5lib-modern with version v2.0 to indicate a breaking change. (This plan is actually already mentioned in the html5lib-modern README.rst file.)"

Well, that's one option. However, the disadvantage is that users would then have to keep both variants (i.e. two almost identical packages) installed for a long time (possibly forever, given that some of the packages needing html5lib are pretty much dead and unlikely to switch). It would also mean we'd have to keep patching html5lib, which is also suboptimal.

It would be much better if you took over the original name and published this package as plain html5lib. Or alternatively, find someone who would do that and incorporate your changes. I've seen some past discussion around that on the html5lib issue tracker, so there are certainly other people wanting to revive html5lib.

ashleysommer commented 3 weeks ago

@mgorny Thanks for the detailed explanation; that is clearer to me now. It didn't occur to me that programs would be checking the metadata of their installed dependencies like that, but I can see why it's done.

"It would be much better if you took over the original name and published this package as plain html5lib."

I am absolutely not going to do that. html5lib is used by millions of users across many thousands of projects; it has a long history and is one of the most popular Python libraries. I already maintain 5 different Python modules that take up too much of my time, and I'm not going to take on that heritage and that level of burden.

html5lib-modern was created as a drop-in replacement for html5lib, for RDFLib to use so we could get the v7.1.0 release out the door without six in our dependency tree, as requested by distro packagers.

You're right that users over on the html5lib issue tracker have been talking about reviving the project and releasing a new version for years, but there has been no movement on that yet.

Personally, I think a major dependent library (e.g. beautifulsoup4 or pandas) should take it over. It can't be good for them to have an unmaintained and deprecated library in their dependency tree.