getnikola / nikola

A static website and blog generator
https://getnikola.com/
MIT License
2.62k stars 451 forks source link

WebP images in a gallery cause an unhandled exception during build #3671

Closed tartley closed 1 year ago

tartley commented 1 year ago

Environment

Python Version:: Python3.10.6 (and 3.11.0rc1 behaves the same) Nikola Version: v8.2.3 (and latest git clone of commit 483ffd8f Wed Mar 1 22:52:02 2023 behaves the same) Operating System: Linux Pop_OS 22.04 LTS (derived from Ubuntu 22.04)

Description:

Hi, good people! I am happy to look into contributing a PR to fix this myself, but wanted to discuss it first before wading in. Thanks!

Symptom

Adding a WebP image (with '.webp' extension) to my gallery causes nikola build to fail partway through, with a traceback (below). This happens even on a fresh site, created using nikola init -dq site.

After the failure, running nikola serve displays the gallery correctly, including all the WebP images, but the gallery RSS feed is empty. Because this is so close to working, I'm hopeful that a small fix might be possible.

I notice that galleries with an empty file named "image.webp" provoke this error, while gallery files named "image", with no extension, or "image.nonsense", are correctly ignored. So I infer that webp images are expected to work in galleries. (and the section below, "Any other image types we should handle?", seems to corroborate.)

The expected behavior is:

The Traceback

$ nikola build
Scanning posts........done!
.  render_galleries:output/galleries/demo/sp.webp
.  render_galleries:output/galleries/demo/index.html
.  render_galleries:output/galleries/demo/rss.xml
TaskError - taskid:render_galleries:output/galleries/demo/rss.xml
PythonAction Error
Traceback (most recent call last):
  File "/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/doit/action.py", line 461, in execute
    returned_value = self.py_callable(*self.args, **kwargs)
  File "/home/jhartley/Dropbox/jon/data/code/3rdparty/nikola/nikola/nikola/plugins/task/galleries.py", line 773, in gallery_rss
    data = rss_obj.to_xml(encoding='utf-8')
  File "/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/PyRSS2Gen.py", line 39, in to_xml
    self.write_xml(f, encoding)
  File "/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/PyRSS2Gen.py", line 34, in write_xml
    self.publish(handler)
  File "/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/PyRSS2Gen.py", line 380, in publish
    item.publish(handler)
  File "/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/PyRSS2Gen.py", line 440, in publish
    self.enclosure.publish(handler)
  File "/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/PyRSS2Gen.py", line 221, in publish
    _element(handler, "enclosure", None,
  File "/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/PyRSS2Gen.py", line 47, in _element
    handler.startElement(name, d)
  File "/usr/lib/python3.10/xml/sax/saxutils.py", line 170, in startElement
    self._write(' %s=%s' % (name, quoteattr(value)))
  File "/usr/lib/python3.10/xml/sax/saxutils.py", line 60, in quoteattr
    data = escape(data, entities)
  File "/usr/lib/python3.10/xml/sax/saxutils.py", line 27, in escape
    data = data.replace("&", "&")
AttributeError: 'NoneType' object has no attribute 'replace'

The problem stems from slightly earlier, nikola/plugins/task/galleries.py, line 753:

mimetypes.guess_type(img)[0]

Where 'img' is the filename of the WebP image, ending with extension '.webp'. .guess_type() doesn't recognize the MIME type of this, and returns (None, None). The first of those None values causes problems a little later in the code (see the traceback), where it is used in XML/RSS generation.

Analysis

The docs for the standard library function .guess_type() say that it recognizes the MIME types that are registered with IANA. They list 'image/webp' in a draft RFC, so this isn't yet an official MIME type.

Sure enough, 'webp' is not included in the files consulted by the Python mimetypes package, which can be listed using:

>>> mimetypes.knownfiles
['/etc/mime.types', '/etc/httpd/mime.types', '/etc/httpd/conf/mime.types', '/etc/apache/mime.types', '/etc/apache2/mime.types', '/usr/local/etc/httpd/conf/mime.types', '/usr/local/lib/netscape/mime.types', '/usr/local/etc/httpd/conf/mime.types', '/usr/local/etc/mime.types']

So it's not surprising that guess_type() returns None.

Possible Solutions

I know nothing, so these might be bad ideas. Each checklist item is described below.

Inject a hardcoded 'image/web' mimetype at runtime, before .guess_type() gets called

It sounds like the mimetypes module has ways to inject custom mimetypes into its internal list. We could add ".webp" -> "image/webp" before .guess_type() gets called.

Add a test of gallery RSS content

A new test of RSS content would be useful, currently I think this is only verified in the CI baseline tests. Once such a test exists in the Python test suite, it (or a minor variation on it) could be used to verify the proposed fallback for unknown mimetypes (see the next item below).

Be robust to files with unrecognized mime type

Regardless of whether we decide to support WebP images here or not, we should be robust to .guess_type() returning (None, None) for files with unrecognized mimetype. There must be a more graceful way to handle this than ending the build with an unhandled exception. But what should that be?

Check other image types we should handle.

Are there other images types that might work in a gallery but are impacted by this problem? If I'm going to look at this, I might as well systematically test them.

Although that did uncover some other problems, tracked there, none of the other image types break the build like this. They all have a recognized mimetype.

Review other calls to mimetype.guess_type()

The code contains three calls to mimetype.guess_type():

  1. nikola/plugins/task/galleries.py, line 753: The code under discussion, for producing RSS enclosures for images in a gallery.
  2. nikola/nikola.py, line 374: Produces RSS enclosures for posts.
  3. nikola/plugins/command/auto/__init__.py, line 574: This code is doing something else, not related to RSS as far as I can see.

Do calls (2) and (3) also need guarding against (None, None) return values?

Review whether "producing an RSS enclosure" is duplicated code.

Do calls (1) and (2) above represent duplicated code, to produce an RSS enclosure, that should be extracted and shared in a function somehow? On first glance I don't think so, but plan to review the idea and would be interested to hear any opinions.

Rejected ideas

Rejected idea: Call mimetypes.init()

My first glance at the mimetypes docs pages made me think we ought to be calling mimetypes.init() before calling .guess_type(). But later I noticed the docs says:

If the module has not been initialized, [functions] will call init() if they rely on the information init() sets up.

So we don't need to call .init() explicitly.

Rejected idea: Consider manually adding the webp mime type to my machine

I see StackOverflow questions about not-entirely-dissimilar problems in other software, which are also caused by the lack of a known MIME type for WebP images. A suggested solution there is to add the 'image/webp' MIME type to one of the 'knownfiles' listed above.

This sounds like a temporary workaround for users who hit this today. But editing these files presumably conflicts with the distributions management of them, which will have bad side effects (e.g. updates could overwrite the user's changes.)

Rejected idea: Consider automatically adding the webp mime type on Nikola install

Instead of suggesting to users that they change the contents of the 'knownfiles' listed above, we could do it automatically for them on Nikola install. This sounds like a bad idea which could cause problems for us in the future (e.g. when our changes are overwritten by an automated update of the changed file, leaving the user in a broken state again.)

Also, this would probably require admin/root privs, which most Nikola installs (hopefully) do not have.

Thanks! :heart:

Huge thanks for all the amazing Nikola goodness! I've been using and loving it for years!

Kwpolska commented 1 year ago

Thanks for this very detailed report!

I think we should do both of these:

  • Inject a hardcoded 'image/web[p]' mimetype at runtime, before .guess_type() gets called
  • Be robust to files with unrecognized mime type

As for this:

  • Review whether "producing an RSS enclosure" is duplicated code.

I think the best course of action would be to have a helper function that (a) ensures we have registered webp with the library, and (b) replaces None with application/octet-stream. since I suppose that would be enough.

Would you be willing to send a pull request with a patch to resolve this? (It can focus only on webp for now).

tartley commented 1 year ago

Yes, I already tried out some fixes very much like your recommendations (they work!) and plan to figure out how your tests work next, see how I should test the changes.

I'm also accumulating a small branch of superficial fixes, such as quashing deprecation warnings I see during the test run. I'll submit it as a separate PR. #3673

tartley commented 1 year ago

Observation: If I modify the gallery RSS generation code to return a valid but empty RSS file, then no tests fail. So I conclude this isn't actually tested (Ah, and I belatedly realize the coverage output after the tests confirms this), and my intent right now is to create a minimal new test for the gallery RSS generator.

Looking at a static site generator after so many years of looking at web API code for work is breaking my brain. I keep spending 30 seconds looking for the URL routing pointing to request handlers, before realizing, oh, hang on, that all doesn't exist here. :-)

tartley commented 1 year ago

Observation: Adding a .webp image to nikola/data/samplesite/galleries/demo makes many tests fail (e.g. some with the build exception described above, but others because sitemap.xml is not generated, etc). Adding the line mimetypes.add_type('image/webp', '.webp') to the gallery RSS generation fixes everything. Perhaps this is a sufficient test failure to justify making this one-line code change, thus saving me the work of having to write a new test for gallery rss generation. (or, at least, allowing us to decouple the tasks, so I could submit the extra test later, in a separate MP.)

Imma submit it, tell me what you think...

tartley commented 1 year ago

I don't want to sully your project with unclosed issues, so although there are a few checkbox items still left undone on this, I'm going to close it, because I don't think I'm going to get around to doing the last few items in the forseeable.

Thanks for your patience and guidance getting the things done I wanted to see fixed! Thanks for Nikola!