aarongustafson / jekyll-webmention_io

A Jekyll Plugin for rendering Webmentions via Webmention.io
https://aarongustafson.github.io/jekyll-webmention_io/
MIT License

Malformed URIs can cause the plugin to crash when sending webmentions #178

Closed jcolag closed 8 months ago

jcolag commented 8 months ago

Hi!

I recently added this gem to my blog. It went well for a couple of days: I connected it to Bridgy and saw actual Webmentions showing up on blog posts. So far, so good.

To test things end-to-end, I decided to use yesterday's post (on Indie Web protocols anyway) to send a Webmention to an earlier post. I dropped this into the existing Markdown post. Pardon the specifics.

<div class="h-entry">
  <a
    class="p-author h-card"
    href="https://john.colagioia.net"
    rel="author"
  >
    <img class="u-photo" src="{{ site.url }}/blog/assets/d29181b871b001b0.png" />
    John Colagioia
  </a>,
  in reply to:
  <a class="u-in-reply-to" href="{{ site.url }}{% post_url 2024-03-13-indieweb-1 %}">
    Trying on the Indie Web, Part 1
  </a>
  <p class="p-name e-content">
    I don't know how this will render on the destination side, but I have
    continued the work from this post at
    <a href="{{ site.url }}{% post_url 2024-03-20-indieweb-2 %}">
      {{ page.title }}
    </a>.
  </p>
  <a href="{{ site.url }}{% post_url 2024-03-20-indieweb-2 %}" class="u-url">
    <time class="dt-published" datetime="{{ page.date | date_to_xmlschema }}">
      {{ page.date | date_to_long_string }}
    </time>
  </a>
</div>

For a first pass, I figured that it would make sense to use Liquid plugins, to make this reasonably flexible, with the intent of creating a Webmention-reply plugin for myself, once I had things working.

Unfortunately, I got hit with the following error.

jekyll 4.3.2 | Error: the scheme http does not accept registry part: (or bad hostname?) /home/john/.rvm/rubies/ruby-3.1.2/lib/ruby/3.1.0/uri/generic.rb:207:in `initialize': the scheme http does not accept registry part: (or bad hostname?) (URI::InvalidURIError)

...And then a stack dump running through the RFC 2396 parser and this gem's uri_ok? function, webmention_io.rb:368.

Finding closed issue #163 - slightly different error, but the closest match that I could find - and not seeing my target URL in the cached file, I concluded that the Liquid bits must cause the problem, so I manually filled in the expected data, or at least a quick approximation of the published version.

<div class="h-entry">
  <a
    class="p-author h-card"
    href="https://john.colagioia.net"
    rel="author"
  >
    <img class="u-photo" src="https://john.colagioia.net/blog/assets/d29181b871b001b0.png" />
    John Colagioia
  </a>,
  in reply to:
  <a class="u-in-reply-to" href="https://john.colagioia.net/blog/2024/03/13/indieweb-1.html">
    Trying on the Indie Web, Part 1
  </a>
  <p class="p-name e-content">
    I don't know how this will render on the destination side, but I have
    continued the work from this post at
    <a href="https://john.colagioia.net/blog/2024/03/20/indieweb-2.html">
      Deeper in the Indie Web
    </a>.
  </p>
  <a href="https://john.colagioia.net/blog/2024/03/20/indieweb-2.html" class="u-url">
    <time class="dt-published" datetime="2024-03-20T07:30:05-0400">
      March 20, 2024, 7:30 AM
    </time>
  </a>
</div>

This seemed like the wrong direction, since it would make creating a plugin far more difficult. Ideally, I'd like to have posts send "likes" and "comments" in the post body, or mention when I've quoted someone else's work. But this at least seemed like a direction. And I believe that this result should work as a valid Webmention, from everything that I've seen. However, that gives me the same error.

This time, though, I can see this in the "outgoing" cache file, under the current post's URL.

  https://john.colagioia.net/blog/2024/03/13/indieweb-1.html: false
  https://john.colagioia.net/blog/2024/03/20/indieweb-2.html: false

This time, it seems to have picked up the URL, but still had the URL-parsing problem, for some reason.

Is there anything that I can do to make this work? Is there maybe some tag that already exists that I can use to send a one-off reply from within a post? Do I misunderstand the gem's model of Webmentions? That is, I'm thinking of my blog as a blog, where I can use social-media-like features on occasion.

If necessary, I could always hack together code to extract anything in the Webmention microformat, do the lookup, and send it separately, but I'd obviously rather not build all that if it should work here.

Thanks for any help!

fancypantalons commented 8 months ago

Hmm, would you mind enabling debug output for the plugin, à la:

webmentions:
  debug: true

And then run a site build followed by a call to the 'webmention' command, and then attach the output along with the full stack trace as a file?

Right now it's not obvious to me what the issue might be, so the extra debugging telemetry might (hopefully) give me a clue. If that doesn't help, I can try reproducing this myself; there's just no guarantee I'll be able to manage it!

I will say that URLs that are expanded from liquid tags unfortunately won't be automatically picked up by the plugin, as the webmention gathering phase is done against the source document, not the rendered result. For a normal post referencing a fixed embedded target URL, this is fine. What you're doing--sending a mention to yourself from one of your own posts and so wanting to emit the post URL via a liquid tag rather than "hard coding" it--is a bit of an unusual use case and unfortunately fixing it wouldn't be easy.

jcolag commented 8 months ago

Sure! I already have debugging set for the status updates while it cooks. The bad news is that I've got a thousand posts on a lot of topics, though, so there's a lot of repetition and a ton of links.

Build log (from JEKYLL_ENV=production bundle exec jekyll build): build.log

And the Webmentions log (from bundle exec jekyll webmention): webmention.log

I also have this in my _config.yaml, if that helps any.

webmentions:
  author:
    name: John Colagioia
    url: https://john.colagioia.net
    photo: /blog/assets/d29181b871b001b0.png
  debug: true
  throttle_lookups:
    last_week: daily
    last_month: weekly
    last_year: every 2 weeks
    older: monthly
  username: john.colagioia.net

I don't mind not being able to send from page-to-page, by the way. That was just an offhanded test without disturbing anybody else. But I do want to eventually have plugins where I can embed outgoing mentions in a post like the following.

{% wm_reply Title of Someone's Post|https://example.com/post-url.html|My reply to the post %}
{% wm_like Title of Different Post|https://sub.example.org/other-post.html %}

From your response, it sounds like that probably won't work, unfortunately, but I can probably hack something together to send them.

Actually, can I ask for an example usage, by the way? If I need to write code to handle the outbound Webmentions the way that I imagine them working, because your model looks nothing like mine, the expedient route might be for me to do that, yank out the webmention command, and close the ticket so that I'm not putting the burden on you for something that doesn't make any sense for your vision of the project.

fancypantalons commented 8 months ago

And of course there isn't enough debug logging in that area of code to see what the offending URL was.

Apologies, one more (hopefully last) thing: Could you attach your outgoing webmention cache? I can probably just trawl through there to find which URL is causing the problem.

As for what you're showing in that example, actually that might work because the URL is in the marked up text. The code just does a regex match against the document and anything URL-like that matches is queued up for a webmention.

If you were relying on the tag itself to emit the URL, that wouldn't work. In this case, though, it... should? :)
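
One way to check that expectation is to run the URL pattern quoted later in this thread against the example tags. This is only a sketch: the pattern is copied from the discussion below, the names `URL_PATTERN`, `source`, and `found` are illustrative, and the plugin's real scanning code may differ.

```ruby
# Pattern as quoted later in this thread; assumed here to be what the plugin
# uses when it scans a post's source for link targets.
URL_PATTERN = /(?:https?:)?\/\/[^\s)#\[\]{}<>%|\^"']+/

# Unrendered post source: two hypothetical tags with literal URLs, plus a
# link whose URL only exists after Liquid has rendered it.
source = <<~'LIQUID'
  {% wm_reply Title of Someone's Post|https://example.com/post-url.html|My reply to the post %}
  {% wm_like Title of Different Post|https://sub.example.org/other-post.html %}
  <a href="{{ site.url }}{% post_url 2024-03-13-indieweb-1 %}">Part 1</a>
LIQUID

found = source.scan(URL_PATTERN)
puts found
```

Both literal URLs are picked up, because the `|` separator and the space before `%}` end each match. The `post_url` link yields nothing, since the unrendered source contains no `//` for the pattern to latch onto.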

jcolag commented 8 months ago

Sure, thanks! webmention_io_outgoing.yml.txt

Anyway, that's good to hear. If it doesn't work or doesn't do what I want, I'll worry about it after the rest of this works.

fancypantalons commented 8 months ago

Ahh, progress! So here's all the offending URIs:

---
https://john.colagioia.net/software/2020/01/05/proglang.html:
- "//_"
- "//::=:::"
https://john.colagioia.net/media/2020/04/04/gitgeist.html:
- http://localhost:9005`
https://john.colagioia.net/2020/06/17/plugin.html:
- "//*/*"
- "//*/*`,"
- http://localhost:8080`.
https://john.colagioia.net/2020/07/15/vsconfig.html:
- "//="

These "URIs" break the plugin thanks to a regex that, to my eyes, looks kinda broken:

(?:https?:)?\/\/[^\s)#\[\]{}<>%|\^"']+

Notice that the scheme specifier before the delimiter is allowed to be blank! I have no idea why the regex would've been written that way, so I'm a bit hesitant to change it without thinking about it for a bit. But the result is it'll match expressions in your posts that aren't actually URIs.
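
The effect is easy to reproduce on a few lines of made-up text. In this sketch the pattern is the one quoted above, while the sample text and variable names are invented for illustration. One possible reason for the optional scheme is protocol-relative links such as `//cdn.example.com/lib.js`, though that is only a guess.

```ruby
# Pattern copied from the comment above.
URL_PATTERN = /(?:https?:)?\/\/[^\s)#\[\]{}<>%|\^"']+/

# Invented sample text: one real link, one code-like fragment, and one
# protocol-relative link.
sample = <<~TEXT
  See https://example.com/post.html for details.
  In this language a comment starts with //_ and runs to the end of the line.
  A protocol-relative link: //cdn.example.com/lib.js
TEXT

matches = sample.scan(URL_PATTERN)
puts matches
```

The scan returns three strings: the full `https` URL, the fragment `//_`, and the protocol-relative link. The middle one is the kind of false positive that ended up in the outgoing cache.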

But more importantly, a) the code checking those URIs should not just blow up during the queuing process, and b) those bad URIs should never make it into the outgoing cache in the first place.

So I'll throw some guard code in the plugin to toss these bad URIs earlier in the process and to do it without, you know, blowing up.
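
A guard of that kind could look roughly like the following. This is a minimal sketch and not the fix that was actually committed: `sendable_uri?` is a hypothetical helper name, and it assumes that only absolute `http`/`https` URIs with a host are worth queuing.

```ruby
require 'uri'

# Hypothetical helper, not the plugin's code: true only when the candidate
# parses as an absolute http(s) URI with a non-empty host. Parse errors are
# treated as "not sendable" instead of being allowed to propagate.
def sendable_uri?(candidate)
  uri = URI.parse(candidate)
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end

# Candidates taken from the offending entries listed above, plus one good URL.
candidates = ['https://example.com/post.html', '//_', '//*/*', 'http://localhost:8080`.']
good, bad = candidates.partition { |c| sendable_uri?(c) }

puts "queue: #{good.inspect}"
puts "skip:  #{bad.inspect}"
```

Filtering at collection time, before anything is written to the cache, would cover both points: nothing raises during queuing, and the bad strings are never stored.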

Meanwhile, you can go through the cache file and update those entries to replace the value "false" with "{}". e.g. Change this:

https://john.colagioia.net/software/2020/01/05/proglang.html:
  https://www.gnu.org/licenses/gpl-3.0.html: false
  https://en.wikipedia.org/wiki/Axel_Thue: false
  https://en.wikipedia.org/wiki/Thue_: false
  https://github.com/jcolag/Thue: false
  "//_": false
  "//::=:::": false

to this:

https://john.colagioia.net/software/2020/01/05/proglang.html:
  https://www.gnu.org/licenses/gpl-3.0.html: false
  https://en.wikipedia.org/wiki/Axel_Thue: false
  https://en.wikipedia.org/wiki/Thue_: false
  https://github.com/jcolag/Thue: false
  "//_": {}
  "//::=:::": {}

This tells the plugin a mention was sent for the URI and it'll be ignored in subsequent calls to the webmention command.

Here's a really ugly little Ruby script to test the file to validate your changes before you re-run (if all the URIs are scrubbed, it should print an empty YAML document):

require 'yaml'
require 'uri'

cache = YAML.load_file('webmention_io_outgoing.yml')
parser = URI::Parser.new
badpages = {}

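# Each top-level key is a source page; its value maps target URIs to a send status.
cache.each do |page, targets|
  # Collect every target that the URI parser rejects.
  baduris = targets.keys.select do |target|
    begin
      parser.parse(parser.escape(target))
      false
    rescue StandardError
      true
    end
  end

  # Only report pages that have at least one bad URI.
  badpages[page] = baduris unless baduris.empty?
end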

puts badpages.to_yaml

jcolag commented 8 months ago

Ah-ha! That looks like it solved the problem. Mind you, none of the pages that it caught show the Webmentions, but "[jekyll-webmention_io] 14 webmentions sent.", plus one that failed because their endpoint doesn't exist, sounds good.

It looks like I'll need to do something separate for the plugins that I mentioned for embedding an outbound mention, but that doesn't seem like too much of a burden.

Thanks for the help!

fancypantalons commented 8 months ago

You bet, thank you for the thorough report and your willingness to work with me and send over the logs I needed to diagnose the issue!

Meanwhile, I've committed a bug fix that addresses this issue that will be included in an anticipated 4.0.1, which I'll target for end of April in case anything else comes in.

Thanks again!