ckruse / microformats2-elixir

Microformats2 parser in Elixir
MIT License

Nested Root Microformats Aren't Parsed Correctly #74

Closed zcdunn closed 10 months ago

zcdunn commented 10 months ago

On my site, I use h-cite for contexts to remote URLs (replies, reposts, etc.). I have them marked up as described here, where the u-* property of the h-entry is on the same element as the root h-cite and not directly on the link itself. It seems like this library is not getting the url property of the nested h-cite right. It's also incorrectly parsing the category of the outer h-entry as the category of the nested h-cite.

For this repost, you can see parse results below that show a repost-of property (at $.items[0].properties.repost-of[0]) with a nested h-cite that contains the correct url:

Parsed repost-of from pin13/unmung

{
    "type": [
        "h-cite"
    ],
    "properties": {
        "name": [
            "gilest.org: Make the indie web easier"
        ],
        "url": [
            "https://gilest.org/indie-easy.html"
        ]
    },
    "lang": "en-US",
    "value": "https://gilest.org/indie-easy.html"
}

Parsed repost-of from microformats2-elixir

%{
  "properties" => %{
    "category" => ["IndieWeb", "tech", "decentralization"],
    "name" => ["gilest.org: Make the indie web easier"],
    "url" => ["gilest.org: Make the indie web easier"]
  },
  "type" => ["h-cite"],
  "value" => "gilest.org: Make the indie web easier"
}

ckruse commented 10 months ago

Hi there,

thank you for the report and sorry that it took so long to answer. I'm a bit busy currently.

May I add this example to the repository as a test case?

zcdunn commented 10 months ago

Sure, you can add it to the repo. And there's no rush on this; thanks for taking the time to look at it.

ckruse commented 10 months ago

I think I found the problem… it looks like a bug in Floki, the HTML parser I use. The element name for this:

        <a
          href="https://gilest.org/indie-easy.html"
          class="u-url entry__link"
          itemprop="url"
        >

is an a\n instead of just a. I have to investigate further…

ckruse commented 10 months ago

Floki was innocent 🫣 it was my bug.

I fixed it and would like to release a new version. Would you mind checking whether it works for you now?

zcdunn commented 10 months ago

🎉 Thank you! That solved the url parsing.

It's still parsing the outer h-entry's category property as part of the inner h-cite. Should I open a separate issue for that?

ckruse commented 10 months ago

Sigh.

Both bugs are caused by a workaround for a bug in MochiWeb. It doesn't deal very well with whitespace, so I have to replace whitespace characters with their entities. And I do it within tags, too, which changes the parse tree of the document.

I have to rewrite whitespace handling…
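
To illustrate the failure mode described above, here is a minimal sketch in Python (not the library's Elixir code, and not its actual fix). It assumes the workaround replaced newlines, tabs, and carriage returns with numeric entities; the function names, the entity table, and the crude regex-based tag-name extraction are all hypothetical and only meant to show why escaping inside tags produces an element name like `a\n`.

```python
import re

# Assumption: which whitespace characters get escaped, and to which entities.
WHITESPACE_ENTITIES = {"\n": "&#10;", "\t": "&#9;", "\r": "&#13;"}


def escape_everywhere(html: str) -> str:
    """Replace whitespace with entities across the whole string, tags included."""
    return re.sub(r"[\n\t\r]", lambda m: WHITESPACE_ENTITIES[m.group()], html)


def escape_text_only(html: str) -> str:
    """Replace whitespace only in the text between tags; tags pass through as-is.

    The split is deliberately simplistic: it would mis-handle a literal ">"
    inside an attribute value, comments, or script content.
    """
    parts = re.split(r"(<[^>]*>)", html)
    return "".join(p if p.startswith("<") else escape_everywhere(p) for p in parts)


def tag_names(html: str) -> list[str]:
    """Crude stand-in for a tokenizer: the name is everything up to whitespace or '>'."""
    return re.findall(r"<([^\s>/]+)", html)


sample = (
    "<a\n"
    '  href="https://gilest.org/indie-easy.html"\n'
    '  class="u-url entry__link"\n'
    ">gilest.org:\nMake the indie web easier</a>"
)

# Escaping inside the tag glues the entity onto the element name.
print(tag_names(escape_everywhere(sample)))
# Escaping only text nodes leaves the element name intact.
print(tag_names(escape_text_only(sample)))
```

In the first case the newline after `<a` becomes `&#10;`, which no longer separates the element name from its attributes, so a tokenizer can read it as part of the name. Restricting the replacement to text between tags avoids that.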

ckruse commented 10 months ago

Can you check again? I completely overhauled whitespace handling, and your test case (and the old test cases) now work for me.

zcdunn commented 10 months ago

Yep, it's working for me now. Thank you!

ckruse commented 10 months ago

Cool :-) I just released v1.0.1.