aaronpk / XRay

X-Ray returns structured data from any URL
https://xray.p3k.app
MIT License

img alt text is included in p-name #54

Closed: aaronpk closed this issue 6 years ago

aaronpk commented 6 years ago

This issue is part of #52.

The alt text of img tags is included in the parsed name value, and I can't remove it from there. As a result, when XRay tries to dedupe the content and name values, the name no longer matches the content, so XRay thinks this is a named post.

HTML

<html>
  <head>
    <title>Test</title>
  </head>
  <body class="h-entry">
    <p class="e-content p-name">This is a photo post with an <code>img</code> tag inside the content. <img class="u-photo" src="http://target.example.com/photo.jpg" alt="a photo"></p>
  </body>
</html>

mf2 json

{
    "type": [
        "h-entry"
    ],
    "properties": {
        "name": [
            "This is a photo post with an img tag inside the content. a photo"
        ],
        "photo": [
            "http://target.example.com/photo.jpg"
        ],
        "content": [
            {
                "html": "This is a photo post with an <code>img</code> tag inside the content. <img class=\"u-photo\" src=\"http://target.example.com/photo.jpg\" alt=\"a photo\">",
                "value": "This is a photo post with an img tag inside the content. a photo"
            }
        ]
    }
}

https://pin13.net/mf2/?id=20180112184043608

This would be solved by https://github.com/microformats/microformats2-parsing/issues/16

aaronpk commented 6 years ago

This is solved in 66adfbe2f8ea311d392674ac4067f5487544fe28: XRay now first uses the parsed mf2 to dedupe name/content, and only then does its own plaintext conversion of the HTML.
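
For anyone following along, here is a rough Python sketch of that two-step order. It is not XRay's actual code (XRay is PHP); the names `TextOnly`, `html_to_text`, and `simplify_entry` are made up for illustration, and the sketch passes the html value through unchanged rather than trying to reproduce how XRay handles it.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes only; attributes such as img alt are never read."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Hypothetical plaintext conversion that ignores img alt text."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered text
    # Collapse runs of whitespace and trim the ends
    return " ".join("".join(parser.chunks).split())


def normalize(text):
    return " ".join(text.split())


def simplify_entry(item):
    """Hypothetical simplification of one parsed mf2 h-entry."""
    props = item["properties"]
    content = props["content"][0]
    result = {"type": "entry"}

    # Step 1: dedupe using the parser's own values. Both name and
    # content.value include the alt text, so they compare equal here.
    name = props.get("name", [None])[0]
    if name is not None and normalize(name) != normalize(content["value"]):
        result["name"] = name

    if "photo" in props:
        result["photo"] = props["photo"]

    # Step 2: only now build the plaintext from the HTML, without alt text.
    result["content"] = {
        "text": html_to_text(content["html"]),
        "html": content["html"],
    }
    return result
```

Doing the comparison before the custom plaintext conversion is the point: comparing the alt-free text against the parsed name would fail to match, which is the original bug.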

aaronpk commented 6 years ago

Now parsed as:

{
    "type": "entry",
    "photo": [
        "http://target.example.com/photo.jpg"
    ],
    "content": {
        "text": "This is a photo post with an img tag inside the content.",
        "html": "This is a photo post with an <code>img</code> tag inside the content."
    }
}