Closed swentel closed 3 years ago
Ruh roh, looks like this may be granary's fault after all. I see this in https://granary.io/twitter/@me/@all/@app/1341785945364811776?format=html&... :
<div class="e-content p-name">
The "Let's use client-side JavaScript rendering for everything! A web page should be a single <script> tag!" crowd have held the developer marketing advantage for far too long
</div>
Argh. I'll look into it.
I deployed a fix to granary and checked XRay, it seems happy now. Feel free to retry!
@snarfed indeed, that fixes it.
However, the problem remains in Xray too - or maybe it's in mf2 parsing, too early in the morning now to debug :)
Payload underneath to test: I added an article before and after that tweet, with the unescaped characters again; if you paste that, you only get two items.
pinging @aaronpk as well.
<article class="h-entry">
<span class="p-uid">tag:twitter.com:1341785945364811771</span>
<time class="dt-published" datetime="2020-12-23T16:41:13+00:00">2020-12-23T16:41:13+00:00</time>
<span class="p-author h-card">
<data class="p-uid" value="tag:twitter.com:simonw"></data>
<data class="p-numeric-id" value="12497"></data>
<a class="p-name u-url" href="https://twitter.com/simonw">Simon Willison</a>
<a class="u-url" href="https://simonwillison.net/"></a>
<span class="p-nickname">simonw</span>
<img class="u-photo" src="https://pbs.twimg.com/profile_images/378800000261649705/be9cc55e64014e6d7663c50d7cb9fc75.jpeg" alt="" />
</span>
<a class="u-url" href="https://twitter.com/simonw/status/1341785945364811771">https://twitter.com/simonw/status/1341785945364811771</a>
<div class="e-content p-name">
Some text before
</div>
<a class="u-in-reply-to" href="https://twitter.com/simonw/status/1341785130730287104"></a>
</article>
<article class="h-entry">
<span class="p-uid">tag:twitter.com:1341785945364811776</span>
<time class="dt-published" datetime="2020-12-23T16:41:13+00:00">2020-12-23T16:41:13+00:00</time>
<span class="p-author h-card">
<data class="p-uid" value="tag:twitter.com:simonw"></data>
<data class="p-numeric-id" value="12497"></data>
<a class="p-name u-url" href="https://twitter.com/simonw">Simon Willison</a>
<a class="u-url" href="https://simonwillison.net/"></a>
<span class="p-nickname">simonw</span>
<img class="u-photo" src="https://pbs.twimg.com/profile_images/378800000261649705/be9cc55e64014e6d7663c50d7cb9fc75.jpeg" alt="" />
</span>
<a class="u-url" href="https://twitter.com/simonw/status/1341785945364811776">https://twitter.com/simonw/status/1341785945364811776</a>
<div class="e-content p-name">
The "Let's use client-side JavaScript rendering for everything! A web page should be a single <script> tag!" crowd have held the developer marketing advantage for far too long
</div>
<a class="u-in-reply-to" href="https://twitter.com/simonw/status/1341785130730287104"></a>
</article>
<article class="h-entry">
<span class="p-uid">tag:twitter.com:1341785945364811772</span>
<time class="dt-published" datetime="2020-12-23T16:41:13+00:00">2020-12-23T16:41:13+00:00</time>
<span class="p-author h-card">
<data class="p-uid" value="tag:twitter.com:simonw"></data>
<data class="p-numeric-id" value="12497"></data>
<a class="p-name u-url" href="https://twitter.com/simonw">Simon Willison</a>
<a class="u-url" href="https://simonwillison.net/"></a>
<span class="p-nickname">simonw</span>
<img class="u-photo" src="https://pbs.twimg.com/profile_images/378800000261649705/be9cc55e64014e6d7663c50d7cb9fc75.jpeg" alt="" />
</span>
<a class="u-url" href="https://twitter.com/simonw/status/1341785945364811772">https://twitter.com/simonw/status/1341785945364811772</a>
<div class="e-content p-name">
Some text after
</div>
<a class="u-in-reply-to" href="https://twitter.com/simonw/status/1341785130730287104"></a>
</article>
So it's probably in the mf2 library, pasting that payload at http://pin13.net/mf2/ only returns 2 items. Looking at the result, I figure there's some sort of html correcting going on (you see a closing script tag), but this mixes things up.
pinging @Zegnat and @gRegorLove
Updated the title to reflect better what's going on.
@swentel I'm not certain if the mf2 parser should be correcting HTML (or to what extent). With an unmatched script
element it gets particularly tricky since the parsing algorithm involves removing them to parse some values. Happy to discuss further in a php-mf2 issue or #indieweb-dev. I think this XRay issue is complete, though?
True, nothing to do anymore here.
It looks like there might be a problem when the 'script' tag is included in a tweet. It's an assumption at the moment, the problem might be at granary too.
This is the tweet: https://twitter.com/simonw/status/1341785945364811776 (got that because adactio retweeted it)
I'm using granary to generate an HTML version of my twitter followers, which is then parsed by XRay. When I tested it with adactio alone (and @self) and pasted the body into https://xray.p3k.app/, the output is pretty weird: every article is included into a single content property.
So, this might be a granary problem, although I think it's fine as far as I can see.