aaronpk / XRay

X-Ray returns structured data from any URL
https://xray.p3k.app
MIT License
90 stars 15 forks source link

Sanitizing/correcting of (script) tag(s) #102

Closed swentel closed 3 years ago

swentel commented 3 years ago

It looks like there might be a problem when the 'script' tag is included in a tweet. It's an assumption at the moment, the problem might be at granary too.

This is the tweet: https://twitter.com/simonw/status/1341785945364811776 (got that because adactio retweeted it)

I'm using granary to generate an HTML version of my twitter followers, which is then parsed by XRay. When I tested it with adactio alone (and @self) and pasted the body into https://xray.p3k.app/, the output is pretty weird: every article is included into a single content property.

So, this might be a granary problem, although I think it's fine as far as I can see.

snarfed commented 3 years ago

Ruh roh, looks like this may be granary's fault after all. I see this in https://granary.io/twitter/@me/@all/@app/1341785945364811776?format=html&... :

<div class="e-content p-name">
  The "Let's use client-side JavaScript rendering for everything! A web page should be a single <script> tag!" crowd have held the developer marketing advantage for far too long
  </div>

Argh. I'll look into it.

snarfed commented 3 years ago

I deployed a fix to granary and checked XRay, it seems happy now. Feel free to retry!

swentel commented 3 years ago

@snarfed indeed, that fixes it.

However, the problem remains in Xray too - or maybe it's in mf2 parsing, too early in the morning now to debug :)

Payload underneath to test: I added an article before and after that tweet, with the unescaped characters again; if you paste that, you only get two items.

pinging @aaronpk as well.

<article class="h-entry">
  <span class="p-uid">tag:twitter.com:1341785945364811771</span>

  <time class="dt-published" datetime="2020-12-23T16:41:13+00:00">2020-12-23T16:41:13+00:00</time>

  <span class="p-author h-card">
    <data class="p-uid" value="tag:twitter.com:simonw"></data>
<data class="p-numeric-id" value="12497"></data>
    <a class="p-name u-url" href="https://twitter.com/simonw">Simon Willison</a>
<a class="u-url" href="https://simonwillison.net/"></a>
    <span class="p-nickname">simonw</span>
    <img class="u-photo" src="https://pbs.twimg.com/profile_images/378800000261649705/be9cc55e64014e6d7663c50d7cb9fc75.jpeg" alt="" />
  </span>

  <a class="u-url" href="https://twitter.com/simonw/status/1341785945364811771">https://twitter.com/simonw/status/1341785945364811771</a>
  <div class="e-content p-name">
  Some text before
  </div>

  <a class="u-in-reply-to" href="https://twitter.com/simonw/status/1341785130730287104"></a>
</article>

<article class="h-entry">
  <span class="p-uid">tag:twitter.com:1341785945364811776</span>

  <time class="dt-published" datetime="2020-12-23T16:41:13+00:00">2020-12-23T16:41:13+00:00</time>

  <span class="p-author h-card">
    <data class="p-uid" value="tag:twitter.com:simonw"></data>
<data class="p-numeric-id" value="12497"></data>
    <a class="p-name u-url" href="https://twitter.com/simonw">Simon Willison</a>
<a class="u-url" href="https://simonwillison.net/"></a>
    <span class="p-nickname">simonw</span>
    <img class="u-photo" src="https://pbs.twimg.com/profile_images/378800000261649705/be9cc55e64014e6d7663c50d7cb9fc75.jpeg" alt="" />
  </span>

  <a class="u-url" href="https://twitter.com/simonw/status/1341785945364811776">https://twitter.com/simonw/status/1341785945364811776</a>
  <div class="e-content p-name">

  The "Let's use client-side JavaScript rendering for everything! A web page should be a single <script> tag!" crowd have held the developer marketing advantage for far too long
  </div>

  <a class="u-in-reply-to" href="https://twitter.com/simonw/status/1341785130730287104"></a>
</article>

<article class="h-entry">
  <span class="p-uid">tag:twitter.com:1341785945364811772</span>

  <time class="dt-published" datetime="2020-12-23T16:41:13+00:00">2020-12-23T16:41:13+00:00</time>

  <span class="p-author h-card">
    <data class="p-uid" value="tag:twitter.com:simonw"></data>
<data class="p-numeric-id" value="12497"></data>
    <a class="p-name u-url" href="https://twitter.com/simonw">Simon Willison</a>
<a class="u-url" href="https://simonwillison.net/"></a>
    <span class="p-nickname">simonw</span>
    <img class="u-photo" src="https://pbs.twimg.com/profile_images/378800000261649705/be9cc55e64014e6d7663c50d7cb9fc75.jpeg" alt="" />
  </span>

  <a class="u-url" href="https://twitter.com/simonw/status/1341785945364811772">https://twitter.com/simonw/status/1341785945364811772</a>
  <div class="e-content p-name">
  Some text after
  </div>

  <a class="u-in-reply-to" href="https://twitter.com/simonw/status/1341785130730287104"></a>
</article>
swentel commented 3 years ago

So it's probably in the mf2 library, pasting that payload at http://pin13.net/mf2/ only returns 2 items. Looking at the result, I figure there's some sort of html correcting going on (you see a closing script tag), but this mixes things up.

pinging @Zegnat and @gRegorLove

swentel commented 3 years ago

Updated the title to reflect better what's going on.

gRegorLove commented 3 years ago

@swentel I'm not certain if the mf2 parser should be correcting HTML (or to what extent). With an unmatched script element it gets particularly tricky since the parsing algorithm involves removing them to parse some values. Happy to discuss further in a php-mf2 issue or #indieweb-dev. I think this XRay issue is complete, though?

swentel commented 3 years ago

True, nothing to do anymore here.