aaronpk / XRay

X-Ray returns structured data from any URL
https://xray.p3k.app
MIT License

JSON is returned with data: type: "unknown" #88

Closed leecalvink closed 5 years ago

leecalvink commented 5 years ago

Hello. I'm trying to send a webmention (for the first time) but I ran into some issues (URL not being found, no h-card parsing, etc.). If I try either XRay or other tools like https://indiewebify.me/, they fail to parse any of my microformats, despite them definitely being there.

I am using Jekyll hosted through Amazon Web Services. My only guess is that the HTML is being "sanitized" by XRay in such a way that the tags with the attributes aren't being read. Here is the current HTML in the body for the webmention (cleaned of personal details):

<main class="h-entry">
<span class="u-author h-card">
<img src="_my-image-url_" class="u-photo">
<a href="_site-url_" class="u-url p-name">_my-name_</a>
</span>
replying to: <a class="u-in-reply-to" href="_url-i'm-replying-to_">_that-authors-name_</a>
<p class="e-content">_my-comment_
</p>
<a href="_my-site-url_/replies/testing-webmentions/" class="u-url">
<time class="dt-published" datetime="2019-05-06T12:33:00-06:00">Mon, 06 May 2019 12:33:00 -0600</time>
</a>
</main>

Is the main tag not supported? My other guess is that the content-type header ('text/html') is the problem, because if I paste the plain HTML into XRay it reports source-format: "mf2+html" and returns everything else correctly.

aaronpk commented 5 years ago

I've seen this issue before, where various things aren't able to parse the HTML because it's returned gzipped. IIRC there is a config flag somewhere in Amazon to make it respect the accept headers or something, so that it only returns gzip data to clients that request it. I can't remember where that discussion last was, though.

XRay and mf2 don't care about what HTML tags you use; they only look at the classes.
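To illustrate the point about classes, here is a minimal sketch using only Python's standard library. It is not how XRay or any real mf2 parser is implemented; the names `Mf2ClassCollector` and `collect_mf2_classes` are made up for this example. It simply shows that scanning for microformats2 class prefixes gives the same result whether the root element is `main` or `div`.

```python
from html.parser import HTMLParser

# Class-name prefixes used by microformats2 (root, plain text, URL,
# embedded markup, datetime).
MF2_PREFIXES = ("h-", "p-", "u-", "e-", "dt-")


class Mf2ClassCollector(HTMLParser):
    """Hypothetical helper: records (tag, class) pairs for mf2-style classes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    if cls.startswith(MF2_PREFIXES):
                        self.found.append((tag, cls))


def collect_mf2_classes(html):
    """Return every (tag, class) pair whose class looks like an mf2 class."""
    parser = Mf2ClassCollector()
    parser.feed(html)
    parser.close()
    return parser.found


with_main = '<main class="h-entry"><p class="e-content">hi</p></main>'
with_div = '<div class="h-entry"><p class="e-content">hi</p></div>'

# Keep only the class names, dropping the tag they were found on.
main_classes = [cls for _, cls in collect_mf2_classes(with_main)]
div_classes = [cls for _, cls in collect_mf2_classes(with_div)]
print(main_classes == div_classes)
```

The tag name is recorded only for inspection; the class list is identical for both snippets, which is the behaviour described in the comment above.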

leecalvink commented 5 years ago

I have CloudFront set up but with auto-compression off, meaning it simply returns files from an S3 bucket, all of which are gzipped, with this metadata attached for HTML files (Content-Encoding: gzip, Content-Type: text/html; charset=UTF-8). The files don't have the .gz extension on them either, if that matters.

I'm not sure exactly what I should change to make webmentions work. If I understood your comment correctly, returning a gzipped file is what causes the problem, so should I NOT gzip the original/source file on S3, or only return gzipped content to clients that request it?

For reference for others: I guess this means setting CloudFront to auto-compress files without the Content-Encoding header, and on the S3/origin side to not "pre-gzip" the relevant HTML files?
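For anyone debugging the same setup, a rough way to tell whether a fetched body is still gzipped is to check for the gzip magic bytes. The sketch below is an assumption-laden illustration, not part of XRay: `looks_gzipped` and `readable_html` are hypothetical helper names, and `stored` stands in for a pre-gzipped S3 object.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(body: bytes) -> bool:
    """Return True if the raw body starts with the gzip magic number."""
    return body[:2] == GZIP_MAGIC


def readable_html(body: bytes) -> str:
    """Decompress the body if it is gzipped, then decode it as UTF-8."""
    if looks_gzipped(body):
        body = gzip.decompress(body)
    return body.decode("utf-8")


page = '<main class="h-entry"><p class="e-content">hello</p></main>'

# Stand-in for a pre-gzipped file: this is what a client receives when the
# origin always sends the compressed bytes.
stored = gzip.compress(page.encode("utf-8"))

print(looks_gzipped(stored))
print(readable_html(stored) == page)
```

A client that never decompresses the body would be handing the compressed bytes to its HTML parser, which would explain microformats going unnoticed.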

leecalvink commented 5 years ago

I uploaded a plain HTML file in place of the gzipped HTML, and XRay parses the file perfectly if I put in the URL path.

Edit: So it looks like it's a content negotiation thing. I need to not gzip the HTML myself, but rather let CloudFront do that work for me. It will then gzip the file and send the compressed version to user agents that send the Accept-Encoding: gzip header. Otherwise it will send the uncompressed HTML file, which is what we need in the case of webmentions.
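The negotiation described above can be sketched in a few lines of Python. This is a simplified model and not CloudFront's actual implementation: `accepts_gzip` and `build_response` are hypothetical names, and the header parsing ignores the `*` wildcard and other details of the HTTP specification.

```python
import gzip


def accepts_gzip(accept_encoding):
    """Return True if the Accept-Encoding value lists gzip with a non-zero q."""
    if not accept_encoding:
        return False
    for part in accept_encoding.split(","):
        token, _, params = part.strip().partition(";")
        if token.strip().lower() != "gzip":
            continue
        params = params.strip().lower()
        if params.startswith("q="):
            # "gzip;q=0" means the client explicitly refuses gzip.
            try:
                return float(params[2:]) > 0
            except ValueError:
                return False
        return True
    return False


def build_response(html, accept_encoding):
    """Compress the body only for clients that asked for gzip."""
    body = html.encode("utf-8")
    headers = {
        "Content-Type": "text/html; charset=UTF-8",
        "Vary": "Accept-Encoding",
    }
    if accepts_gzip(accept_encoding):
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    return headers, body


page = '<main class="h-entry"><p class="e-content">hello</p></main>'

# A browser-like client asks for gzip; a client that sends no
# Accept-Encoding header gets the plain HTML.
browser_headers, browser_body = build_response(page, "gzip, deflate, br")
plain_headers, plain_body = build_response(page, None)
print("Content-Encoding" in browser_headers, "Content-Encoding" in plain_headers)
```

The key design point is that the stored file stays uncompressed and the Content-Encoding header is added only when compression was actually applied for that client.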

Edit 2: According to some AWS docs, it looks like compression can be handled automatically. You need to tweak the CORS policy on your bucket to allow the Content-Length header. On CloudFront you need to turn automatic compression on, and then make sure you don't gzip your HTML yourself; let CloudFront do it. You also need to make sure Content-Encoding is not passed as a header on your HTML.

Relevant doc: https://aws.amazon.com/about-aws/whats-new/2015/12/cloudfront-supports-gzip/

A blog post about setting the CORS policy: https://christianoliff.com/blog/optimizing-cloudfront-performance