bluesky-social / social-app

The Bluesky Social application for Web, iOS, and Android
https://bsky.app
MIT License
9.7k stars, 1.24k forks

Bluesky's composer should gather Open Graph data for PDFs and other non-HTML content too #1672

Open mlissner opened 1 year ago

mlissner commented 1 year ago

Right now, if you put a link ending in .html into the composer (on web) and ask it to generate a card, you can see a request to https://cardyb.bsky.app/v1/extract in the network panel.

For example, this URL triggers a request in the network panel when you ask for a card:

https://foo.com/foo.html

But these URLs, ending in .jpeg, .png, .xml, and .pdf, do not:

https://foo.com/foo.jpeg

https://foo.com/foo.png

https://foo.com/foo.xml

https://foo.com/foo.pdf

I understand the reasoning: in theory, those file extensions tell Bluesky that the content won't have Open Graph metadata, since that metadata only exists in HTML content.

That theory is correct, but at the website I run, we share millions of PDFs, and we have a neat hack in place to help fight misinformation and give our users better details. When we detect an Open Graph crawler, we redirect it to an HTML page with Open Graph data (if it's not a crawler, we serve the PDF). I know that DocumentCloud also uses this trick.
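A rough sketch of the server-side trick described above, for readers who haven't seen it. Everything here is illustrative: the User-Agent markers, the `routePdfRequest` helper, and the `/og-preview` path are hypothetical names, not the actual code running on any site mentioned in this issue.

```typescript
// Illustrative User-Agent substrings for link-preview crawlers.
// This list is an assumption for the sketch, not a real deployment's config.
const CRAWLER_MARKERS = ["twitterbot", "facebookexternalhit", "slackbot", "mastodon"];

function isOpenGraphCrawler(userAgent: string | undefined): boolean {
  if (!userAgent) return false;
  const ua = userAgent.toLowerCase();
  return CRAWLER_MARKERS.some((marker) => ua.indexOf(marker) !== -1);
}

type Decision =
  | { kind: "redirect"; location: string }
  | { kind: "serve-pdf"; path: string };

// Decide what to do with a request for a PDF: crawlers get redirected to an
// HTML page carrying Open Graph tags, everyone else gets the PDF itself.
function routePdfRequest(pdfPath: string, userAgent: string | undefined): Decision {
  if (isOpenGraphCrawler(userAgent)) {
    // "/og-preview" is a hypothetical route that renders the HTML page.
    return { kind: "redirect", location: "/og-preview?doc=" + encodeURIComponent(pdfPath) };
  }
  return { kind: "serve-pdf", path: pdfPath };
}
```

The catch is that this only helps if the crawler actually requests the `.pdf` URL in the first place.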

This works on Twitter, Facebook, Slack, Mastodon and a bunch of other sites. As far as I know, Bluesky is the only one where it doesn't work.

To Reproduce

  1. Paste this link into the web composer: https://foo.com/f.html

  2. Open the browser's network panel.

  3. Press the button in the composer to get the card.

  4. Note that the card fails with an error (the example link doesn't exist), but a request still appears in the network panel:

    [screenshot of the request in the network panel]

  5. Change the URL to https://foo.com/f.pdf

  6. Press the button in the composer to get the card.

  7. Note that it makes no request and throws no error.

Expected behavior

Bluesky should request the URL regardless of the file extension and check whether the response is actually HTML or a PDF. Heck, some horribly misconfigured website might end links with .pdf even when serving HTML. :)
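To make the expected behavior concrete, here is a minimal sketch of deciding by the response's `Content-Type` rather than by the URL's extension. The function names are hypothetical. I haven't found where the current check lives, so this isn't a patch against the real code, just the shape of the logic.

```typescript
// Returns true when a Content-Type header value denotes an HTML document.
// Parameters such as "; charset=utf-8" are ignored.
function isHtmlContentType(contentType: string | null): boolean {
  if (!contentType) return false;
  const mediaType = contentType.split(";")[0].trim().toLowerCase();
  return mediaType === "text/html" || mediaType === "application/xhtml+xml";
}

// Decide whether to attempt card extraction from the final response
// (after following redirects), ignoring whatever the URL ends in.
function shouldExtractCard(status: number, contentType: string | null): boolean {
  const ok = status >= 200 && status < 300;
  return ok && isHtmlContentType(contentType);
}
```

With this approach a URL ending in .pdf that redirects to an HTML page would still get a card, and a URL ending in .html that serves a PDF would not.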

Additional context

This bug is a bit of a bummer because one of the things that drove me to Bluesky is that Twitter removed headlines. This bug means that links from my website don't get cards or headlines here either. Darn!

I took a look around the code, but couldn't find where this is done. If somebody sends a pointer, we've got technical folks and volunteers who could help with this.

mlissner commented 1 year ago

Well, OK, there's a workaround if you're the one posting the link, but it's still broken for everybody that's not this clever.

You can substitute %2e for the last period in your link. This works:

https://storage.courtlistener.com/recap/gov.uscourts.flsd.648654/gov.uscourts.flsd.648654.3.0%2epdf

Not, um, exactly great, but it's something!
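For anyone who wants to automate the workaround, here is a small sketch. `encodeLastPeriod` is a hypothetical helper name; it only touches the final path segment and leaves the host, query string, and fragment alone.

```typescript
// Replace the last "." in the final path segment with "%2e" so the URL no
// longer appears to end in a file extension. Returns the URL unchanged when
// there is no path or no period in the final segment.
function encodeLastPeriod(rawUrl: string): string {
  const schemeEnd = rawUrl.indexOf("://");
  if (schemeEnd === -1) return rawUrl;

  // The path ends at the first "?" or "#", if any.
  const cut = rawUrl.search(/[?#]/);
  const end = cut === -1 ? rawUrl.length : cut;

  // The path starts at the first "/" after the host.
  const pathStart = rawUrl.indexOf("/", schemeEnd + 3);
  if (pathStart === -1 || pathStart >= end) return rawUrl;

  const lastSlash = rawUrl.lastIndexOf("/", end - 1);
  const lastDot = rawUrl.lastIndexOf(".", end - 1);
  if (lastDot < lastSlash) return rawUrl; // no period in the final segment

  return rawUrl.slice(0, lastDot) + "%2e" + rawUrl.slice(lastDot + 1);
}
```

Applied to the link above in its original `.pdf` form, this produces the `%2epdf` version.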

mlissner commented 8 months ago

One other thought here. It isn't part of Open Graph, but I've always thought it would be nice to serve Open Graph data via HTTP headers. In fact, I think Facebook must have gotten distracted while building the spec, and just didn't get around to this.

If Bluesky supported this one day, it'd make it possible to return detailed information and thumbnails when serving binary content.

(I've been banging this drum for a decade or so.)
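Purely as an illustration of the idea, a consumer could map response headers to Open Graph properties something like this. To be clear, the `X-OG-` prefix is invented for this sketch: no such convention exists in the Open Graph protocol or, as far as I know, anywhere else.

```typescript
// Hypothetical header prefix; NOT part of the Open Graph protocol.
const OG_HEADER_PREFIX = "x-og-";

// Collect headers such as "X-OG-Title" into Open Graph-style properties
// such as "og:title". Header names are matched case-insensitively.
function openGraphFromHeaders(headers: Record<string, string>): Record<string, string> {
  const properties: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    const lower = name.toLowerCase();
    if (lower.indexOf(OG_HEADER_PREFIX) === 0) {
      properties["og:" + lower.slice(OG_HEADER_PREFIX.length)] = headers[name];
    }
  }
  return properties;
}
```

A PDF response could then carry a title and thumbnail URL alongside `Content-Type: application/pdf`, with no HTML page involved.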