Pages with OG info not being pulled into Lemmy

harcesz commented 3 years ago

We're seeing some websites that seem to have working OG information, but it's not being pulled properly into Lemmy, most notably:

other cases:

https://outride.rs/

dessalines commented 3 years ago

I'm not sure, would have to add a unit test and see what's going on there.

rgroothuijsen commented 2 years ago

For the first site, at least, the issue is that the OG tags have been placed in the body. The HTML parser expects them to be present in the head.

dessalines commented 2 years ago

dupe of #1858

harcesz commented 2 years ago

This has been closed as a supposed duplicate, but the problem persists, so I'm bringing it back. These are some of the biggest Polish-language news sites, and that makes the software look bad. A new, possibly related problem is the wrong charset on data scraped from https://www.tokfm.pl/ and some other sites.

dessalines commented 2 years ago

@harcesz since these sites are outliers, could you inspect the pages and figure out if the OG tags are in the correct place?

harcesz commented 2 years ago

If putting them in the head is the standard, then apparently not. But that misses the point from my perspective. Lemmy has to be able to pull them nevertheless, otherwise we end up with users seeing that "it works on Facebook, so it's Lemmy that's broken". I took a random link from one of them, https://oko.press/goworit-moskwa-10-uzasadnien-zniszczenia-mariupola/, and dropped it into 3 different OG validators, including Facebook's:

https://developers.facebook.com/tools/debug/?q=https%3A%2F%2Foko.press%2Fgoworit-moskwa-10-uzasadnien-zniszczenia-mariupola%2F
https://opengraphcheck.com/result.php?url=https%3A%2F%2Foko.press%2Fgoworit-moskwa-10-uzasadnien-zniszczenia-mariupola%2F
https://smallseotools.com/open-graph/

All of them pulled all the required tags from this link, while Lemmy can't even suggest the title and gives a 'blank' post. And that's for a page that relies on Facebook's de facto standard and has the tags in place, not some non-standard website with no identifiable title or images, in which case it would still be preferable for Lemmy to offer something to the publishing user.

rysiekpl commented 2 years ago

On one hand, I strongly agree that standards matter. Getting out of the "anything goes tag soup" stage of the Web took us a decade or so, and it was ugly.

On the other hand, I see the need for Lemmy to not be seen as "defective", to bring more users aboard.

I feel like there could be a compromise here. It could be something along these lines:

extract the OG tags from <body> as well; but also
tag posts with such incorrect, non-standard, broken OG as "broken" or some such in a visible way.

That way:

Lemmy gets to not look bad compared to the walled gardens.
Site operators get very clear feedback: this is wrong, fix it.
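
The compromise above could be sketched roughly as follows. This is only an illustration in Python using the standard library, not Lemmy's or webpage-rs's actual code. The names `OGExtractor` and `extract_og` are hypothetical, and "in the body" is approximated as "after a <body> start tag has been seen".

```python
from html.parser import HTMLParser


class OGExtractor(HTMLParser):
    """Collect og:* meta tags anywhere in the page and note misplaced ones."""

    def __init__(self):
        super().__init__()
        self.in_body = False   # flips to True once a <body> start tag is seen
        self.tags = {}         # og property -> content (first occurrence wins)
        self.misplaced = []    # og properties that were found after <body>

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
            return
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        content = attrs.get("content")
        if prop.startswith("og:") and content is not None and prop not in self.tags:
            self.tags[prop] = content
            if self.in_body:
                self.misplaced.append(prop)


def extract_og(html):
    """Return (tags, broken); broken is True if any OG tag sat outside <head>."""
    parser = OGExtractor()
    parser.feed(html)
    parser.close()
    return parser.tags, bool(parser.misplaced)
```

A caller could then still prefill the title from `tags` and use the `broken` flag to decide whether to mark the post visibly.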

harcesz commented 2 years ago

Maybe we could add a referrer tag of "get your sh*t together" for anyone reading the logs to pester the people responsible and not the end users?

Meanwhile, I've gone through tok.fm and the main problem might be that it's a hot steaming pile of sh*t. Seriously, the head alone is >690 lines. It also has a <meta charset="ISO-8859-2"> declaration in there, and I'm not sure what Lemmy should actually do with such eccentric ideas, but it would be nice to display it without going full Wingdings.
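
For what it's worth, honoring a declared charset could look something like the sketch below. It is plain Python with a hypothetical `decode_html` helper, not what Lemmy or webpage-rs does. It assumes the raw response bytes are available, searches the whole document for the first charset declaration in a meta tag (since a head this long would defeat a short prescan window), and falls back to UTF-8.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def decode_html(raw, header_charset=None):
    """Decode page bytes: HTTP header charset first, then <meta>, then UTF-8."""
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    match = _META_CHARSET.search(raw)
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")

    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or bytes that do not fit; try the next
    # Last resort: never fail, replace undecodable bytes instead.
    return raw.decode("utf-8", errors="replace")
```

With a page served as ISO-8859-2 bytes and declaring that charset, the Polish diacritics come back intact instead of being misread as UTF-8.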

dessalines commented 2 years ago

Just a note that Lemmy uses this Rust library to extract Open Graph / tag information.

https://github.com/orottier/webpage-rs

I can help with making PRs or issues to that repo after we've identified issues, but lemmy doesn't have its own custom tag fetcher.

harcesz commented 2 years ago

Thanks. Commented there and expanded a bit on it. I would now guess our charset problem is related to the wrong tag identification (same network of sites, multiple charset tags).

harcesz commented 1 year ago

Just to note: this problem still exists, and there's been no progress on the report there: https://github.com/orottier/webpage-rs/issues/8

Nutomic commented 1 year ago

https://github.com/LemmyNet/lemmy/pull/3338

LemmyNet / lemmy

Pages with OG info not being pulled into Lemmy #1796