harcesz closed 1 year ago
I'm not sure, would have to add a unit test and see what's going on there.
For the first site, at least, the issue is that the OG tags have been placed in the body. The HTML parser expects them to be present in the head.
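To make the head-versus-body point concrete, here is a minimal sketch of a collector that accepts OG tags wherever they appear and records which ones were found outside the head. It uses only Python's standard `html.parser`; it is not Lemmy's code and not how the parser Lemmy relies on works. The names `OpenGraphCollector` and `extract_open_graph` are made up for this illustration.

```python
from html.parser import HTMLParser


class OpenGraphCollector(HTMLParser):
    """Collect <meta property="og:..."> tags from anywhere in the document."""

    def __init__(self):
        super().__init__()
        self.in_body = False      # flips to True once <body> has been seen
        self.tags = {}            # e.g. {"og:title": "..."}
        self.found_in_body = []   # properties that were not inside <head>

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
            return
        if tag != "meta":
            return
        attr = dict(attrs)
        # Some sites use name="og:..." instead of property="og:...".
        prop = attr.get("property") or attr.get("name") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None and prop not in self.tags:
            # Keep the first occurrence of each property.
            self.tags[prop] = content
            if self.in_body:
                self.found_in_body.append(prop)


def extract_open_graph(html):
    """Return (tags, misplaced) for an HTML string (hypothetical helper)."""
    parser = OpenGraphCollector()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.found_in_body
```

The `found_in_body` list is there so a caller could still log or report pages that put their tags in the wrong place while giving the user a usable title and image.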
dupe of #1858
This was closed as a supposed duplicate, but the problem persists, so I'm bringing it back. These are some of the biggest Polish-language news sites, and the failure makes the software look bad. A new, possibly related problem is the wrong charset on data scraped from https://www.tokfm.pl/ and some other sites.
@harcesz since these sites are outliers, could you inspect the pages and figure out if the OG tags are in the correct place?
If putting them in the head is the standard, then apparently not. But that misses the point from my perspective. Lemmy has to be able to pull them nevertheless; otherwise we end up with users concluding that "it works on Facebook, so it's Lemmy that's broken". I took a random link from one of them, https://oko.press/goworit-moskwa-10-uzasadnien-zniszczenia-mariupola/, and dropped it into 3 different OG validators, including Facebook's:
https://developers.facebook.com/tools/debug/?q=https%3A%2F%2Foko.press%2Fgoworit-moskwa-10-uzasadnien-zniszczenia-mariupola%2F https://opengraphcheck.com/result.php?url=https%3A%2F%2Foko.press%2Fgoworit-moskwa-10-uzasadnien-zniszczenia-mariupola%2F https://smallseotools.com/open-graph/
All of them pulled all the required tags from this link, while Lemmy can't even suggest the title and produces a 'blank' post. And that's for a page that relies on Facebook's de facto standard and does have the tags. Even for a non-standard website with no identifiable title or image, it would still be preferable for Lemmy to offer something to the publishing user.
On one hand I strongly agree that standards matter. Getting out of "anything goes tag soup" stage of Web took us a decade or so, and it was ugly.
On the other hand, I see the need for Lemmy to not be seen as "defective", to bring more users aboard.
I feel like there could be a compromise here. It could be something along these lines:
<body>
; but also
That way:
Maybe we could add a referrer tag of "get your sh*t together" for anyone reading the logs to pester the people responsible and not the end users?
Meanwhile, I went through tok.fm, and the main problem might be that it's a hot steaming pile of sh*t. Seriously, the head itself is >690 lines. But a <meta charset="ISO-8859-2"> statement is also there, and I'm not sure what Lemmy should actually do with such eccentric ideas, but it would be nice to display it without going full Wingdings.
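For the charset side, one possible approach is to honour whatever the page declares before falling back to UTF-8. The sketch below is a guess at a reasonable strategy, not what Lemmy or its parser does: it collects every charset declared in `<meta>` tags, tries them in order, and falls back to lossy UTF-8 if none decodes cleanly. `decode_html`, `header_charset` and `sniff_bytes` are hypothetical names.

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def decode_html(raw, header_charset=None, sniff_bytes=65536):
    """Decode raw page bytes, returning (text, charset_used)."""
    # A page may declare several charsets; keep them in document order.
    declared = [m.decode("ascii") for m in META_CHARSET.findall(raw[:sniff_bytes])]
    candidates = ([header_charset] if header_charset else []) + declared + ["utf-8"]
    for name in candidates:
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name, or bytes that are invalid in this charset.
            continue
    # Last resort: never fail, but mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

Single-byte charsets such as ISO-8859-2 accept any byte sequence, so a wrong declaration still decodes "successfully" into garbage; the order of the candidates is a judgment call.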
Just a note that Lemmy uses this Rust library to extract Open Graph / tag information.
https://github.com/orottier/webpage-rs
I can help with making PRs or issues to that repo after we've identified issues, but lemmy doesn't have its own custom tag fetcher.
Thanks. I commented there and expanded a bit on it. I would now guess our charset problem is related to the wrong tag identification (same network of sites, multiple charset tags).
Just to note - this problem still exists, no progress on the report there; https://github.com/orottier/webpage-rs/issues/8
We're seeing some websites that seem to have working OG information, but it's not being pulled properly into Lemmy, most notably:
other cases: