extractus / article-extractor

To extract main article from given URL with Node.js
https://extractor-demos.pages.dev/article-extractor
MIT License
1.56k stars · 134 forks

Incorrect resolution when there are multiple Open Graph tags #367

Open · SeriousBug opened 1 year ago

SeriousBug commented 1 year ago

The Open Graph protocol states that when a tag appears multiple times, the first one should be preferred in case of conflict. See here: https://ogp.me/#array

But article-extractor prefers the last tag. The code collects all the meta tags on the page and iterates over them, so if a tag appears more than once, the last occurrence overrides the earlier ones: https://github.com/extractus/article-extractor/blob/main/src/utils/extractMetaData.js#L98-L127
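
The override described above can be reproduced with a simplified sketch. This is not the code from `extractMetaData.js`; the `metaTags` array, its URLs, and the `collectLastWins` helper are hypothetical stand-ins for "iterate over every meta tag and assign its value".

```javascript
// Hypothetical meta tags in document order (illustrative data, not from a real page).
const metaTags = [
  { property: 'og:image', content: 'https://example.com/first.jpg' },
  { property: 'og:title', content: 'Example title' },
  { property: 'og:image', content: 'https://example.com/second.jpg' },
];

// Last-wins: each assignment overwrites whatever an earlier tag stored.
function collectLastWins(tags) {
  const entry = {};
  for (const { property, content } of tags) {
    entry[property] = content;
  }
  return entry;
}

const lastWins = collectLastWins(metaTags);
console.log(lastWins['og:image']); // https://example.com/second.jpg
```

With two `og:image` tags, the second value is the one that survives, which is the opposite of what the Open Graph page recommends.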

ndaidong commented 12 months ago

@SeriousBug yes. My concern is that if we collect all the values and push them into an array, the output becomes more complex and harder to keep consistent.

SeriousBug commented 12 months ago

Not grabbing all the tags in an array is absolutely a valid design decision. But I think article-extractor should still prefer to scrape the first image rather than the last one.
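
A minimal sketch of that first-wins behavior, keeping the output a single value per key rather than an array. The `sampleTags` data and the `collectFirstWins` helper are hypothetical and only illustrate the idea; the real extractor's data structures may differ.

```javascript
// Hypothetical meta tags in document order (illustrative data only).
const sampleTags = [
  { property: 'og:image', content: 'https://example.com/first.jpg' },
  { property: 'og:image', content: 'https://example.com/second.jpg' },
  { property: 'og:title', content: 'Example title' },
];

// First-wins: store a value only if the key has not been set yet.
function collectFirstWins(tags) {
  const entry = {};
  for (const { property, content } of tags) {
    if (!content) continue; // skip empty values so a later non-empty tag can still fill the key
    if (!Object.prototype.hasOwnProperty.call(entry, property)) {
      entry[property] = content;
    }
  }
  return entry;
}

const firstWins = collectFirstWins(sampleTags);
console.log(firstWins['og:image']); // https://example.com/first.jpg
```

The result is still one string per property, so the output shape stays as simple as before; only the choice of which duplicate to keep changes.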

ndaidong commented 12 months ago

@SeriousBug thank you for your suggestion. I will consider applying this in the next release.