extractus / article-extractor

To extract main article from given URL with Node.js
https://extractor-demos.pages.dev/article-extractor
MIT License
1.6k stars, 140 forks

chore: Improvements in handling LD+JSON data #400

Closed: andremacola closed this 1 month ago

andremacola commented 1 month ago

A real example:

For the URL https://natelinha.uol.com.br/famosos/2024/10/13/esposa-de-rodrigo-faro-vera-viel-compartilha-rotina-no-hospital-apos-cirurgia-217809.php, date extraction returns an empty value because the HTML only provides this information as ld+json metadata.

This is fixed with this PR.
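
To illustrate the kind of fallback described above, here is a minimal sketch and not the code from this PR. The helper names `parseLdJsonBlocks` and `extractPublishedDate`, the `metaDate` parameter, and the sample HTML are all hypothetical. The sketch assumes the page exposes the schema.org `datePublished` property inside an `application/ld+json` script block.

```javascript
// Hypothetical helper, not part of article-extractor's API.
// Collects every JSON object found in <script type="application/ld+json"> blocks.
function parseLdJsonBlocks(html) {
  const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi
  const nodes = []
  for (const match of html.matchAll(pattern)) {
    try {
      const data = JSON.parse(match[1])
      // A block may hold one object, an array of objects, or an @graph wrapper.
      if (Array.isArray(data)) {
        nodes.push(...data)
      } else if (data && Array.isArray(data['@graph'])) {
        nodes.push(...data['@graph'])
      } else {
        nodes.push(data)
      }
    } catch {
      // Skip malformed JSON rather than failing the whole extraction.
    }
  }
  return nodes
}

// Hypothetical helper: prefer a date already found in meta tags (metaDate),
// otherwise fall back to the first ld+json node that carries datePublished.
function extractPublishedDate(html, metaDate = '') {
  if (metaDate) return metaDate
  const node = parseLdJsonBlocks(html).find(
    (item) => item && typeof item.datePublished === 'string'
  )
  return node ? node.datePublished : ''
}

// Made-up sample page with no date meta tags, only ld+json.
const sampleHtml = `
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "NewsArticle", "datePublished": "2024-10-13T10:00:00-03:00"}
</script>
</head><body></body></html>`

console.log(extractPublishedDate(sampleHtml))
```

A regular expression is used here only to keep the sketch dependency-free. A real implementation would more likely read the script blocks through the DOM parser the library already uses.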

For the future: I'm thinking of implementing more complex logic for ld+json extraction. There are sites (like the one above) that include the real author name for the post, not the @sitename as in og:author.
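
One possible shape for that author logic, as a rough sketch only: `authorNames` and `pickAuthor` are hypothetical names, and the input is assumed to be already-parsed ld+json objects that follow the schema.org `author` convention (a string, an object with `name`, or an array of either).

```javascript
// Hypothetical helper: normalises the schema.org `author` field to a list of names.
function authorNames(author) {
  const list = Array.isArray(author) ? author : [author]
  return list
    .map((a) => (typeof a === 'string' ? a : a && a.name))
    .filter((name) => typeof name === 'string' && name.trim() !== '')
}

// Hypothetical helper: prefer a real author name from ld+json and fall back
// to the value taken from meta tags (for example an @sitename handle).
function pickAuthor(ldJsonNodes, metaAuthor = '') {
  for (const node of ldJsonNodes) {
    const names = authorNames(node && node.author)
    if (names.length > 0) return names.join(', ')
  }
  return metaAuthor
}

// Made-up data for illustration.
const nodes = [
  { '@type': 'NewsArticle', author: { '@type': 'Person', name: 'Jane Doe' } }
]
console.log(pickAuthor(nodes, '@examplesite'))
```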

coveralls commented 1 month ago

Pull Request Test Coverage Report for Build 11320262134

Totals Coverage Status
Change from base Build 8984156161: -0.2%
Covered Lines: 279
Relevant Lines: 281

💛 - Coveralls

ndaidong commented 1 month ago

@andremacola thank you for your contribution. I will check and merge this soon.