adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Issue with multiple authors and preference for meta information #92

Closed felipehertzer closed 2 years ago

felipehertzer commented 3 years ago

We shouldn't trust the schema Person data:

agenda Current: "author": "Sandy Cheu", Should be: "author": "Stephen Teulan; Nikita Weikhardt",

aged Current: "author":"Consumers", Should be: "author": "Liz Alderslade",

Meta extraction removes single names:

cath Current: "author": null, Should be: "author": "Rebecca",

echo Current: "author": null, Should be: "author": "Katie",

adbar commented 3 years ago

Hi @felipehertzer, thank you for your feedback! Let's examine your pull request #91 first and then move on to this list.

adbar commented 3 years ago

@felipehertzer Did you solve part of the problems above in the PR? Do you have other examples where author extraction fails?

felipehertzer commented 3 years ago

@adbar These sites either do not contain a meta tag or have HTML that is too confusing to extract via XPath. However, they use JavaScript, which I believe belongs to a framework that fills in the data on the front end.

Perthnow -> "byline":{"text":"Finn McHugh"} This code is found inside a simple script tag.

ESPN -> "articles":[{"id":31807952,"author":"Andrew McGlashan","trackingName":"&lpos=:31807952"}, This code is found inside a simple script tag. I'm not sure, but maybe we can search for these tags in javascript

Discovery Channel -> "author":"Ian Shive" This code is found inside an application/ld+json script tag, but the author format is different from what our regex expects. I tried to solve the problem by adding the pattern "author?".+?"([^"]+) to the existing regex, but this ends up creating a new group and the current code only expects group 1.
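The group problem described above can be avoided with non-capturing groups, so that the name always lands in group 1. The following is only a minimal sketch: the pattern `AUTHOR_RE` and the helper `find_author` are hypothetical names for illustration, not the regex used in trafilatura, and only the first author of an array is returned.

```python
import re

# Hypothetical pattern, not the one used in trafilatura.
# (?:...) groups do not capture, so the author name is always group 1,
# whether "author" is a plain string or an object with a "name" key.
AUTHOR_RE = re.compile(
    r'"author"\s*:\s*(?:\[\s*)?(?:\{[^{}]*?"name"\s*:\s*)?"([^"]+)"'
)


def find_author(text):
    """Return the first author name found in a JSON-like string, or None."""
    match = AUTHOR_RE.search(text)
    return match.group(1) if match else None


print(find_author('{"author":"Ian Shive"}'))
print(find_author('{"author": {"@type": "Person", "name": "Finn McHugh"}}'))
```

The same pattern also matches inline script data such as the ESPN snippet, since it does not depend on the surrounding script tag type.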

I believe there are a few more websites; I will update the description of this issue later today.

felipehertzer commented 3 years ago

I've adapted your regex to capture the author's name in more formats, but I'm having a problem with testing, specifically at [National Geographic](https://www.nationalgeographic.co.uk/environment-and-conservation/2020/01/ravenous-wild-goats-ruled-island-over-century-now-its-being): we would have to check whether the JSON has information about the author or the photographer.

adbar commented 3 years ago

@felipehertzer Thanks for the details, these cases look tricky indeed.

The author strings could be further refined: text parts such as "written by" or "by" could be stripped.
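Stripping such prefixes could look roughly like the sketch below. The names `PREFIX_RE` and `strip_author_prefix` are made up for illustration, and the prefix list is an assumption limited to the two examples mentioned here.

```python
import re

# Hypothetical prefix pattern covering only "written by" and "by".
# The \b word boundary keeps names like "Byron" intact.
PREFIX_RE = re.compile(r'^\s*(?:written\s+by|by)\b[\s:]*', re.IGNORECASE)


def strip_author_prefix(author):
    """Remove a leading "written by" or "by" from an author string."""
    return PREFIX_RE.sub('', author).strip()


print(strip_author_prefix("Written by Liz Alderslade"))
print(strip_author_prefix("Byron Smith"))
```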

felipehertzer commented 3 years ago

Hi @adbar, thank you very much for the corrections. Some of the sites on my list were fixed by this commit. I found a few more sites and will update the list.

In the next commit I will add the option to remove these prefixes.

felipehertzer commented 3 years ago

Hi @adbar, I don't know what the best way would be to handle sites whose Person schema differs from what is shown on the screen. Do you have any ideas on how to fix this? The problem can be seen on the first 3 sites.

We shouldn't trust the schema Person data:

Aap Current: "author": "Watermark feed", Should be: "author": "Aaron Bunch",

agenda Current: "author": "Sandy Cheu", Should be: "author": "Stephen Teulan; Nikita Weikhardt",

angelicans Current: "author": "Sydneyanglicans net", Should be: "author": "Hannah Thiem",

Problem: it is using a relative path:

Sen Current: None Should be: "author": "Andrew Mcglashan",

Armida Current: "author": "Bob Freebairn", Should be: "author": "Mark Griggs",

It is picking up the comment author:

Ausgamers Current: "author": "Sean", Should be: "author": "KostaAndreadis",

These two have only the given name. Should we only allow full names?

cath Current: "author": null, Should be: "author": "Rebecca",

echo Current: "author": null, Should be: "author": "Katie",

adbar commented 3 years ago

@felipehertzer I made a mess while merging PR #101 but everything should be fine now.

The tests still don't pass... I changed the CI system because tests weren't getting run anymore, maybe something slipped through while you were working on the PR.

The JSON metadata analysis could also be useful to extract titles, do you have any idea on how to do that?

felipehertzer commented 3 years ago

Hi @adbar, haha, it really was a lot of corrections. I fixed an 'if' that was incorrect.

This reduces the errors; only Python 3.10 and Windows still have a problem. According to the log, the error is "Error: Please make sure the libxml2 and libxslt development packages are installed."

Do you mean the titles of the articles, or some other kind?

adbar commented 3 years ago

Hi @felipehertzer, thanks! As long as the tests pass for Python 3.6 to 3.9 on Linux it's fine.

The article schema makes it possible to define a headline: https://schema.org/Article. Using it could be faster and more accurate than looking for title information throughout the article. This also applies to other metadata if you're interested.
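As a rough illustration of the idea, the headline can be read from a JSON-LD block with the standard library alone. This is a sketch under assumptions, not the code in trafilatura: `LD_JSON_RE` and `extract_headline` are hypothetical names, and nested structures such as `@graph` are not handled.

```python
import json
import re

# Hypothetical helper pattern: grabs the body of application/ld+json scripts.
LD_JSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_headline(html):
    """Return the first schema.org headline found in JSON-LD, or None."""
    for block in LD_JSON_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of failing
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            headline = item.get("headline")
            if isinstance(headline, str) and headline.strip():
                return headline.strip()
    return None


page = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "NewsArticle", "headline": " Goats ruled an island "}'
    '</script></head></html>'
)
print(extract_headline(page))
```

Parsing the block as JSON rather than matching it with a regex avoids the capture-group issues discussed earlier, at the cost of failing on malformed JSON.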

adbar commented 3 years ago

Hi @felipehertzer, I just saw you already implemented the headline feature, thanks! Are there other metadata from schema.org on your To Do List?

felipehertzer commented 3 years ago

Hi @adbar

We can get the items below as extras, and maybe an array with all the images as well.

Organization/Url
NewsArticle/mainEntityOfPage/WebPage - URL of Page
NewsArticle/keywords
NewsArticle/datePublished
NewsArticle/articleBody

There is a lot of other information, like logo, telephone, Facebook, address and sometimes job titles...

For now, in my to do list:

It is the most urgent thing to fix.

adbar commented 3 years ago

NewsArticle/datePublished is handled by htmldate (on a regex basis) and NewsArticle/articleBody is already present in the code, also on a regex basis. The latter could indeed be improved. I'm moving the other ones to the JSON-LD thread (#99) so we can focus on the authors here.

What do you mean by a blacklist of URLs? There is already a url_blacklist argument to the extract() and bare_extraction() functions. Authors blacklisting following the same model would be nice to have.
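An author blacklist along these lines might work as in the sketch below. The function `filter_authors` is purely hypothetical and not part of trafilatura; it assumes multiple authors are joined with "; " as in the examples in this thread, and that an empty result should let the caller fall back to other extraction methods.

```python
def filter_authors(author, blacklist):
    """Drop blacklisted names from a "; "-separated author string.

    Hypothetical helper: returns None when nothing is left, so that the
    caller can keep looking for an author elsewhere in the page.
    """
    if not author:
        return None
    blocked = {name.strip().lower() for name in blacklist}
    kept = [
        name.strip()
        for name in author.split(";")
        if name.strip() and name.strip().lower() not in blocked
    ]
    return "; ".join(kept) or None


print(filter_authors("Consumers; Liz Alderslade", {"consumers"}))
print(filter_authors("Watermark feed", {"Watermark feed"}))
```

Comparison is case-insensitive here, which is a design choice for the sketch rather than a requirement.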

adbar commented 3 years ago

@felipehertzer I would now move the remaining questions to another thread and close this one. Are you working on a blacklist? (see questions above)

felipehertzer commented 3 years ago

Hi @adbar, yes, alright. I've been working on another project; I'll start on that next week. I was thinking of adding a blacklist to ignore page authors so the code can look further. Some sites put a false author in the meta info.