Closed felipehertzer closed 2 years ago
Hi @felipehertzer, thank you for your feedback! Let's examine your pull request #91 first and then move on to this list.
@felipehertzer Did you solve part of the problems above in the PR? Do you have other examples where author extraction fails?
@adbar These sites do not contain a meta tag, or the HTML is too confusing to extract from via XPath. But they use JavaScript, which I believe belongs to a framework, to fill in the data on the front end.
Perthnow -> "byline":{"text":"Finn McHugh"}
This code is found inside a simple script tag.
ESPN -> "articles":[{"id":31807952,"author":"Andrew McGlashan","trackingName":"&lpos=:31807952"},
This code is found inside a simple script tag.
I'm not sure, but maybe we can search for these tags in the JavaScript.
Discovery Channel -> "author":"Ian Shive"
This code is found inside an application/ld+json script tag, but the author format is different from what our regex expects.
I tried to solve the problem by adding the code "author?".+?"([^"]+) along with the request, but this ends up creating a new group, and the current code only expects group 1.
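One possible way around the group-numbering problem, sketched here under assumptions: use non-capturing groups for the alternatives and a single named group for the value, so the calling code no longer depends on group positions. `AUTHOR_RE` and `find_json_authors` are hypothetical names for illustration, not the regex currently in the code base; the sample strings are the ones quoted above.

```python
import re

# Hypothetical pattern (an assumption, not the project's regex).
# (?:...) groups do not create numbered groups, so adding alternatives
# never shifts the index of the captured value.
AUTHOR_RE = re.compile(
    r'"(?:author|byline)"\s*:\s*'                   # key, flat or nested
    r'(?:\{[^{}]*?"(?:name|text)"\s*:\s*)?'         # optional {"name": / {"text":
    r'"(?P<author>[^"]+)"'                          # the value we want
)


def find_json_authors(html):
    """Return every author-like string found in inline JSON."""
    return [match.group("author") for match in AUTHOR_RE.finditer(html)]


samples = [
    '"byline":{"text":"Finn McHugh"}',
    '"articles":[{"id":31807952,"author":"Andrew McGlashan","trackingName":"&lpos=:31807952"},',
    '"author":"Ian Shive"',
]
for sample in samples:
    print(find_json_authors(sample))
```

Reading the value through `match.group("author")` keeps working even if more optional parts are added to the pattern later.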
I believe there are a few more websites; I will update the description of this issue later today.
I've adapted your regex to get the author's name in more formats, but I'm having a problem testing it. More specifically, at [National Geographic](https://www.nationalgeographic.co.uk/environment-and-conservation/2020/01/ravenous-wild-goats-ruled-island-over-century-now-its-being) we would have to check whether the JSON has information about the author or the photographer.
@felipehertzer Thanks for the details, these cases look tricky indeed.
itemprop="author name"
, so it would be doable to override the JSON info found at the top of the article. I don't know if it's worth the hassle, does it occur often in your data? The author strings could be further refined, as "written by" or "by" text parts could be stripped.
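A minimal sketch of that prefix stripping, assuming only the two prefixes mentioned here ("written by" and "by"); `strip_author_prefix` is a hypothetical helper name, not an existing function in the library.

```python
import re

# Hypothetical helper: the prefix list is an assumption based on the
# two examples named in this thread.
PREFIX_RE = re.compile(r'^\s*(?:written\s+by|by)\b[\s:]*', re.IGNORECASE)


def strip_author_prefix(author):
    """Remove a leading 'written by' / 'by' from an author string."""
    return PREFIX_RE.sub('', author).strip()


print(strip_author_prefix("Written by Finn McHugh"))
print(strip_author_prefix("By: Ian Shive"))
print(strip_author_prefix("Byron Smith"))  # left alone: "By" is part of the name
```

The `\b` word boundary is what keeps names that merely start with "By" from being truncated.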
Hi @adbar, thank you very much for the corrections. Some sites I had on the list were fixed by this commit. I found a few more sites, so I will update the list.
In the next commit I will add the option to remove these prefixes.
Hi @adbar, I don't know what the best way would be to solve the problem where sites provide a schema Person that differs from what is shown on the screen. Do you have any ideas on how to fix this? The problem can be seen on the first 3 sites.
We shouldn't trust the schema Person:
Aap
Current: "author": "Watermark feed",
Should be: "author": "Aaron Bunch",
agenda
Current: "author": "Sandy Cheu",
Should be: "author": "Stephen Teulan; Nikita Weikhardt",
angelicans
Current: "author": "Sydneyanglicans net",
Should be: "author": "Hannah Thiem",
Problem: it is using a relative path:
Sen
Current: None
Should be: "author": "Andrew Mcglashan",
Armida
Current: "author": "Bob Freebairn",
Should be: "author": "Mark Griggs",
It is getting the comment author:
Ausgamers
Current: "author": "Sean",
Should be: "author": "KostaAndreadis",
These two have only the given name. Should we only allow full names?
cath
Current: "author": null,
Should be: "author": "Rebecca",
echo
Current: "author": null,
Should be: "author": "Katie",
@felipehertzer I made a mess while merging PR #101 but everything should be fine now.
The tests still don't pass... I changed the CI system because tests weren't getting run anymore, maybe something slipped through while you were working on the PR.
The JSON metadata analysis could also be useful to extract titles, do you have any idea on how to do that?
Hi @adbar, haha, it really was a lot of corrections. I fixed an 'if' that was incorrect.
That reduces the errors: only Python 3.10 and Windows still have a problem. According to the log, the error is "Error: Please make sure the libxml2 and libxslt development packages are installed."
Do you mean the titles of the articles, or what kind?
Hi @felipehertzer, thanks! As long as the tests pass for Python 3.6 to 3.9 on Linux it's fine.
The article schema makes it possible to define a headline: https://schema.org/Article. Using it could be faster and more accurate than looking for title information throughout the article. This also applies to other metadata if you're interested.
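As a rough illustration of the idea, here is a sketch that decodes the JSON-LD payload with the standard `json` module and looks for a `headline` key. Note that this parses the JSON rather than matching it with a regex, which is a different approach from the one currently in the code. `find_headline` and `iter_nodes` are hypothetical names, and the sample HTML is made up for the example.

```python
import json
import re

# Locates the raw content of application/ld+json script tags.
LD_JSON_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def iter_nodes(data):
    """Yield every dict nested anywhere in a decoded JSON-LD payload."""
    if isinstance(data, dict):
        yield data
        for value in data.values():
            yield from iter_nodes(value)
    elif isinstance(data, list):
        for item in data:
            yield from iter_nodes(item)


def find_headline(html):
    """Return the first non-empty schema.org headline, or None."""
    for raw in LD_JSON_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is common, skip it
        for node in iter_nodes(data):
            headline = node.get("headline")
            if isinstance(headline, str) and headline.strip():
                return headline.strip()
    return None


# Made-up sample document, for illustration only.
sample = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "NewsArticle", "headline": " Example headline ", '
    '"author": {"@type": "Person", "name": "Example Author"}}'
    '</script></head></html>'
)
print(find_headline(sample))
```

The same traversal could return other keys (for example `keywords`) by changing the key name, at the cost of failing on pages whose JSON-LD is not valid JSON.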
Hi @felipehertzer, I just saw you already implemented the headline feature, thanks! Are there other metadata from schema.org on your To Do List?
Hi @adbar
We can get the things below as an extra, maybe an array with all the images as well.
Organization/Url
NewsArticle/mainEntityOfPage/WebPage - URL of Page
NewsArticle/keywords
NewsArticle/datePublished
NewsArticle/articleBody
There is a lot of other information, like logo, telephone, Facebook, address and sometimes job titles...
For now, in my to do list:
It is the most urgent thing to fix.
NewsArticle/datePublished is handled by htmldate (on a regex basis) and NewsArticle/articleBody is already present in the code, but also on a regex basis. The latter could indeed be improved.
I'm moving the other ones to the JSON-LD thread (#99) so we can focus on the authors here.
What do you mean by a blacklist of URLs? There is already a url_blacklist argument to the extract() and bare_extraction() functions. Author blacklisting following the same model would be nice to have.
@felipehertzer I would now move the remaining questions to another thread and close this one. Are you working on a blacklist? (see questions above)
Hi @adbar, yes, all right. I've been working on another project; I'll start on that next week. I was thinking of making a blacklist to ignore page authors so the code can look further. Some sites put a false author in the meta info.
We shouldn't trust the schema Person:
agenda
Current: "author": "Sandy Cheu",
Should be: "author": "Stephen Teulan; Nikita Weikhardt",
aged
Current: "author":"Consumers",
Should be: "author": "Liz Alderslade",
Meta: remove single names:
cath
Current: "author": null,
Should be: "author": "Rebecca",
echo
Current: "author": null,
Should be: "author": "Katie",
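To make the blacklist idea concrete, here is a minimal sketch, assuming the extractor can produce several author candidates in priority order (for example metadata first, then the byline in the page body). `pick_author` and its `author_blacklist` parameter are hypothetical names for illustration, not an existing API; the values come from the "aged" case above.

```python
def pick_author(candidates, author_blacklist):
    """Return the first candidate that is not blacklisted, else None.

    candidates: author strings in priority order (assumed input).
    author_blacklist: names to ignore, compared case-insensitively.
    """
    blocked = {name.strip().casefold() for name in author_blacklist}
    for candidate in candidates:
        if not candidate:
            continue  # skip None and empty strings
        cleaned = candidate.strip()
        if cleaned and cleaned.casefold() not in blocked:
            return cleaned
    return None


# The meta info says "Consumers", the visible byline says "Liz Alderslade".
print(pick_author(["Consumers", "Liz Alderslade"], {"consumers"}))
```

With an empty blacklist the function simply returns the first candidate, so the current behaviour would be unchanged unless a user opts in.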