Open Palmik opened 1 year ago
Another interesting example:
{
  "articleUrl": [
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490",
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html"
  ],
  "includeAllVersions": true,
  "deep": true,
  "apiKey": "XXX"
}
{
  "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490": "7763040460",
  "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html": "7763074647"
}
VS
{
  "articleUrl": [
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html",
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490"
  ],
  "includeAllVersions": true,
  "deep": true,
  "apiKey": "XXX"
}
{
  "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490": "7763040460"
}
There doesn't seem to be an error related to this API call.
The article that we have in our DB is "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ns_campaign=1490&ito=1490". When mapping to the URI we also create alternative versions of the urls that we test. One version is without the parameters. Another version is without the "www." prefix.
The URI that you receive is the URI of the article that we have in our database.
Regarding the first reported issue (i.e. not returning a URI when providing multiple URLs):
In your case, it seems that you first made the call with a single URL (https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html) and later repeated the query with multiple URLs. The article with this URL was found to be a duplicate of the article https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ns_campaign=1490&ito=1490, which we had already found and imported earlier. Therefore the article with URL https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html (and URI 7763074647) was removed before you made the second query with multiple URLs.
I hope this explains the confusion.
Hi Greg, thanks for the answer.
Unfortunately all of the example URLs from my original message now return null (this seems like a separate issue), so it's hard to verify. But I seem to recall being able to reproduce this behaviour with the same URLs.
What I would like to achieve is: if https://example.com/foo is the same as https://example.com/foo?bar=1, then both of these URLs should return some (the same?) URI.
As you see, getArticleUris does not seem to be robust to query parameter variations. In the last example call, I only got a URI for "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490", and not the duplicate (whereas in the previous call, I got a URI for both).
Since you identified "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html" to be a duplicate of "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490", why not also return the URI for it?
As it stands, I am not sure how to use the API to reliably get back URIs.
{
  "articleUrl": [
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ns_campaign=1490&ito=1490"
  ],
  "includeAllVersions": true,
  "deep": true,
  "apiKey": "{{ _.prodApiKey }}"
}
This call does not return null, since that is the URL that we have in the DB.
What you would like to achieve is generally exactly what the article mapper is for. The only issue is that if you have a url that we don't have in the DB, then we cannot return it.
If you provide a url that is not exactly the url that we have in the DB, then in some cases we can resolve the issue and in some not.
If you have:
a.com/b/c?x=123
when we actually store url:
a.com/b/c
then you will receive a valid URI from us since we also try resolving to urls without the params.
If, on the other hand, we store url
a.com/b/c?x=123
and you provide us url:
a.com/b/c
then we cannot provide you a valid URI since we don't do approximate searches in our DB and we cannot guess the extra params to your url that would then match our url.
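The one-directional behaviour described above can be illustrated with a small sketch. This is not Event Registry's actual implementation; it only mimics the two variants mentioned in this thread (dropping the query parameters and dropping the "www." prefix) and uses a plain dictionary as a stand-in for the database. The names `candidate_urls` and `lookup` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(url):
    """Return the URL itself plus the simplified variants described in the thread."""
    parts = urlsplit(url)
    candidates = [url]
    # Variant without query parameters (and without a fragment).
    candidates.append(urlunsplit((parts.scheme, parts.netloc, parts.path, "", "")))
    # Variant without the "www." prefix.
    if parts.netloc.startswith("www."):
        candidates.append(
            urlunsplit((parts.scheme, parts.netloc[4:], parts.path, parts.query, parts.fragment))
        )
    # Remove repeats while keeping the original order.
    return list(dict.fromkeys(candidates))


def lookup(url, stored):
    """Exact-match lookup over the candidates; `stored` is a stand-in for the DB."""
    for candidate in candidate_urls(url):
        if candidate in stored:
            return stored[candidate]
    return None


# Stored without params, queried with params: resolves.
print(lookup("https://a.com/b/c?x=123", {"https://a.com/b/c": "uri-1"}))
# Stored with params, queried without: no exact candidate matches, so None.
print(lookup("https://a.com/b/c", {"https://a.com/b/c?x=123": "uri-2"}))
```

Parameters can be stripped from the URL you provide, but missing parameters cannot be guessed, which is why only the first lookup succeeds.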
We cannot return you the URI for https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html, as this article was removed from the DB and we have no record of it anymore.
My suggestion is that you use the API for your articles. You then take the articles for which you get a valid URI, and for the remaining ones you call the https://newsapi.ai/documentation?tab=extractArticleInfo endpoint.
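The suggested workflow could be sketched roughly as follows. The two callables `map_uri` and `extract_info` are hypothetical stand-ins: in practice they would wrap the article mapper call and the extractArticleInfo endpoint (for the Python client, presumably something like `ArticleMapper.getArticleUri` and `Analytics.extractArticleInfo`, but that is an assumption to check against the client documentation). The function name `resolve_articles` is also made up for illustration.

```python
def resolve_articles(urls, map_uri, extract_info):
    """Try the article mapper first; fall back to extraction for unmapped URLs.

    map_uri(url) should return a URI string or None.
    extract_info(url) should return the extracted article information.
    Both are injected so the sketch runs without network access.
    """
    mapped = {}      # url -> URI found by the mapper
    extracted = {}   # url -> info extracted directly from the page
    for url in urls:
        uri = map_uri(url)
        if uri is not None:
            mapped[url] = uri
        else:
            extracted[url] = extract_info(url)
    return mapped, extracted


# Usage with fake backends standing in for the real API calls.
known = {"https://a.com/b/c": "12345"}
mapped, extracted = resolve_articles(
    ["https://a.com/b/c", "https://a.com/other"],
    map_uri=known.get,
    extract_info=lambda url: {"url": url, "title": "(extracted)"},
)
print(mapped)
print(extracted)
```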
Do you have a particular reason why you need specifically the articleMapper?
Yes, the reason for using articleMapper is that I source URLs from various places, and not just from newsapi.ai. Therefore the URLs might come with various extra query params attached that might not match the ones that newsapi.ai is storing.
Out of curiosity, why would the article get deleted?
Ok. Does the endpoint that I suggested for you (https://newsapi.ai/documentation?tab=extractArticleInfo) therefore work for your purposes?
The articles that get deleted are duplicated articles that come from the same source. So if we see that we imported the same article with a different url multiple times, we remove such duplicates since they bring no value to any user.
Yes, that endpoint returns the article content even for the URLs where ArticleMapper returns null (which I still don't understand -- why could ArticleMapper not use the same URL -> URI resolution logic?). However, it's ~9 times more expensive compared to ArticleMapper + GetArticle (to get 100 articles from given URLs with ExtractArticleInfo, I need 100 tokens; to get 100 articles from ArticleMapper + GetArticle, I need 11 tokens), so it won't be feasible for our use case.
Extract article info should use 0.05 tokens per URL, so 5 tokens per 100 articles. Article mapper cannot return you an article for a URL that it hasn't seen or that we don't keep in our database. If you have articles from the sources that we do cover, I don't think there will be many articles for which the article mapper will return null.
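For reference, the token figures quoted in this exchange can be compared with a few lines of arithmetic. All numbers below are taken from the two comments above; nothing here is an official pricing table.

```python
# Figures quoted in this thread, all for a batch of 100 URLs.
N_URLS = 100
EXTRACT_TOKENS_PER_URL = 0.05   # stated by the maintainer
MAPPER_PLUS_GET_TOKENS = 11     # ArticleMapper + GetArticle, as reported by the user
ASSUMED_EXTRACT_TOKENS = 100    # the user's initial assumption (1 token per URL)

# Cost of ExtractArticleInfo at the stated rate.
extract_tokens = N_URLS * EXTRACT_TOKENS_PER_URL

# Ratio under the initial assumption (the "~9 times" figure) and at the stated rate.
assumed_ratio = ASSUMED_EXTRACT_TOKENS / MAPPER_PLUS_GET_TOKENS
actual_ratio = extract_tokens / MAPPER_PLUS_GET_TOKENS

print(f"ExtractArticleInfo at the stated rate: {extract_tokens:g} tokens")
print(f"Assumed ratio vs mapper + get: {assumed_ratio:.1f}x")
print(f"Ratio at the stated rate: {actual_ratio:.2f}x")
```

At the stated rate, ExtractArticleInfo comes out cheaper than the 11 tokens reported for ArticleMapper + GetArticle rather than ~9 times more expensive.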
I see, that's great to know about ExtractArticleInfo token usage, seems even better and easier than the ArticleMapper + GetArticle. Based on this I consider my issue resolved.
But I have to say the API is quite unintuitive in this regard. I see no reason why e.g. "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html" would return null with ArticleMapper, yet ExtractArticleInfo has no problem finding the URL. So given that ExtractArticleInfo has the information, your system knows about the URL.
Aha, sorry for misunderstanding. The Extract article info endpoint is actually not using our db at all. It is using our information extraction service to extract the article information directly from the URL. So it downloads the page and extracts the article information directly from the page.
Example (this happens for both the Python and REST API, as the Python client just calls the REST API directly):
Multiple URLs (the dailymail will get null -- only if it's second, it works if it's first!):
Single URL (the dailymail will be mapped):