EventRegistry / event-registry-python

Python package for API access to news articles and events in the Event Registry
http://eventregistry.org/
MIT License

getArticleUris sometimes null sometimes works (based on order / amount of urls) #63

Open Palmik opened 1 year ago

Palmik commented 1 year ago

Example (this happens with both the Python and the REST API, since the Python package just calls the REST API directly):

Multiple URLs (the dailymail URL gets null, but only when it is second; it works when it is first):

curl --request POST \
     --url "http://eventregistry.org/api/v1/articleMapper" \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '
{
    "articleUrl": [
        "https://www.business-standard.com/article/pti-stories/japan-eyes-record-defence-budget-amid-n-korea-china-threats-118083100326_1.html",
        "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html"
    ],
    "includeAllVersions": true,
    "deep": true,
    "apiKey": "XXX"
}'

Response:

{
    "https://www.business-standard.com/article/pti-stories/japan-eyes-record-defence-budget-amid-n-korea-china-threats-118083100326_1.html": "936069503",
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html": null
}

Single URL (the dailymail will be mapped):

curl --request POST \
     --url "http://eventregistry.org/api/v1/articleMapper" \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '
{
    "articleUrl": [
        "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html"
    ],
    "includeAllVersions": true,
    "deep": true,
    "apiKey": "XXX"
}'

Response:

{
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html": "7763074647"
}
Palmik commented 1 year ago

Another interesting example:

{
    "articleUrl": [
        "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490",
        "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html"
    ],
    "includeAllVersions": true,
    "deep": true,
    "apiKey": "XXX"
}

Response:

{
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490": "7763040460",
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html": "7763074647"
}

VS

{
    "articleUrl": [
        "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html",
        "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490"
    ],
    "includeAllVersions": true,
    "deep": true,
    "apiKey": "XXX"
}

Response:

{
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490": "7763040460"
}
gregorleban commented 1 year ago

There doesn't seem to be an error related to this API call.

The article that we have in our DB is "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ns_campaign=1490&ito=1490". When mapping a URL to a URI, we also create and test alternative versions of the URL: one version without the query parameters, and another version without the "www." prefix.
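For illustration, here is a minimal Python sketch of the kind of URL-variant generation described above (the URL as given, a version without query parameters, and a version without the "www." prefix). This is only a guess at the idea, not Event Registry's actual matching code, and candidate_urls is a hypothetical helper name.

from urllib.parse import urlsplit, urlunsplit

def candidate_urls(url: str) -> list:
    # Hypothetical sketch: build the alternative URL versions that might be
    # tried against the DB -- the URL as given, the URL without its query
    # string, and the URL without the "www." prefix.
    parts = urlsplit(url)
    no_params = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    no_www = urlunsplit((parts.scheme, host, parts.path, parts.query, ""))
    variants = []
    for u in (url, no_params, no_www):
        if u not in variants:  # keep order, drop duplicates
            variants.append(u)
    return variants

print(candidate_urls(
    "https://www.dailymail.co.uk/money/saving/article-12583973/"
    "Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html"
    "?ns_mchannel=rss&ito=1490&ns_campaign=1490"))

Presumably each variant is then looked up exactly in the DB (there are no approximate searches, as noted further down), and any exact hit yields the stored URI.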

The URI that you receive is the URI of the article that we have in our database.

Regarding the first reported issue (i.e. not returning a uri when multiple urls are provided):

In your case, it seems that you first made the call with a single url (https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html) and later repeated the query with multiple urls. The article with this url was found to be a duplicate of the article https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ns_campaign=1490&ito=1490, which we had already found and imported earlier. Therefore the article with the url https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html (and uri 7763074647) was removed before you made the second query with multiple urls.

I hope this explains the confusion.

Palmik commented 1 year ago

Hi Greg, thanks for the answer.

Unfortunately all of the example URLs from my original message now return null (this seems like a separate issue), so it's hard to verify. But I seem to recall being able to reproduce this behaviour with the same URLs.

What I would like to achieve is to reliably map URLs (which may carry varying query parameters) to URIs.

As you see, getArticleUris does not seem to be robust to query-parameter variations. In the last example call, I only got the URI for "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490", and not for its duplicate (whereas in the previous call, I got URIs for both).

Since you identified "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html" to be a duplicate of "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490", why not also return the URI for it?

As it stands, I am not sure how to use the API to reliably get back URIs.

gregorleban commented 1 year ago
{ 
    "articleUrl": [
        "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ns_campaign=1490&ito=1490"
    ],  
    "includeAllVersions": true,
    "deep": true,
    "apiKey": "{{ _.prodApiKey }}"
}  

This call does not return null, since that is the url that we have in the DB.

What you would like to achieve is exactly what the article mapper is meant for. The only issue is that if you provide a url that we don't have in the DB, then we cannot return a URI for it.

If you provide a url that is not exactly the url we have in the DB, then in some cases we can resolve it and in some cases we cannot. If you have a.com/b/c?x=123 while we actually store the url a.com/b/c, then you will receive a valid URI from us, since we also try resolving urls without the params.

If, on the other hand, we store the url a.com/b/c?x=123 and you provide the url a.com/b/c, then we cannot give you a valid URI, since we don't do approximate searches in our DB and we cannot guess the extra params that would make your url match ours.
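To make the asymmetry concrete, this is how the hypothetical candidate_urls helper sketched earlier would behave in the two cases (again just an illustration of the logic described, not the actual implementation):

candidate_urls("https://a.com/b/c?x=123")
# -> ["https://a.com/b/c?x=123", "https://a.com/b/c"]
#    if the DB stores "https://a.com/b/c", the stripped variant matches exactly

candidate_urls("https://a.com/b/c")
# -> ["https://a.com/b/c"]
#    if the DB stores "https://a.com/b/c?x=123", nothing matches, because the
#    missing "?x=123" cannot be guessed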

We cannot return the URI for https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html, as this article was removed from the DB and we no longer have any record of it.

My suggestion is that you first run your URLs through the articleMapper. You then take the articles for which you get a valid URI, and for the remaining ones you call the extractArticleInfo endpoint (https://newsapi.ai/documentation?tab=extractArticleInfo).
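A sketch of this two-step approach with the Python package could look roughly like the code below. It assumes that er.getArticleUris returns a url -> uri mapping (consistent with the REST responses above) and that extractArticleInfo is exposed through the package's Analytics class with fields such as "title" in its result; double-check both against the current documentation before relying on them.

from eventregistry import EventRegistry, Analytics

er = EventRegistry(apiKey="XXX")
analytics = Analytics(er)

urls = [
    "https://www.business-standard.com/article/pti-stories/japan-eyes-record-defence-budget-amid-n-korea-china-threats-118083100326_1.html",
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html",
]

# Step 1: cheap bulk mapping of URLs to URIs for articles that are in the DB.
url_to_uri = er.getArticleUris(urls)

# Step 2: for URLs the mapper could not resolve (null/None), fall back to
# extracting the article content directly from the page.
for url in urls:
    uri = url_to_uri.get(url)
    if uri is None:
        info = analytics.extractArticleInfo(url)
        print(url, "-> extracted:", info.get("title"))
    else:
        print(url, "-> uri:", uri)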

Do you have a particular reason why you need specifically the articleMapper?

Palmik commented 1 year ago

Yes, the reason for using articleMapper is that I source URLs from various places, and not just from newsapi.ai. Therefore the URLs might come with various extra query params attached that might not match the ones that newsapi.ai is storing.

Out of curiosity, why would the article get deleted?

gregorleban commented 1 year ago

> Yes, the reason for using articleMapper is that I source URLs from various places, and not just from newsapi.ai. Therefore the URLs might come with various extra query params attached that might not match the ones that newsapi.ai is storing.

Ok. Does the endpoint that I suggested for you (https://newsapi.ai/documentation?tab=extractArticleInfo) therefore work for your purposes?

> Out of curiosity, why would the article get deleted?

The articles that get deleted are duplicate articles from the same source. If we see that we imported the same article multiple times under different urls, we remove the duplicates, since they bring no value to any user.

Palmik commented 1 year ago

Yes, that endpoint returns the article content even for the URLs where ArticleMapper returns null (which I still don't understand -- why couldn't ArticleMapper use the same URL -> URI resolution logic?). However, it's ~9 times more expensive than ArticleMapper + GetArticle (to get 100 articles from given URLs I need 100 tokens with ExtractArticleInfo, but only 11 tokens with ArticleMapper + GetArticle), so it won't be feasible for our use case.

gregorleban commented 1 year ago

Extract article info should use 0.05 tokens per url, so 5 tokens per 100 articles. Article mapper cannot return a URI for a URL that it hasn't seen or doesn't keep in our database. If your articles come from sources that we do cover, I don't think there will be many articles for which the article mapper will return null.


Palmik commented 1 year ago

I see, that's great to know about the ExtractArticleInfo token usage; it seems even better and easier than ArticleMapper + GetArticle. Based on this, I consider my issue resolved.

But I have to say the API is quite unintuitive in this regard. I see no reason why e.g. "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html" would return null with ArticleMapper, yet ExtractArticleInfo has no problem finding the URL. So given that ExtractArticleInfo has the information, your system knows about the URL.

gregorleban commented 1 year ago

Aha, sorry for the misunderstanding. The extract article info endpoint does not use our DB at all. It uses our information extraction service: it downloads the page at the given URL and extracts the article information directly from the page content.
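So, for example, a call like the following (again assuming the package's Analytics.extractArticleInfo wrapper, as in the earlier sketch) works even for a URL the articleMapper has no record of, because the page itself is fetched and parsed:

from eventregistry import EventRegistry, Analytics

er = EventRegistry(apiKey="XXX")
info = Analytics(er).extractArticleInfo(
    "https://www.dailymail.co.uk/money/saving/article-12583973/"
    "Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html")
# The service downloads the live page and extracts the article fields from it,
# independently of whether the article is stored in the Event Registry DB.
print(info.get("title"))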
