SuLab / WikidataIntegrator

A Wikidata Python module integrating the MediaWiki API and the Wikidata SPARQL endpoint
MIT License

The entity is too big #176

Closed: pierredittgen closed this issue 3 years ago

pierredittgen commented 3 years ago

I am using wikidataintegrator 0.8.7.

Trying to add 2000 WDQuantity statements (same property, only the value differs) to an existing item, I encounter a problem on item.write():

File "/usr/local/lib/python3.8/site-packages/wikidataintegrator/wdi_core.py", line 1245, in write
    raise WDApiError(json_data)
wikidataintegrator.wdi_core.WDApiError: {'error': {'code': 'rawmessage', 'info': 'The entity is too big. The maximum allowed entity size is 2 MB.', 'messages': [{'name': 'wikibase-api-', 'parameters': ['The entity is too big. The maximum allowed entity size is 2 MB.'], 'html': {'*': 'The entity is too big. The maximum allowed entity size is 2 MB.'}}, {'name': 'wikibase-error-entity-too-big', 'parameters': [{'size': 2097152}], 'html': {'*': 'The entity is too big. The maximum allowed entity size is 2 MB.'}}], ...
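
For context, the code that triggers this looks roughly like the sketch below. The endpoints, credentials, item ID and property ID are placeholders, not the real values from my bot:

    from wikidataintegrator import wdi_core, wdi_login

    # Placeholder Wikibase endpoints and credentials
    MEDIAWIKI_API_URL = "https://my-wikibase.example.org/w/api.php"
    SPARQL_ENDPOINT_URL = "https://my-wikibase.example.org/query/sparql"

    login = wdi_login.WDLogin("bot_user", "bot_password",
                              mediawiki_api_url=MEDIAWIKI_API_URL)

    # 2000 quantity statements on the same property, only the value differs
    statements = [wdi_core.WDQuantity(value=v, prop_nr="P123") for v in range(2000)]

    item = wdi_core.WDItemEngine(wd_item_id="Q42",
                                 data=statements,
                                 mediawiki_api_url=MEDIAWIKI_API_URL,
                                 sparql_endpoint_url=SPARQL_ENDPOINT_URL)
    item.write(login)  # raises WDApiError once the entity JSON exceeds 2 MB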

I understand that the query hits my wikibase's limits. I could increase the maximum allowed size on the server side, but that won't solve the problem if the amount of data to send keeps growing...

I wonder whether the wikidataintegrator library provides a way to deal with big insertion queries like this, e.g. iterating over the data and splitting big queries into smaller ones so as to stay within the wikibase limits?

andrawaag commented 3 years ago

The write function is a wrapper around the wbeditentity API call in Wikibase. As such, it is not possible to tweak it to split the query into manageable parts.

There are some hacks that could apply here, though.

  1. One is to invert the directionality of a property: instead of adding 2000 items to a single item, add 1 item to each of 2000 items. I am only adding this as an example; I am aware it will not work in your case, where you are adding 2000 WDQuantity values.
  2. If it is not possible to extend the entity size accepted by your Wikibase, I am afraid the only option is to write wrappers around per-statement API calls like wbsetclaim. Those wrappers are straightforward to write, and some are already implemented, e.g. get_wd_entity. A rough sketch of such a wrapper follows this list.
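
A rough sketch of such a per-statement wrapper, using plain requests against the MediaWiki API. Note that I use wbcreateclaim here rather than wbsetclaim, since wbcreateclaim builds a new claim from a property and a value and does not need a claim GUID; the URL, item and property IDs are placeholders, and login handling is left out (the session is assumed to be authenticated already):

    import json
    import requests

    API_URL = "https://my-wikibase.example.org/w/api.php"  # placeholder
    session = requests.Session()  # assumed to carry login cookies already

    def get_csrf_token(session):
        """Fetch an edit (CSRF) token from the MediaWiki API."""
        r = session.get(API_URL, params={
            "action": "query", "meta": "tokens", "type": "csrf", "format": "json"})
        return r.json()["query"]["tokens"]["csrftoken"]

    def add_quantity_claim(session, token, item_id, prop_id, amount):
        """Add a single quantity statement to an item with one API call."""
        value = json.dumps({"amount": "%+d" % amount, "unit": "1"})
        r = session.post(API_URL, data={
            "action": "wbcreateclaim",
            "entity": item_id,
            "property": prop_id,
            "snaktype": "value",
            "value": value,
            "token": token,
            "format": "json",
        })
        r.raise_for_status()
        return r.json()

    token = get_csrf_token(session)
    for v in range(2000):
        add_quantity_claim(session, token, "Q42", "P123", v)  # 2000 separate API calls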

Initially, I was actually building bots this way, i.e. statement by statement, qualifier by qualifier, reference by reference. Not long after I started doing so, I got pushback for the large number of API calls needed to edit a single item. It was in that interaction that I learned about the wbeditentity call, which submits a whole item in a single upload, in contrast to one call per statement. In your case, the 2000+ quantities would lead to 2000+ API calls and, if those include references and qualifiers, even more.

Long story short, cutting big queries into smaller ones can lead to a huge increase in API load.

I also don't know whether the allowed size still acts as a threshold when the edit is split over multiple API calls. If I am not mistaken, a Wikibase item is a JSON blob stored in a relational database. If the limit is set on the full size of that JSON blob, your only options are to increase the limit or to restructure the semantic model (i.e. split the item into multiple items).
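
As a sketch of that last option, splitting the data over many small items instead of one big one (the property IDs, labels and endpoint are placeholders, and error handling is omitted):

    from wikidataintegrator import wdi_core, wdi_login

    MEDIAWIKI_API_URL = "https://my-wikibase.example.org/w/api.php"  # placeholder
    login = wdi_login.WDLogin("bot_user", "bot_password",
                              mediawiki_api_url=MEDIAWIKI_API_URL)

    HUB_ITEM = "Q42"        # the item that was getting too big
    QUANTITY_PROP = "P123"  # the quantity property
    PART_OF_PROP = "P361"   # an item-type property linking each small item to the hub

    values = range(2000)  # stand-in for the 2000 quantity values
    for i, value in enumerate(values):
        data = [
            wdi_core.WDQuantity(value=value, prop_nr=QUANTITY_PROP),
            wdi_core.WDItemID(value=HUB_ITEM, prop_nr=PART_OF_PROP),
        ]
        item = wdi_core.WDItemEngine(new_item=True, data=data,
                                     mediawiki_api_url=MEDIAWIKI_API_URL)
        item.set_label("measurement %d of Q42" % i, lang="en")
        item.write(login)  # each small item stays far below the 2 MB entity limit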

pierredittgen commented 3 years ago

Thanks @andrawaag for this explanation and these suggestions.