FIWARE / context.Orion-LD

Context Broker and CEF building block for context data management which supports both the NGSI-LD and the NGSI-v2 APIs
https://www.etsi.org/deliver/etsi_gs/CIM/001_099/009/01.06.01_60/gs_CIM009v010601p.pdf
GNU Affero General Public License v3.0

FATAL orionldState error during concurrent POST requests #408

Open michaeI-s opened 4 years ago

michaeI-s commented 4 years ago

Hi,

I'm doing a lot of POST /entities requests from a request queue in a Node.js script. After some number of successfully served requests I get a "socket hang up" error in my script. The Orion-LD log reads:

time=Tuesday 10 Mar 17:16:35 2020.571Z | lvl=FATAL | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=orionldState.cpp[239]:orionldStateDelayedKjFreeEnqueue | msg=Internal Error (the size of orionldState.delayedKjFreeVec needs to be augmented)

It's not actually a crash, because there is no stack trace. After that, Orion-LD is no longer reachable: its Docker container has exited.

I'm using the following Orion-LD version:

{
  "Orion-LD version": "post-v0.2.0",
  "based on orion": "1.15.0-next",
  "kbase version": "0.4",
  "kalloc version": "0.4",
  "khash version": "0.4",
  "kjson version": "0.4",
  "boost version": "1_62",
  "microhttpd version": "0.9.48-0",
  "openssl version": "OpenSSL 1.1.0l  10 Sep 2019",
  "mongo version": "1.1.3",
  "rapidjson version": "1.0.2",
  "libcurl version": "7.52.1",
  "libuuid version": "UNKNOWN",
  "branch": "(HEAD",
  "Next File Descriptor": 18
}

Kind regards, Michael

kzangeli commented 4 years ago

Ok! The broker exits willingly and tells us what's wrong. The fix is simply to increase a size. I will take some time, though, to try to understand why a bigger size is needed.
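To make the failure mode concrete, here is a minimal sketch of a fixed-capacity "free later" vector. It is not Orion-LD's actual code: the type, the function names, and the capacity of 50 are all hypothetical. It only illustrates why a fixed size can overflow under load, and how an enqueue function could report the overflow to its caller instead of terminating the process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed size; NOT the real value used by Orion-LD. */
#define DELAYED_FREE_CAPACITY 50

/* Hypothetical stand-in for a per-request list of pointers to free later. */
typedef struct DelayedFreeVec {
    void  *items[DELAYED_FREE_CAPACITY];
    size_t used;
} DelayedFreeVec;

/* Queues ptr for a later free().
 * Returns 0 on success, -1 when the vector is already full. */
int delayedFreeEnqueue(DelayedFreeVec *vec, void *ptr)
{
    assert(vec != NULL);
    if (vec->used >= DELAYED_FREE_CAPACITY)
        return -1;  /* caller decides: respond with an error, do not exit */
    vec->items[vec->used++] = ptr;
    return 0;
}

/* Frees every queued pointer and resets the vector.
 * Returns the number of pointers that were released. */
size_t delayedFreeRelease(DelayedFreeVec *vec)
{
    assert(vec != NULL);
    size_t released = vec->used;
    for (size_t i = 0; i < vec->used; i++)
        free(vec->items[i]);
    vec->used = 0;
    return released;
}
```

With this shape, a request that queues more pointers than the capacity gets -1 back from the enqueue call, so the request can fail on its own while the process keeps serving others.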

michaeI-s commented 4 years ago

Still

time=Wednesday 06 May 13:07:59 2020.051Z | lvl=FATAL | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=orionldState.cpp[243]:orionldStateDelayedKjFreeEnqueue | msg=Internal Error (the size of orionldState.delayedKjFreeVec needs to be augmented)

with the latest version:

{
  "orionld version": "post-v0.2.0",
  "orion version": "1.15.0-next",
  "uptime": "0 d, 0 h, 0 m, 30 s",
  "git_hash": "c38344b87377681a4ef8aea84a9937b7f2319d9b",
  "compile_time": "Tue May 5 17:58:03 UTC 2020",
  "compiled_by": "root",
  "compiled_in": "9e9a6eb98eaa",
  "release_date": "Tue May 5 17:58:03 UTC 2020",
  "doc": "https://fiware-orion.readthedocs.org/en/master/"
}

At least on my test machine and with my data, this error is reproducible when setting the limit parameter to >=500:

curl --location --request GET '<ORION-LD-ADDRESS>/ngsi-ld/v1/entities?type=WeatherObserved&limit=500' \
--header 'Accept: application/ld+json' \
--header 'Link: <https://fiware.github.io/data-models/context.jsonld>; rel="http://www.w3.org/ns/json-ld#context"; type="application/ld+json"'

Orion v2 allows a maximum of 1000, so that value should be expected to work here, too. Otherwise, an error should be returned stating the maximum value allowed for 'limit'. Currently, Orion-LD simply stops responding after this fatal error has occurred. Of course this must not happen.
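The kind of check meant here could look roughly like the sketch below. It assumes a maximum of 1000 (the Orion v2 figure mentioned above) and a minimum of 1. The function name, the macro, and the error text are hypothetical and are not taken from the Orion-LD source.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed maximum, mirroring the Orion v2 figure; hypothetical name. */
#define MAX_LIMIT 1000

/* Parses the text of a 'limit' URI parameter.
 * On success: stores the value in *out and returns 0.
 * On failure: writes a message into errBuf and returns -1, so the caller
 * can answer with a client error instead of processing the request. */
int parseLimit(const char *text, long *out, char *errBuf, size_t errBufSize)
{
    assert(text != NULL && out != NULL && errBuf != NULL);

    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);

    /* Reject empty input, trailing garbage, overflow, and out-of-range values. */
    if (errno != 0 || end == text || *end != '\0' ||
        value < 1 || value > MAX_LIMIT) {
        snprintf(errBuf, errBufSize,
                 "invalid value for 'limit': expected an integer from 1 to %d",
                 MAX_LIMIT);
        return -1;
    }

    *out = value;
    return 0;
}
```

Called with "500" or "1000" this returns 0. Called with "1001", an empty string, or non-numeric text it returns -1 and fills in the message.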

kzangeli commented 4 years ago

Yes, this is most definitely a bug. Orion-LD has a default limit of 1000, just like Orion. I'm pretty sure this is a problem with the buffer size of the rendered output. It would be easy to simply make it work for such big buffers (allocate a buffer of 1 gigabyte :) ). Streamlining the calls to malloc and realloc is not so easy ... I will need to take a look at this ASAP.
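One common way to keep the number of realloc calls low without reserving a huge buffer up front is geometric growth: double the capacity whenever the output no longer fits. The sketch below shows that general technique only. It is not the fix that went into Orion-LD, and the OutBuf type, the function name, and the 1024-byte starting size are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical output buffer for rendered responses. */
typedef struct OutBuf {
    char  *data;  /* NUL-terminated contents, or NULL while empty */
    size_t len;   /* bytes used, not counting the terminating NUL */
    size_t cap;   /* bytes allocated */
} OutBuf;

/* Appends text to the buffer, doubling the capacity when needed.
 * Returns 0 on success, -1 if memory could not be allocated
 * (the buffer is left unchanged in that case). */
int outBufAppend(OutBuf *buf, const char *text)
{
    assert(buf != NULL && text != NULL);

    size_t add    = strlen(text);
    size_t needed = buf->len + add + 1;  /* +1 for the terminating NUL */

    if (needed > buf->cap) {
        size_t newCap = (buf->cap == 0) ? 1024 : buf->cap;
        while (newCap < needed)
            newCap *= 2;

        /* realloc(NULL, n) behaves like malloc(n), so the first call works too. */
        char *grown = realloc(buf->data, newCap);
        if (grown == NULL)
            return -1;
        buf->data = grown;
        buf->cap  = newCap;
    }

    memcpy(buf->data + buf->len, text, add + 1);  /* copies the NUL as well */
    buf->len += add;
    return 0;
}
```

Because the capacity doubles, appending 1000 rendered entities triggers only a handful of reallocations, and a failed allocation is reported to the caller rather than being fatal.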