NREL / api-umbrella

Open source API management platform
http://apiumbrella.io
MIT License

Data integrity issue under high load #537

Open profijoeln opened 3 years ago

profijoeln commented 3 years ago

Tested on "latest" image - nrel/api-umbrella@sha256:094c6dc5e96eff5745ba432437c16579b6cc4b7e6a044de2ad902126e8d21fe1

When the system was overloaded, with the load average exceeding the number of CPUs for an extended period, we started noticing misspelled attribute names in our data. The load average was 10+ on 4 CPUs for periods of 5+ minutes. The errors appear only when the system is overloaded.

In the load test scenario we send an average of 400-600 messages/s to Orion Context Broker (https://fiware-orion.readthedocs.io/en/2.4.2/) through API Umbrella. The payloads sent by the load test were recorded to make sure the error was not in the sent data.
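For anyone trying to reproduce this, a minimal sketch of how the sent payloads could be recorded and later checked is shown here. It is not the load test we actually ran: the flat payload shape, the `record_payload` and `unexpected_attributes` helpers, and the in-memory log are all assumptions made for illustration.

```python
import hashlib
import json

# Attribute names the load test is supposed to send (taken from this report).
EXPECTED = ("precipitation", "relativehumidity")


def record_payload(log, payload):
    """Serialize a payload, append its body and SHA-256 digest to log, return the bytes.

    Hypothetical helper: the real load test may serialize and store payloads differently.
    """
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    log.append({"sha256": hashlib.sha256(body).hexdigest(), "body": body.decode("utf-8")})
    return body


def unexpected_attributes(body, expected=EXPECTED):
    """Return the top-level attribute names in a JSON body that are not expected."""
    return sorted(set(json.loads(body)) - set(expected))


if __name__ == "__main__":
    sent_log = []
    # Assumed flat payload shape, used only to exercise the helpers.
    body = record_payload(sent_log, {"precipitation": 1.5, "relativehumidity": 60.0})
    print(sent_log[0]["sha256"], unexpected_attributes(body))
```

Running the same `unexpected_attributes` check on the recorded bodies and on what reaches the backend would show on which side of the proxy the names change.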

The sent payload has 2 attributes, "precipitation" and "relativehumidity", but in the databases we find the following attributes:

pecipitation        LONG
pprecipitation      LONG
prcipitation        LONG
preccipitation      LONG
preciitation        LONG
precipiation        LONG
precipiitation      LONG
precipipitation     LONG
precipitaation      LONG
precipitaion        LONG
precipitatiion      LONG
precipitatin        LONG
precipitatio        LONG
precipitation       FLOAT
precipitationn      LONG
precipitatioon      LONG
precipitaton        LONG
precipitattion      LONG
precipition         LONG
precipititation     LONG
precipittation      LONG
preciptation        LONG
precitation         LONG
precpitation        LONG
prprecipitation     LONG
prrecipitation      LONG
recipitation        LONG
relativehumiditty   LONG
relativehumidity    FLOAT
relativemidity      LONG
rellativehumidity   LONG
rlativehumidity     LONG
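The corrupted names above appear to differ from the two expected names only by a few dropped or repeated characters. A small sketch that measures this with a plain Levenshtein edit distance follows. The `edit_distance` and `nearest_expected` helpers are written for illustration only, and the `OBSERVED` list is just a sample of the names listed above.

```python
# Attribute names the load test actually sends.
EXPECTED = ("precipitation", "relativehumidity")

# A sample of the corrupted names found in the databases.
OBSERVED = ["pecipitation", "precipipitation", "relativemidity", "rellativehumidity"]


def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def nearest_expected(name):
    """Return (distance, expected_name) for the closest expected attribute name."""
    return min((edit_distance(name, e), e) for e in EXPECTED)


if __name__ == "__main__":
    for name in OBSERVED:
        dist, expected = nearest_expected(name)
        print(f"{name:20}{expected:20}{dist}")
```

Small distances for every name would be consistent with bytes being dropped or duplicated in transit rather than with the sender producing random names.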

We tested the same system with Nginx as the reverse proxy and with an increased load, but we were not able to reproduce this error with Nginx.

EDIT: When these errors appeared, we also recorded some 502 errors in response to the sent messages.

Testing was done in a Kubernetes environment on Debian. The error is present with and without SSL termination. No system or component crashed during testing.

We expected massive data loss, but we also experienced a loss of data integrity.

ccsr commented 3 years ago

@profijoeln is there any update about this? Logs or payload?