FIWARE / context.Orion-LD

Context Broker and CEF building block for context data management which supports both the NGSI-LD and the NGSI-v2 APIs
https://www.etsi.org/deliver/etsi_gs/CIM/001_099/009/01.06.01_60/gs_CIM009v010601p.pdf
GNU Affero General Public License v3.0

Orion-LD (1.4.0) crashes with double free detected in tcache 2 #1499

Closed FR-ADDIX closed 8 months ago

FR-ADDIX commented 11 months ago

Orion-LD v1.4.0, Kubernetes v1.26.9

time=Wednesday 13 Dec 10:10:22 2023.838Z | lvl=WARN | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=mongocInit.cpp[58]:mongocLog | msg=MONGOC[mongoc]: Falling back to malloc for counters.                        
free(): double free detected in tcache 2                                                                                                                                                                                                      
Stream closed EOF for fiware/orion-ld-deployment-67cf4d9559-lscl2 (orion-ld)   
kzangeli commented 11 months ago

"tcache 2" ? Not a clue what that might be ... I¡m going to need some more info here to have a chance to look into this issue. The usual stuff, how the broker was started, what requests are sent to the broker ...

ibordach commented 10 months ago

This is very difficult to reproduce. The double-free crash always comes together with the "Falling back to malloc for counters" warning from mongoc. This is our deployment spec:

    spec:
      containers:
      - args:
        - -dbhost
        - mongodb-headless.fiware.svc.cluster.local:27017/
        - -logLevel
        - DEBUG
        - -ctxTimeout
        - "10000"
        - -mongocOnly
        - -lmtmp
        - -reqPoolSize
        - "120"
        - -logSummary
        - "30"
        env:
        - name: ORIONLD_CORE_CONTEXT
          value: v1.0
        - name: ORIONLD_MONGO_REPLICA_SET
          value: rs0
        - name: ORIONLD_MONGO_USER
          value: root
        - name: ORIONLD_MONGO_TIMEOUT
          value: "5000"
        - name: ORIONLD_MONGO_POOL_SIZE
          value: "180"
        - name: ORIONLD_MONGO_ID_INDEX
          value: "TRUE"
        - name: ORIONLD_STAT_COUNTERS
          value: "TRUE"
        - name: ORIONLD_STAT_SEM_WAIT
          value: "TRUE"
        - name: ORIONLD_STAT_TIMING
          value: "TRUE"
        - name: ORIONLD_STAT_NOTIF_QUEUE
          value: "TRUE"
        - name: ORIONLD_SUBCACHE_IVAL
          value: "10"
        - name: ORIONLD_DISTRIBUTED
          value: "TRUE"
        - name: ORIONLD_CONTEXT_DOWNLOAD_TIMEOUT
          value: "10000"
        - name: ORIONLD_NOTIF_MODE
          value: threadpool:252:4
        - name: ORIONLD_DEBUG_CURL
          value: "TRUE"
ibordach commented 10 months ago

It looks like a race condition under "high" load.

ibordach commented 10 months ago

We have currently reduced our setup from three Orion-LD instances to one, with no subscriptions. The Orion receives roughly 2000 messages per second and keeps dying with this double-free error. The same happens with 3 or 6 Orions.

FR-ADDIX commented 10 months ago

Are you using C++ and DynamicArray somewhere? "free(): double free detected in tcache 2" seems to be the error message that appears when more elements are deleted from an array than are present.

kzangeli commented 10 months ago

Are you using C++ and DynamicArray somewhere?

Not a fan of C++. I'm a C guy ... I'm replacing most of the old C++ code of Orion with pure C in Orion-LD.

Now, if you could start the broker inside GDB or Valgrind, we'd get a stack trace. That would be extremely helpful to understand where the problem comes from. Either that, or a way for me to reproduce the crash.

ibordach commented 10 months ago

OK, we have the chance to get the core files. We also started a pod with a gdb version of orion-ld. However, we have no debug info for gdb. How can we generate more information for you? I'm not a C guy. ;-)

Orion: 1.4.0-PRE-1425-GDB

(gdb) bt
#0  0x00000000004bb2a4 in ?? ()
#1  0x00000000004974ad in ?? ()
#2  0x00000000007fa132 in ?? ()
#3  0x00000000007fb870 in ?? ()
#4  0x0000000000803454 in ?? ()
#5  0x0000000000804105 in ?? ()
#6  0x00007fc6a18541cf in start_thread () from /lib64/libpthread.so.0
#7  0x00007fc69cd36dd3 in clone () from /lib64/libc.so.6
(gdb)
ibordach commented 10 months ago

Sorry for the image.

kzangeli commented 10 months ago

ok, getting closer. mongoc_collection_destroy called twice. That's already a good hint. However, everything in between is symbol-less, so it could be better. I think we already create debug images on dockerhub; it would just be a question of using one of those instead (with debug symbols and not stripped). This one, for example: 1.5.0-PRE-1551-debug (or just any image whose name ends in "-debug").

Now, when you go from RELEASE (-O2) compilation to DEBUG (-g) compilation, the executable changes. Quite a lot. Often, bugs disappear or move elsewhere. Let's see if we're lucky!

ibordach commented 10 months ago

now with 1.5.0-PRE-1551-debug:

(gdb) bt
#0  0x00007f4cb1ed0ee1 in abort () from /lib64/libc.so.6
#1  0x00007f4cb1f40037 in __libc_message () from /lib64/libc.so.6
#2  0x00007f4cb1f4719c in malloc_printerr () from /lib64/libc.so.6
#3  0x00007f4cb1f48f0d in _int_free () from /lib64/libc.so.6
#4  0x00007f4cb2e1623e in mongoc_collection_destroy (collection=0x7f4c3000e250) at /cmake-3.14.5/mongo-c-driver-1.22.0/src/libmongoc/src/mongoc/mongoc-collection.c:246
#5  0x000000000048fa0d in requestCompleted(void*, MHD_Connection*, void**, MHD_RequestTerminationCode) ()
#6  0x0000000000807622 in MHD_connection_close_ (connection=connection@entry=0x7f4c30000d60, termination_code=<optimized out>) at connection.c:842
#7  0x0000000000808949 in connection_reset (reuse=false, connection=0x7f4c30000d60) at connection.c:4328
#8  MHD_connection_handle_idle (connection=connection@entry=0x7f4c30000d60) at connection.c:4832
#9  0x000000000080bbb9 in call_handlers (con=con@entry=0x7f4c30000d60, read_ready=<optimized out>, write_ready=<optimized out>, force_close=<optimized out>) at daemon.c:1234
#10 0x0000000000810a48 in MHD_epoll (daemon=0x2245f90, millisec=<optimized out>) at daemon.c:5149
#11 0x0000000000811695 in MHD_polling_thread (cls=0x2245f90) at daemon.c:5386
#12 0x00007f4cb6a011cf in start_thread () from /lib64/libpthread.so.0
#13 0x00007f4cb1ee8dd3 in clone () from /lib64/libc.so.6
(gdb)
kzangeli commented 10 months ago

ok! Here's the code snippet:

  //
  // Release the connection to mongo - after all notifications have been sent (orionldAlterationsTreat takes care of that)
  //
  if (orionldState.mongoc.contextsP)       mongoc_collection_destroy(orionldState.mongoc.contextsP);
  if (orionldState.mongoc.entitiesP)       mongoc_collection_destroy(orionldState.mongoc.entitiesP);
  if (orionldState.mongoc.subscriptionsP)  mongoc_collection_destroy(orionldState.mongoc.subscriptionsP);
  if (orionldState.mongoc.registrationsP)  mongoc_collection_destroy(orionldState.mongoc.registrationsP);

I'll look into that. See if I can find the error. Would be so much easier if I could reproduce the error myself, but ...
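A double free in that cleanup would occur if it runs twice for the same request, destroying the same mongoc_collection_t handles again. As a purely illustrative sketch (hypothetical helper name, not the fix that was eventually applied), clearing the caller's pointer right after destroying the handle makes a second cleanup pass harmless:

#include <mongoc/mongoc.h>

// Hypothetical helper: destroy a collection handle exactly once.
// Clearing the caller's pointer means mongoc_collection_destroy can
// never be called a second time with the same, already-freed handle.
static void mongocCollectionRelease(mongoc_collection_t** collectionPP)
{
  if (*collectionPP != NULL)
  {
    mongoc_collection_destroy(*collectionPP);
    *collectionPP = NULL;
  }
}

// Usage in the cleanup above would then look like:
//   mongocCollectionRelease(&orionldState.mongoc.entitiesP);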

FR-ADDIX commented 10 months ago

Can you run multiple Orion-LD processes and send them against a MongoDB replica set? We have 3 Orion-LD instances writing against one MongoDB with 2 replicas. We currently have around 26,000 entities in the database, so perhaps that is another factor that causes the error to be triggered more quickly. If we write around 100 msg/sec against the CB, it runs stable for longer; if we increase the load to 1500 msg/sec, the crash can be reproduced relatively quickly.

kzangeli commented 10 months ago

ok, I can do that. Good input, thanks

DasNordlicht commented 10 months ago

We have now significantly reduced the number of entities and are currently looking at around 2000 vehicles. This has made Orion-LD more stable.

However, this cannot be the real solution, only a temporary workaround. The aim is to reach well over 100,000 entities.

kzangeli commented 10 months ago

Yeah, of course not. That's not the solution

ibordach commented 9 months ago

Are there any ideas on how we can help move this issue forward? The same crashes occur with 1.6.0-PRE-1561. This is already causing us real pain.

kzangeli commented 9 months ago

Yeah, so sorry, this issue is long overdue. I can only say I'm sorry. Our development team is tiny, to say the least, and these last few months I've been busy elsewhere, with various European projects and ETSI ISG CIM ...

That said, I took a quick look at this this morning and I've come up with a "desperate" attempt. Let's start there (I'll let you know when the docker image is ready) and hopefully I'll have some more time in the near future.

kzangeli commented 9 months ago

The PR https://github.com/FIWARE/context.Orion-LD/pull/1548 has been merged into develop and the docker image should reach dockerhub in 15-20 minutes. It's usually that fast.

Now, this is not to say that I've found the issue and fixed it. We'll see if the modification changes anything.

ibordach commented 9 months ago

:-) We will test it as soon as it is available.

kzangeli commented 9 months ago

1.6.0-PRE-1574

ibordach commented 9 months ago

up and running :-), we will check it

kzangeli commented 9 months ago

So, this "fix" is about MHD (3rd party lib for HTTP) calls the "request completed" callback more than once for one and the same connection. That would be one way for the error to appear. I kind of doubt it though.

Another possibility is that mongoc_client_get_collection returns the same mongoc_collection_t twice. Also hard to believe, but I just saw that `mongoc_client_t` is not thread-safe, and a client pointer is input to that function. So, a semaphore there, around getting collections, seems like a good idea. I'll implement that now and we'll test again once the first test fails :)
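As a minimal sketch of that idea (hypothetical names, not necessarily how it ended up being implemented in Orion-LD), a mutex acting as the semaphore is held only while the shared, non-thread-safe client is used to look up a collection handle:

#include <pthread.h>
#include <mongoc/mongoc.h>

// Sketch only: serialize access to the shared mongoc_client_t, which is not
// thread-safe, while fetching a per-request collection handle.
static pthread_mutex_t mongocClientMutex = PTHREAD_MUTEX_INITIALIZER;

static mongoc_collection_t* mongocCollectionLookup(mongoc_client_t* clientP, const char* dbName, const char* collectionName)
{
  mongoc_collection_t* collectionP;

  pthread_mutex_lock(&mongocClientMutex);
  collectionP = mongoc_client_get_collection(clientP, dbName, collectionName);
  pthread_mutex_unlock(&mongocClientMutex);

  return collectionP;
}

The handle returned by mongoc_client_get_collection belongs to the caller and can be used and destroyed outside the lock; only the lookup through the shared client needs to be serialized.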

kzangeli commented 9 months ago

So, I merged that second fix as well: a semaphore around getting the connection to mongo. The docker image will be on dockerhub as usual in about 15 minutes.

I'm far from convinced that any of this will fix the issue you're seeing, but ... it might ;)

ibordach commented 9 months ago

crash with 1574, now testing 1575

ibordach commented 9 months ago

A first cautious bit of feedback after roughly two and a half hours: so far we have not seen this type of error, and we are only running into this issue: https://github.com/FIWARE/context.Orion-LD/issues/1441

kzangeli commented 9 months ago

ok, after a week or so we might assume the semaphore fixed the problem.

While looking at this yesterday, I found an alternative to using that "thread-unsafe" mongoc_client_t. If this semaphore fixed the problem, then surely mongoc_client_t is the culprit. I'll try to replace all of those with the alternative. And remove the semaphore, as it is not really needed (only the usage of mongoc_client_t makes it necessary).
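For reference, the thread-safe alternative the mongoc driver itself offers is mongoc_client_pool_t, where each worker thread pops its own client and pushes it back when done, so no extra semaphore is needed. A minimal sketch of that pattern (the URI and database/collection names are example values, not the Orion-LD configuration):

#include <mongoc/mongoc.h>

static mongoc_client_pool_t* clientPoolP;

// Create the pool once at startup; all pool operations are thread-safe.
static void mongocPoolInit(const char* mongoUri)
{
  mongoc_uri_t* uriP = mongoc_uri_new(mongoUri);

  clientPoolP = mongoc_client_pool_new(uriP);
  mongoc_uri_destroy(uriP);
}

// Per request/thread: pop a client, use it, push it back.
static void requestTreat(void)
{
  mongoc_client_t*     clientP   = mongoc_client_pool_pop(clientPoolP);
  mongoc_collection_t* entitiesP = mongoc_client_get_collection(clientP, "orionld", "entities");

  // ... perform the mongo operations for this request ...

  mongoc_collection_destroy(entitiesP);
  mongoc_client_pool_push(clientPoolP, clientP);
}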

kzangeli commented 8 months ago

So, still no more crashes of that type?

I'll update #1441 right now.

ibordach commented 8 months ago

We only notice the FD_SETSIZE error since your semaphore fix. :-)

kzangeli commented 8 months ago

We only notice the FD_SETSIZE error since your semaphore fix. :-)

That is interesting indeed. The extra semaphore slightly changes the timing of the broker; it makes it a bit slower. Of course it means that the change to mongoc 1.24 won't make any difference, but that was just a shot in the dark anyway.

Now, with this piece of information it seems almost clear to me that you suffer from a lack of resources. It really seems that way. In the end it's the broker's fault; it should manage its resources and slow down incoming requests and things like that. I wish I had the resources to implement all that! But I don't, unfortunately.

My suggestion then is that you add resources to both the broker and to mongo, and up the max FD size to a million. See how things turn out. It doesn't seem like a leak of FDs; it seems like the broker is simply overloaded.

ibordach commented 8 months ago

I don't think the broker is overloaded. It consumes about 15% of a CPU core and about 100 MB of RAM. To address the FD problem we started 6 Orions. The MongoDB also has no resource issues; it consumes 50% of a CPU core and about 2 GB of RAM. And there are plenty of cores and RAM available.

The semaphore fix is not the starting point of the FD_SETSIZE error. We had that error before.

ibordach commented 8 months ago

The mongodb has about 2 * 300 open connections.

ibordach commented 8 months ago

Manual restart of the Orions in the morning and reduction to 5 instances (screenshots attached).

kzangeli commented 8 months ago

The semaphore fix is not the starting point of the FD_SETSIZE error. We had that error before.

So, this comment wasn't really correct then :)

We only notice the FD_SETSIZE error since your semaphore fix. :-)

ibordach commented 8 months ago

Yes, it is the last error left. :-)

kzangeli commented 8 months ago

Now, if mongo has 600 connections but you get to FDs over 1565, we need to find those other almost 1000 FDs ... Any info you find will help me understand where the error is.

That said, I'd up the FD MAX to a LARGE number and hope it's just normal. Slow mongo, for example. Frankly, I have no clue what is causing this problem. Very interested in fixing it though

kzangeli commented 8 months ago

Question no 1 is:

ibordach commented 8 months ago

We only get tons of this within a second, and after that a crash (a.k.a. restart):

New connection socket descriptor (1024) is not less than FD_SETSIZE (1024).
New connection socket descriptor (1024) is not less than FD_SETSIZE (1024).

#1441

kzangeli commented 8 months ago

Yeah, we need more input. We already removed the Prometheus thingy. We also tried without notifications (zero subscriptions). You don't use registrations (right?). The next step would be the new mongoc driver, and I'm on that; it just doesn't look too promising as a fix for this.

I'll try to find something more to "rule out".

Not easy, but, we'll get there

kzangeli commented 8 months ago

We finally solved the base image issue (not 100% but almost). There's a new image on dockerhub:
1.6.0-PRE-1587

That one uses the newer version 1.24.2 of mongoc that actually supports mongodb 7.0. So, let's try again and see what happens ...

Would be nice also to test with a less recent release of the mongo server. 4.4, 5.0, 6.0 ... Not that we're a bit desperate ... :)

ibordach commented 8 months ago

We have a cluster with mongodb 6 and one with mongodb 7. Both are now running with 1.6.0-PRE-1587. We will see. :-)

ibordach commented 8 months ago

The bandwidth of one orion-ld instance (screenshot attached): at 10:15 we started some more Orion instances, so the traffic is somewhat lower per instance.

The MongoDB replica set is under no stress. The number of connections increases slowly over time. Bandwidth usage is relaxed (screenshot attached).

kzangeli commented 8 months ago

So, not looking good then :( If we only knew what FD is growing !!!

ibordach commented 8 months ago

The instance consumes a little bit of RAM and then dies (screenshot attached).

kzangeli commented 8 months ago

Different crash then? Not too many open file descriptors?

ibordach commented 8 months ago

Same error FD_SETSIZE

kzangeli commented 8 months ago

I did a quick search on the error (should have done that ages ago ...). The "article" is old, I know, but it might still be interesting. Have a look at it and tell me what you think: https://www.mongodb.com/community/forums/t/mongodb-4-2-19-extremely-high-file-descriptor-counts-70k/170223

kzangeli commented 8 months ago

Ah, we're commenting in the wrong issue. Mea culpa.

We should be in #1441. I'll move my last comment there. This issue is about the crash that was presumably fixed with that semaphore I added. Can this issue be closed? (and we continue the discussion in #1441?)

ibordach commented 8 months ago

Yes, this can be closed.