"tcache 2" ? Not a clue what that might be ... I¡m going to need some more info here to have a chance to look into this issue. The usual stuff, how the broker was started, what requests are sent to the broker ...
This is very difficult to reproduce. The double-free crash always comes together with the "Falling back to malloc for counters" message from mongoc.
spec:
  containers:
  - args:
    - -dbhost
    - mongodb-headless.fiware.svc.cluster.local:27017/
    - -logLevel
    - DEBUG
    - -ctxTimeout
    - "10000"
    - -mongocOnly
    - -lmtmp
    - -reqPoolSize
    - "120"
    - -logSummary
    - "30"
    env:
    - name: ORIONLD_CORE_CONTEXT
      value: v1.0
    - name: ORIONLD_MONGO_REPLICA_SET
      value: rs0
    - name: ORIONLD_MONGO_USER
      value: root
    - name: ORIONLD_MONGO_TIMEOUT
      value: "5000"
    - name: ORIONLD_MONGO_POOL_SIZE
      value: "180"
    - name: ORIONLD_MONGO_ID_INDEX
      value: "TRUE"
    - name: ORIONLD_STAT_COUNTERS
      value: "TRUE"
    - name: ORIONLD_STAT_SEM_WAIT
      value: "TRUE"
    - name: ORIONLD_STAT_TIMING
      value: "TRUE"
    - name: ORIONLD_STAT_NOTIF_QUEUE
      value: "TRUE"
    - name: ORIONLD_SUBCACHE_IVAL
      value: "10"
    - name: ORIONLD_DISTRIBUTED
      value: "TRUE"
    - name: ORIONLD_CONTEXT_DOWNLOAD_TIMEOUT
      value: "10000"
    - name: ORIONLD_NOTIF_MODE
      value: threadpool:252:4
    - name: ORIONLD_DEBUG_CURL
      value: "TRUE"
It looks like a race condition under "high" load.
We have currently reduced our setup from three Orion-LDs to one, with no subscriptions. That Orion receives roughly 2000 messages per second and keeps dying with this double-free error. The same happens with 3 Orions or 6 Orions.
Are you using C++ and DynamicArray somewhere? "free(): double free detected in tcache 2" seems to be an error message if more elements are deleted from the array than are present.
> Are you using C++ and DynamicArray somewhere?
Not a fan of C++. I'm a C guy ... I'm replacing most of the old C++ code of Orion with pure C in Orion-LD.
Now, if you could start the broker inside GDB or Valgrind, we'd get a stack trace. That would be extremely helpful to understand where the problem comes from. Either that, or a way for me to reproduce the crash.
ok, we have a way to get the core files. We also started a pod with a GDB version of orion-ld. However, we have no debug info for gdb. How can we generate more information for you? I'm not a C guy. ;-)
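For reference, a symbol-resolved backtrace can be pulled from a core file roughly like this, assuming the pod has gdb and a non-stripped orionld binary; the binary path and core file name below are only illustrative:

gdb /usr/bin/orionld core.<pid>
(gdb) set pagination off
(gdb) bt
(gdb) thread apply all bt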
Orion: 1.4.0-PRE-1425-GDB
(gdb) bt
#0 0x00000000004bb2a4 in ?? ()
#1 0x00000000004974ad in ?? ()
#2 0x00000000007fa132 in ?? ()
#3 0x00000000007fb870 in ?? ()
#4 0x0000000000803454 in ?? ()
#5 0x0000000000804105 in ?? ()
#6 0x00007fc6a18541cf in start_thread () from /lib64/libpthread.so.0
#7 0x00007fc69cd36dd3 in clone () from /lib64/libc.so.6
(gdb)
Sorry for the image.
ok, getting closer. mongoc_collection_destroy called twice. That's already a good hint. However, everything in between is symbol-less, so it could be better. I think we already create debug images on dockerhub; it would just be a question of using one of those instead (with debug symbols, not stripped). This one for example: 1.5.0-PRE-1551-debug (or just any image whose name ends in "-debug").
Now, when you go from RELEASE (-O2) compilation to DEBUG (-g) compilation, the executable changes quite a lot. Often, bugs disappear or move elsewhere. Let's see if we're lucky!
now with 1.5.0-PRE-1551-debug:
(gdb) bt
#0 0x00007f4cb1ed0ee1 in abort () from /lib64/libc.so.6
#1 0x00007f4cb1f40037 in __libc_message () from /lib64/libc.so.6
#2 0x00007f4cb1f4719c in malloc_printerr () from /lib64/libc.so.6
#3 0x00007f4cb1f48f0d in _int_free () from /lib64/libc.so.6
#4 0x00007f4cb2e1623e in mongoc_collection_destroy (collection=0x7f4c3000e250) at /cmake-3.14.5/mongo-c-driver-1.22.0/src/libmongoc/src/mongoc/mongoc-collection.c:246
#5 0x000000000048fa0d in requestCompleted(void*, MHD_Connection*, void**, MHD_RequestTerminationCode) ()
#6 0x0000000000807622 in MHD_connection_close_ (connection=connection@entry=0x7f4c30000d60, termination_code=<optimized out>) at connection.c:842
#7 0x0000000000808949 in connection_reset (reuse=false, connection=0x7f4c30000d60) at connection.c:4328
#8 MHD_connection_handle_idle (connection=connection@entry=0x7f4c30000d60) at connection.c:4832
#9 0x000000000080bbb9 in call_handlers (con=con@entry=0x7f4c30000d60, read_ready=<optimized out>, write_ready=<optimized out>, force_close=<optimized out>) at daemon.c:1234
#10 0x0000000000810a48 in MHD_epoll (daemon=0x2245f90, millisec=<optimized out>) at daemon.c:5149
#11 0x0000000000811695 in MHD_polling_thread (cls=0x2245f90) at daemon.c:5386
#12 0x00007f4cb6a011cf in start_thread () from /lib64/libpthread.so.0
#13 0x00007f4cb1ee8dd3 in clone () from /lib64/libc.so.6
(gdb)
ok! Here's the code snippet:
//
// Release the connection to mongo - after all notifications have been sent (orionldAlterationsTreat takes care of that)
//
if (orionldState.mongoc.contextsP) mongoc_collection_destroy(orionldState.mongoc.contextsP);
if (orionldState.mongoc.entitiesP) mongoc_collection_destroy(orionldState.mongoc.entitiesP);
if (orionldState.mongoc.subscriptionsP) mongoc_collection_destroy(orionldState.mongoc.subscriptionsP);
if (orionldState.mongoc.registrationsP) mongoc_collection_destroy(orionldState.mongoc.registrationsP);
I'll look into that. See if I can find the error. Would be so much easier if I could reproduce the error myself, but ...
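For reference, a minimal defensive sketch (not the actual fix): if the completion callback really can run twice for the same connection, clearing each pointer right after destroying it turns the second pass into a no-op. Shown here for one of the four collection pointers; the other three would be handled the same way.

if (orionldState.mongoc.entitiesP != NULL)
{
  mongoc_collection_destroy(orionldState.mongoc.entitiesP);
  orionldState.mongoc.entitiesP = NULL;  // a repeated invocation now finds NULL and skips the destroy
}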
Can you run multiple Orion-LD processes and send them against a MongoDB replica set? We have 3 Orion-LDs writing against one MongoDB with 2 replicas. We currently have around 26,000 entities in the database, so perhaps that is another factor that causes the error to be triggered more quickly. If we write around 100 msg/sec against the CB, it runs stably for longer; if we increase the load to 1500 msg/sec, the crash can be reproduced relatively quickly.
ok, I can do that. Good input, thanks
We have now significantly reduced the number of entities and are currently looking at around 2000 vehicles. This has made Orion-LD more stable.
However, this is only a temporary workaround, not a solution. The aim is to reach well over 100,000 entities.
Yeah, of course not. That's not the solution
Are there any ways we can help with this issue? We see the same crashes with 1.6.0-PRE-1561. This is already causing us real pain.
Yeah, so sorry, this issue is long overdue. I can only say I'm sorry. Our development team is tiny, to say the least, and these last few months I've been busy elsewhere, with various European projects and ETSI ISG CIM ...
That said, I took a quick look at this this morning and have come up with a "desperate" attempt. Let's start there (I'll let you know when the docker image is ready) and hopefully I'll have some more time in the near future.
The PR https://github.com/FIWARE/context.Orion-LD/pull/1548 has been merged into develop and the docker image should reach dockerhub in 15-20 minutes. It's usually that fast.
Now, it's not that I've found the issue and fixed it. We'll see if the modification changes anything.
:-) We will test it as soon as it is available.
up and running :-), we will check it
So, this "fix" is about MHD (3rd party lib for HTTP) calls the "request completed" callback more than once for one and the same connection. That would be one way for the error to appear. I kind of doubt it though.
Another possibility is that mongoc_client_get_collection returns the same mongoc_collection_t twice.
Also hard to believe, but I just saw that mongoc_client_t is not thread-safe, and a client pointer is input to that function.
So, a semaphore there, around getting collections, seems like a good idea.
I'll implement that now and we'll test again once the first test fails :)
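A rough sketch of what such a guard around getting collections could look like; the mutex and function names here are illustrative, not the actual Orion-LD code:

#include <pthread.h>
#include <mongoc/mongoc.h>

static pthread_mutex_t collectionMutex = PTHREAD_MUTEX_INITIALIZER;

// Serialize access to the shared mongoc_client_t, which is not thread-safe
mongoc_collection_t* collectionGet(mongoc_client_t* clientP, const char* db, const char* coll)
{
  pthread_mutex_lock(&collectionMutex);
  mongoc_collection_t* collectionP = mongoc_client_get_collection(clientP, db, coll);
  pthread_mutex_unlock(&collectionMutex);

  return collectionP;
}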
So, I merged that second fix as well: a semaphore around getting the connection to mongo. The docker image will be there, as usual, in about 15 minutes.
I'm far from convinced that any of this will fix the issue you're seeing, but ... it might ;)
crash with 1574, now testing 1575
A first cautious bit of feedback after roughly two and a half hours: so far we have not seen this type of error, and we are only running into this issue: https://github.com/FIWARE/context.Orion-LD/issues/1441
ok, after a week or so we might assume the semaphore fixed the problem.
While looking at this yesterday, I found an alternative to using that "thread unsafe" mongoc_client_t. If this semaphore fixed the problem, then surely mongoc_client_t is the culprit. I'll try to replace all of those with the alternative. And remove the semaphore, as it is not really needed (only the usage of mongoc_client_t makes it necessary).
So, still no more crashes of that type?
I'll update #1441 right now.
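For reference, the thread-safe pattern documented by the mongoc driver itself is to give each thread its own client from a mongoc_client_pool_t instead of sharing one mongoc_client_t. A minimal sketch, with illustrative function, URI, database, and collection names, and error handling and cleanup omitted:

#include <mongoc/mongoc.h>

static mongoc_client_pool_t* poolP;  // one pool per process, created at startup

void mongocPoolInit(const char* uriString)
{
  mongoc_init();
  mongoc_uri_t* uriP = mongoc_uri_new(uriString);  // shutdown cleanup omitted for brevity
  poolP = mongoc_client_pool_new(uriP);
}

void requestHandler(void)
{
  // Each worker thread pops its own client; mongoc_client_pool_t is thread-safe,
  // a bare mongoc_client_t is not
  mongoc_client_t*     clientP   = mongoc_client_pool_pop(poolP);
  mongoc_collection_t* entitiesP = mongoc_client_get_collection(clientP, "orionld", "entities");

  // ... run the queries for this request ...

  mongoc_collection_destroy(entitiesP);
  mongoc_client_pool_push(poolP, clientP);
}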
We only notice the FD_SETSIZE error since your semaphore fix. :-)
> We only notice the FD_SETSIZE error since your semaphore fix. :-)

That is interesting indeed. The extra semaphore slightly changes the timing of the broker; it makes it a bit slower. Of course it means that the change to mongoc 1.24 won't make any difference, but that was just a shot in the dark anyway.
Now, with this piece of information it seems almost clear to me that you suffer from a lack of resources. It really seems that way. In the end it's the broker's fault; it should manage its resources, slow down incoming requests, and things like that. I wish I had the resources to implement all that! But I don't, unfortunately.
My suggestion then is that you add resources, to both the broker and to mongo, and up the max FD size to a million. See how things turn out. It doesn't seem like an FD leak; it seems like the broker is too overloaded.
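Raising the per-process FD limit can be done with ulimit -n in the container or, programmatically, roughly as sketched below (the function name is illustrative). One caveat: the FD_SETSIZE in the logged MHD message is a compile-time constant of 1024 used by select(), so raising the rlimit alone may not make that particular message go away.

#include <sys/resource.h>

// Raise the soft file-descriptor limit to the hard limit for this process.
// In Kubernetes the hard limit itself comes from the container runtime / ulimits.
static void fdLimitRaise(void)
{
  struct rlimit rl;

  if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
  {
    rl.rlim_cur = rl.rlim_max;
    setrlimit(RLIMIT_NOFILE, &rl);
  }
}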
I don't think the broker is too overloaded. It consumes about 15% of a CPU core and about 100 MB of RAM. To address the FD problem we started 6 Orions. The MongoDB also has no resource issues: it consumes 50% of a CPU core and about 2 GB of RAM. And there are plenty of cores and RAM available.
The semaphore fix is not the starting point of the FD_SETSIZE error. We had that error before.
The mongodb has about 2 * 300 open connections.
Manual restart of the Orions in the morning and reduction to 5 instances:
> The semaphore fix is not the starting point of the FD_SETSIZE error. We had that error before.
So, this comment wasn't really correct then :)
> We only notice the FD_SETSIZE error since your semaphore fix. :-)
Yes, it is the last error left. :-)
Now, if mongo has 600 connections but you get to FDs over 1565, we need to find those other almost 1000 FDs ... Any info you find will help me understand where the error is.
That said, I'd up the FD MAX to a LARGE number and hope it's just normal. Slow mongo, for example. Frankly, I have no clue what is causing this problem. Very interested in fixing it, though.
Question no 1 is:
We just get tons of this within a second, and after that a crash (i.e., a restart):
New connection socket descriptor (1024) is not less than FD_SETSIZE (1024).
New connection socket descriptor (1024) is not less than FD_SETSIZE (1024).
Yeah, we need more input. We already removed the Prometheus thingy. We also tried without notifications (zero subscriptions). You don't use registrations, right? The next step would be the new mongoc driver, and I'm on that; it just doesn't look too promising for fixing this.
I'll try to find something more to "rule out".
Not easy, but we'll get there.
We finally solved the base image issue (not 100% but almost).
There's a new image on dockerhub:
1.6.0-PRE-1587
That one uses the newer version 1.24.2 of mongoc, which actually supports MongoDB 7.0. So, let's try again and see what happens ...
Would be nice also to test with a less recent release of the mongo server. 4.4, 5.0, 6.0 ... Not that we're a bit desperate ... :)
We have a cluster with MongoDB 6 and one with MongoDB 7. Both are now running with 1.6.0-PRE-1587. We will see. :-)
The bandwidth of one orion-ld instance: at 10:15 we started some more Orion instances, so the traffic is somewhat lower per instance.
The MongoDB replica set is under no stress. The number of connections increases slowly over time. Bandwidth usage is relaxed.
So, not looking good then :( If we only knew what FD is growing !!!
The instance consumes a little bit of RAM and then dies.
Different crash then? Not too many open file descriptors?
Same error: FD_SETSIZE.
I did a quick search on the error (should have done that ages ago ...). The "article" is old, I know, but it might still be interesting. Have a look at it and tell me what you think: https://www.mongodb.com/community/forums/t/mongodb-4-2-19-extremely-high-file-descriptor-counts-70k/170223
Ah, we're commenting in the wrong issue. Mea culpa.
We should be in #1441. I'll move my last comment there. This issue is about the crash that was presumably fixed with that semaphore I added. Can this issue be closed? (And we continue the discussion in #1441?)
Yes, this can be closed.
Orion-LD v1.4.0, Kubernetes v1.26.9