fd leak when accessing an xrootd/s3 origin from certain caches

turetske commented 1 month ago

When using certain caches to access an xrootd origin, a large number of fds appear to be created and leaked. The count can climb as high as 3000, for example.

There also seem to be a large number of threads, at least judging by the pstack output. They don't go away once the accesses from the cache stop; instead they appear to stay stuck until the pod becomes overloaded and has to restart itself. So repeated calls via one of these caches just keep increasing the total number of threads.

Attached is a file showing a stack of those threads.

Note that this only seems to happen with certain caches. For example, it always occurs with https://dtn-pas.denv.nrp.internet2.edu:8443, but never occurs with the sdsc or osg-chicago caches or with any accesses that read directly from the origin without going through one of these problematic caches.

stalled-threads.pdf

The objects also need to be sufficiently large for this to occur. Not sure what the threshold is, but these caches don't cause the issue with small files.

bbockelm commented 1 month ago

How certain are we that there's a FD leak?
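One way to check is to sample the fd and thread counts of the origin process before the load, and again some time after the cache accesses have stopped. The sketch below assumes a Linux host with /proc available; the helper names and the 10% tolerance are hypothetical choices for illustration, not part of xrootd or Pelican.

```python
import os


def count_fds(pid):
    """Count open file descriptors by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def count_threads(pid):
    """Count threads by listing /proc/<pid>/task (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/task"))


def still_elevated(baseline, after_idle, tolerance=0.10):
    """Hypothetical heuristic: True if the count sampled after an idle
    period is still more than `tolerance` above the pre-load baseline."""
    return after_idle > baseline * (1 + tolerance)
```

If the counts fall back toward the baseline once the load is gone, that points at an overloaded server rather than a leak; counts that stay elevated through a long idle period would be worth a closer look.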

Looking over the stack traces, I see nothing that's obviously deadlocked. Lots of things are waiting on locks -- and in the middle of being processed.

I see the origin is set to a single core, which suggests that there's far more load than the single server can handle. We could consider increasing the core count (there's no particular need to be stingy here) or adding a throttle to the origin.
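To illustrate what a throttle means here, the following is a minimal sketch of a concurrency limit that rejects work once a fixed number of requests are in flight. It is a generic pattern only: `ConcurrencyThrottle`, `handle_request`, and the "busy" return value are hypothetical names, and this is not the configuration or API of any xrootd or Pelican component.

```python
import threading


class ConcurrencyThrottle:
    """Allow at most `max_in_flight` requests to be served at once."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self):
        # Non-blocking: returns False immediately when no slot is free.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def handle_request(throttle, serve):
    """Run `serve()` if a slot is free; otherwise report the server as busy."""
    if not throttle.try_acquire():
        return "busy"
    try:
        return serve()
    finally:
        # Always free the slot, even if serve() raises.
        throttle.release()
```

Rejecting excess requests early keeps them from piling up as waiting threads on a server that cannot keep pace.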

jhiemstrawisc commented 1 month ago

With the recent slate of other bugs we've found that might have explained this behavior, can we close this issue?

bbockelm commented 1 month ago

Yes.

FWIW -- there was never any resource leak here. However, there was an XCache bug that put immense, unnecessary load on the origin -- and the interaction with libcurl was incredibly inefficient (in testing, we couldn't find a server large enough to efficiently handle the load from a cache).

So, the "leaks" were never leaks but simply wildly overwhelmed origins.

PelicanPlatform / xrootd-s3-http, issue #46