SophistSolutions / WhyTheFuckIsMyNetworkSoSlow

WhyTheFuckIsMyNetworkSoSlow is a network performance analysis tool, making it easy to quickly examine a network and see what is wrong, and to evaluate longer term trends.

On Linux (so far only seen on release builds) we sometimes get Device or resource busy storm and then app (get devices at least) hangs #86

Closed LewisPringle closed 1 year ago

LewisPringle commented 1 year ago

Mar 09 10:00:05 Hercules WhyTheFuckIsMyNetworkSoSlow[1933441]: Database operation exception: Exception: Device or resource busy (lastsql 'INSERT INTO DeviceUserSettings (DeviceID, _otherfields) VALUES ('02cb6177-14e0-8aa8-7041-1b6dbec7…')

LewisPringle commented 1 year ago

SQLITE_OPEN_NOMUTEX (priority: try next). Verify this is the right mode and that we handle it correctly despite the docs; maybe better to use SQLITE_OPEN_FULLMUTEX (changed/testing as of 2022-03-02).

That didn't seem to help; as of 2023-03-09, at home:

Mar 09 10:13:20 Hercules WhyTheFuckIsMyNetworkSoSlow[1933441]: Database operation exception: Exception: Device or resource busy (lastsql 'INSERT INTO DeviceUserSettings (DeviceID, _otherfields) VALUES ('57d641a7-2976-bae4-f50c-0c2b04f9…')
Mar 09 10:13:50 Hercules WhyTheFuckIsMyNetworkSoSlow[1933441]: Database operation exception: Exception: Device or resource busy (lastsql 'INSERT INTO DeviceUserSettings (DeviceID, _otherfields) VALUES ('918dc28c-e3eb-bd54-ea19-379a01bb…')
Mar 09 10:14:20 Hercules WhyTheFuckIsMyNetworkSoSlow[1933441]: Database operation exception: Exception: Device or resource busy (lastsql 'INSERT INTO DeviceUserSettings (DeviceID, _otherfields) VALUES ('e66b958f-076d-15de-0baa-6560

So I must try something else. Maybe, just as a test, put my own mutex around the database operations and see whether it runs stably with that.
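A minimal sketch of what such a test could look like, assuming nothing about the real DBAccess code: `SerializedDBAccess` and `CountWithSerializedAccess` are hypothetical names invented here, and the "INSERT" is just a plain counter increment standing in for a database write. The idea is only that every operation is funneled through one application-level mutex, independent of whatever locking SQLite does internally.

```cpp
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper (not part of WhyTheFuckIsMyNetworkSoSlow or Stroika):
// runs every database operation while holding one application-level mutex.
class SerializedDBAccess {
public:
    template <typename F>
    auto Run (F&& operation) -> decltype (operation ())
    {
        std::lock_guard<std::mutex> lock{fMutex_};
        return std::forward<F> (operation) ();
    }

private:
    std::mutex fMutex_;
};

// Demo: several threads bump a plain (non-atomic) counter that stands in
// for an INSERT. The total is only reliable because of the mutex in Run ().
int CountWithSerializedAccess (int threadCount, int insertsPerThread)
{
    SerializedDBAccess       db;
    int                      rowsInserted = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back ([&] {
            for (int i = 0; i < insertsPerThread; ++i) {
                db.Run ([&] { ++rowsInserted; });
            }
        });
    }
    for (auto& w : workers) {
        w.join ();
    }
    return rowsInserted;
}
```

If the busy errors disappear with a wrapper like this in place, that would point at concurrent access to the connection; if they persist, the cause is probably elsewhere.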

LewisPringle commented 1 year ago

Tried running the TSan build: no errors after a couple of days. Tried running the ASan/UBSan build: it crashed every few days, but without a clear error. The crash was somewhat consistent, roughly when returning the thread to the pool after a getDevice WSAPI call, but the errors did not make the cause clear.

LewisPringle commented 1 year ago

Theory: DBAccess::Mgr, though documented as internally synchronized, somehow may not be properly synchronized.

LewisPringle commented 1 year ago

Tried running the Linux debug build under the debugger and got a crash in:

0x000055556c0b7d85 in Stroika::Foundation::Containers::DataStructures::LinkedList<std::shared_ptr >::RemoveAt (this=0x629002483ad8, i=...) at ../../Foundation/Characters/../Containers/Factory/../Concrete/../DataStructures/LinkedList.inl:275
275 for (prevLink = this->fHead; prevLink->fNext != victim; prevLink = prevLink->fNext) {
(gdb) bt
#0 0x000055556c0b7d85 in Stroika::Foundation::Containers::DataStructures::LinkedList<std::shared_ptr >::RemoveAt (this=0x629002483ad8, i=...)
    at ../../Foundation/Characters/../Containers/Factory/../Concrete/../DataStructures/LinkedList.inl:275
#1 0x000055556c0a3186 in Stroika::Foundation::Containers::Concrete::Collection_LinkedList<std::shared_ptr >::Rep::Remove (this=0x629002483ad0, i=..., nextI=0x0)
    at ../../Foundation/Characters/../Memory/../Containers/Adapters/../Factory/../Concrete/Collection_LinkedList.inl:122
#2 0x000055556bfed79b in Stroika::Foundation::Containers::Collection<std::shared_ptr >::Remove<std::equal_to<std::shared_ptr > > (this=0x624001f8fc08, item=..., equalsComparer=...)
    at ../../Foundation/Characters/../Memory/../Containers/Adapters/../Collection.inl:125
#3 0x000055556bfc951a in operator() (__closure=0x603000280fc0) at ConnectionManager.cpp:275
#4 0x000055556bfdae49 in std::__invoke_impl<void, Stroika::Frameworks::WebServer::ConnectionManager::WaitForReadyConnectionLoop()::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...)
    at /usr/include/c++/10/bits/invoke.h:60
#5 0x000055556bfd8221 in std::__invoke_r<void, Stroika::Frameworks::WebServer::ConnectionManager::WaitForReadyConnectionLoop()::<lambda()>&>(struct {...} &) (__fn=...)
    at /usr/include/c++/10/bits/invoke.h:110
#6 0x000055556bfd4d71 in std::_Function_handler<void(), Stroika::Frameworks::WebServer::ConnectionManager::WaitForReadyConnectionLoop()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/10/bits/std_function.h:291
--Type <RET> for more, q to quit, c to continue without paging--

So halting the Stroika release and trying to debug. This seems to be EITHER a bug in the linked list code (hard to believe) or in the synchronized locking code (hard to see). Or something really subtle.
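For context on why the loop at LinkedList.inl:275 can crash: as quoted in the debugger output, it walks forward until it finds the link whose successor is the victim, with no null check. A walk like that relies on the victim really being in this list. The sketch below is a toy singly linked list, not Stroika's LinkedList; `Link` and `RemoveLink` are hypothetical names. It adds the null check so that a victim from some other list is reported instead of walking off the end.

```cpp
// Toy singly linked list node (hypothetical; not Stroika's type).
struct Link {
    int   value;
    Link* next;
};

// Unlinks victim from the list starting at head.
// Returns true if victim was found and unlinked, false if victim is not in
// this list (the case where an unchecked predecessor walk would hit nullptr).
bool RemoveLink (Link*& head, Link* victim)
{
    if (head == nullptr || victim == nullptr) {
        return false;
    }
    if (head == victim) {
        head = victim->next;
        return true;
    }
    Link* prev = head;
    while (prev->next != nullptr && prev->next != victim) {
        prev = prev->next;
    }
    if (prev->next == nullptr) {
        return false; // reached the end: victim belongs to some other list
    }
    prev->next = victim->next;
    return true;
}
```

Under this reading, a crash in the predecessor walk does not have to mean the list code is wrong; it can also mean the caller handed it an iterator that refers to a different list.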

LewisPringle commented 1 year ago

Interesting notes from the crash:

Time until crash: 216322 seconds

The crash happened while removing 0x6230001ae910: [0015][216321.217] ***ConnectionMgr: REMOVING readyConnection=0x6230001ae910 from fActiveConnections_

Running at about the same time (just after): [0020][216321.696] ***ConnectionMgr: REMOVING readyConnection=0x62300021b110 from fActiveConnections_. Looking in the debugger, though, the two are not interfering; it really is locked.

Nothing obvious.

Trying Collection_Array instead of Collection_LinkedList to see if it has any effect.

LewisPringle commented 1 year ago

Strong theory:

template <typename T>
template <typename EQUALS_COMPARER>
inline void Collection<T>::Remove (ArgByValueType<value_type> item, const EQUALS_COMPARER& equalsComparer)
{
    // Tried moving the rep accessor, but it may also be good to use:
    //         auto [writerRep, patchedIterator] = _GetWritableRepAndPatchAssociatedIterator (i);

    // Critical: document the need to call _GetWritableRepAndPatchAssociatedIterator
    // instead of repAccessor._GetWriteableRep

    _SafeReadWriteRepAccessor<_IRep> repAccessor {this};
    auto i = this->Find (item, equalsComparer);
    Require (i != this->end ()); // use remove-if if the item might not exist
    repAccessor._GetWriteableRep ().Remove (i, nullptr);
}

LewisPringle commented 1 year ago

TODO: search for any use of an iterator in a call made after constructing a _SafeReadWriteRepAccessor.
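One way to picture the suspected pattern is a toy copy-on-write container. This is only a sketch and assumes the problem is an iterator obtained from a shared rep being used after the writable rep has been cloned. `CowCollection`, `Rep`, `Iter`, `GetWritableRep`, and `GetWritableRepAndPatch` are all hypothetical stand-ins, not Stroika's classes or its actual `_GetWritableRepAndPatchAssociatedIterator`.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy types, NOT Stroika's real classes.
struct Rep {
    std::vector<int> items;
};

struct Iter {
    const Rep*  owner; // the rep this iterator was created from
    std::size_t index;
};

class CowCollection {
public:
    explicit CowCollection (std::vector<int> v)
        : fRep_{std::make_shared<Rep> (Rep{std::move (v)})}
    {
    }

    Iter Find (int value) const
    {
        for (std::size_t i = 0; i < fRep_->items.size (); ++i) {
            if (fRep_->items[i] == value) {
                return Iter{fRep_.get (), i};
            }
        }
        return Iter{fRep_.get (), fRep_->items.size ()};
    }

    // Clones the rep if it is shared, WITHOUT touching existing iterators.
    Rep& GetWritableRep ()
    {
        if (fRep_.use_count () > 1) {
            fRep_ = std::make_shared<Rep> (*fRep_);
        }
        return *fRep_;
    }

    // Clones if shared, then re-targets the iterator at the rep being modified.
    Rep& GetWritableRepAndPatch (Iter& i)
    {
        Rep& r  = GetWritableRep ();
        i.owner = &r;
        return r;
    }

private:
    std::shared_ptr<Rep> fRep_;
};

// Returns true if the iterator refers to the rep that is about to be modified.
bool IteratorMatchesWritableRep (bool patch)
{
    CowCollection a{std::vector<int>{1, 2, 3}};
    CowCollection b = a; // b shares a's rep until one of them writes
    Iter          i = a.Find (2);
    Rep&          w = patch ? a.GetWritableRepAndPatch (i) : a.GetWritableRep ();
    return i.owner == &w;
}
```

In this toy model, `IteratorMatchesWritableRep (false)` is false: the iterator still points into the old rep (now owned only by `b`) while the removal would run against the clone. With patching it is true. Passing the unpatched iterator to a list removal is exactly the case where the victim is not in the list being walked.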

LewisPringle commented 1 year ago

I think this is fixed in Stroika 2.1.13; testing.

LewisPringle commented 1 year ago

Fixed in the latest Stroika, 2.1.13.