Open mdo6180 opened 1 month ago
I believe the issue is due to a starvation condition where the @resource_accessor decorator is taking too long to acquire the lock (BaseResourceNode.resource_lock).
According to this answer on Stack Overflow, the solution might be to add a tiny sleep, time.sleep(0.001), prior to acquiring the lock in order to force a task switch away from the greedy thread (I believe the greedy thread here is the one monitoring the filesystem directory), so that the thread calling the @resource_accessor decorator can acquire the lock.
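A minimal sketch of what that suggestion could look like. The class body and the DemoNode subclass are hypothetical stand-ins, not the project's actual code; only the names BaseResourceNode, resource_lock, resource_accessor, and get_artifact come from this issue, and the lock type is assumed to be a plain threading.Lock.

```python
import functools
import threading
import time


class BaseResourceNode:
    """Hypothetical stand-in for the real class; only the lock is modeled."""

    def __init__(self):
        # Assumption: the real resource_lock may be a different lock type.
        self.resource_lock = threading.Lock()

    @staticmethod
    def resource_accessor(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            # Sleep briefly before asking for the lock so that the
            # interpreter can switch to another thread first.
            time.sleep(0.001)
            with self.resource_lock:
                return func(self, *args, **kwargs)
        return wrapper


class DemoNode(BaseResourceNode):
    """Hypothetical subclass used only to exercise the decorator."""

    @BaseResourceNode.resource_accessor
    def get_artifact(self, id: int) -> str:
        return f"artifact-{id}"
```

Note that this only delays the caller of the decorated method. It does nothing for a thread that takes the lock without going through the decorator.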
Removing the @resource_accessor decorator from FilesystemStoreNode.get_artifact() resolved the starvation issue. However, this might not be a good fix, because I would still like to have some sort of locking mechanism to ensure proper resource access and synchronization.
It appears that adding time.sleep(0.1) in the _monitor_thread_func in the FilesystemStoreNode.start_monitoring method helped to resolve the issue. I think what happened was that _monitor_thread_func was a pretty "busy" function, i.e., it was constantly acquiring the lock and using CPU cycles to call os.listdir() to check which files are currently available in the directory, which is why none of the other threads could execute.
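For illustration, a polling loop along these lines might look as follows. This is a sketch under assumptions, not the actual _monitor_thread_func: the function name monitor_directory and its parameters (seen, stop_event, interval) are hypothetical. The key point is that the sleep happens after the lock is released.

```python
import os
import tempfile
import threading
import time


def monitor_directory(path, lock, seen, stop_event, interval=0.1):
    """Poll `path` until `stop_event` is set, sleeping between polls."""
    while not stop_event.is_set():
        with lock:
            # Hold the lock only while reading the directory listing.
            seen.update(os.listdir(path))
        # Sleep outside the lock so other threads get a chance to acquire it.
        time.sleep(interval)


if __name__ == "__main__":
    directory = tempfile.mkdtemp()
    lock, seen, stop_event = threading.Lock(), set(), threading.Event()
    thread = threading.Thread(
        target=monitor_directory,
        args=(directory, lock, seen, stop_event),
        daemon=True,
    )
    thread.start()
    time.sleep(0.3)
    stop_event.set()
    thread.join()
```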
It also explains why making changes in the @BaseResourceNode.resource_accessor decorator did not help at all: the decorator was only placed on the start_monitoring method, whereas it should have been placed on the _monitor_thread_func. The start_monitoring method didn't really do anything other than create a thread to execute the _monitor_thread_func function defined inside of it. Thus, placing the decorator on start_monitoring meant that the decorator would only execute once, at the beginning when start_monitoring is called, and so it had no effect in forcing the thread running _monitor_thread_func to release the lock.
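The effect can be reproduced in a small, self-contained sketch. Everything here is a simplified assumption: the Node class, the acquisitions list, and the five-iteration loop are hypothetical, and the decorator is a bare-bones version that only takes the lock.

```python
import functools
import threading

acquisitions = []  # records which decorated method took the lock


def resource_accessor(func):
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        with self.resource_lock:
            acquisitions.append(func.__name__)
            return func(self, *args, **kwargs)
    return wrapper


class Node:
    def __init__(self):
        self.resource_lock = threading.Lock()
        self.iterations = 0

    @resource_accessor
    def start_monitoring(self):
        def _monitor_thread_func():
            # The loop body never passes through the decorator.
            for _ in range(5):
                self.iterations += 1

        thread = threading.Thread(target=_monitor_thread_func)
        thread.start()
        return thread


node = Node()
node.start_monitoring().join()
```

After this runs, acquisitions holds a single "start_monitoring" entry even though the loop ran five times, so anything added to the decorator only executes once per call to start_monitoring.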
Although the issue is resolved, I think it's still important to think about a way to abstract the _monitor_thread_func away from the user, i.e., perhaps create the loop for them, so that we can call sleep automatically for them and also start the thread for them.
_monitor_thread_func and the start_monitoring method.
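One possible shape for that abstraction, sketched under assumptions: the user overrides a single hook, and the base class owns the loop, the pause, and the thread. The names MonitoringNode, check_resource, poll_interval, stop_monitoring, and CountingNode are all hypothetical and not part of the existing code.

```python
import threading


class MonitoringNode:
    """Hypothetical base class: owns the thread, the loop, and the pause."""

    def __init__(self, poll_interval=0.1):
        self.resource_lock = threading.Lock()
        self.poll_interval = poll_interval
        self._stop_event = threading.Event()
        self._thread = None

    def check_resource(self):
        """User hook: one polling step, called with the lock held."""
        raise NotImplementedError

    def _monitor_loop(self):
        while not self._stop_event.is_set():
            with self.resource_lock:
                self.check_resource()
            # Pause outside the lock; waiting on the event (rather than
            # time.sleep) lets stop_monitoring() wake the loop early.
            self._stop_event.wait(self.poll_interval)

    def start_monitoring(self):
        self._thread = threading.Thread(target=self._monitor_loop, daemon=True)
        self._thread.start()

    def stop_monitoring(self):
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()


class CountingNode(MonitoringNode):
    """Example user subclass: only the polling step is written by the user."""

    def __init__(self):
        super().__init__(poll_interval=0.01)
        self.polls = 0
        self.polled = threading.Event()

    def check_resource(self):
        self.polls += 1
        self.polled.set()
```

With this layout the user never writes a while loop or a sleep call, so forgetting the pause is no longer possible.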
The /retrieve_file endpoint is taking too long to execute.
The issue is in the FilesystemStoreNode.get_artifact(self, id: int) method.
The call to that method in line 124 of FilesystemStoreNodeApp is taking too long.
[x] Use cProfile and SnakeViz to see where the bottleneck is.
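A minimal sketch of that profiling step using the standard-library cProfile and pstats modules. The function slow_get_artifact is a hypothetical stand-in for the slow call, and profile_call and the stats_path parameter are illustrative names only.

```python
import cProfile
import io
import pstats


def slow_get_artifact(n=200_000):
    """Hypothetical stand-in for the slow call being investigated."""
    return sum(i * i for i in range(n))


def profile_call(func, *args, stats_path=None, **kwargs):
    """Run `func` under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    if stats_path is not None:
        # Save raw stats so they can be opened in a viewer later.
        profiler.dump_stats(stats_path)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


if __name__ == "__main__":
    _, report = profile_call(slow_get_artifact, stats_path="get_artifact.prof")
    print(report)
```

The saved file can then be opened with SnakeViz from the command line, e.g. snakeviz get_artifact.prof.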