anacostiaAI / anacostia-pipeline

Anacostia is a framework for creating machine learning operations (MLOps) pipelines
Apache License 2.0

Fix issue with /retrieve_file endpoint #20

Open mdo6180 opened 1 month ago

mdo6180 commented 1 month ago

I believe the issue is due to a starvation condition where the @resource_accessor decorator is taking too long to acquire the lock (BaseResourceNode.resource_lock).

mdo6180 commented 1 month ago

According to this answer on Stack Overflow, the solution might be to add a tiny sleep, time.sleep(0.001), before acquiring the lock. That should force a task switch away from the greedy thread (I believe the greedy thread here is the one monitoring the filesystem directory) so that the thread calling a method wrapped by the @resource_accessor decorator can acquire the lock.
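A minimal sketch of what that suggestion could look like. The decorator body and the DemoResourceNode class are assumptions for illustration only; the real BaseResourceNode.resource_accessor may be implemented differently.

```python
import functools
import threading
import time


def resource_accessor(func):
    """Hypothetical stand-in for the project's decorator: yield briefly, then take the lock."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        # Tiny sleep so other threads get a chance to run before we contend for the lock.
        time.sleep(0.001)
        with self.resource_lock:
            return func(self, *args, **kwargs)
    return wrapper


class DemoResourceNode:
    """Illustrative node, not the real BaseResourceNode."""

    def __init__(self):
        self.resource_lock = threading.RLock()
        self.artifacts = ["a.txt"]

    @resource_accessor
    def get_artifact(self, name):
        # Runs only while resource_lock is held by the calling thread.
        return name if name in self.artifacts else None
```

Note that the sleep only affects the threads that go through the decorator. A thread that grabs the lock in its own loop without the decorator never yields this way.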

mdo6180 commented 1 month ago

Removing the @resource_accessor decorator from FilesystemStoreNode.get_artifact() resolved the starvation issue. However, this might not be a good fix, because I would still like to have some sort of locking mechanism to ensure proper resource access and synchronization.

mdo6180 commented 1 month ago

It appears that adding time.sleep(0.1) to _monitor_thread_func in the FilesystemStoreNode.start_monitoring method resolved the issue. I think _monitor_thread_func was a pretty "busy" function, i.e., it was constantly acquiring the lock and using CPU cycles to call os.listdir() to check which files were currently available in the directory, which is why none of the other threads could execute.
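A rough sketch of the fix described above, under the assumption that the monitor loop holds the lock only while listing the directory and sleeps outside of it. DemoFilesystemStore, poll_interval, seen, and stop_monitoring are hypothetical names, not the project's actual API.

```python
import os
import threading
import time


class DemoFilesystemStore:
    """Illustrative stand-in for FilesystemStoreNode."""

    def __init__(self, path, poll_interval=0.1):
        self.path = path
        self.poll_interval = poll_interval
        self.resource_lock = threading.Lock()
        self.seen = set()
        self._stop = threading.Event()
        self._thread = None

    def start_monitoring(self):
        def _monitor_thread_func():
            while not self._stop.is_set():
                with self.resource_lock:
                    # Hold the lock only for the directory scan.
                    self.seen.update(os.listdir(self.path))
                # Sleep outside the lock so other threads get a window to acquire it.
                time.sleep(self.poll_interval)

        self._thread = threading.Thread(target=_monitor_thread_func, daemon=True)
        self._thread.start()

    def stop_monitoring(self):
        self._stop.set()
        self._thread.join()

    def get_artifact(self, name):
        # Competes with the monitor thread for the same lock.
        with self.resource_lock:
            return os.path.join(self.path, name) if name in self.seen else None
```

Without the sleep, the monitor thread would release and immediately re-acquire the lock on every iteration, leaving callers of get_artifact very little opportunity to get in.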

It also explains why making changes in the @BaseResourceNode.resource_accessor decorator did not help at all: the decorator was placed only on the start_monitoring method, when it should have been placed on _monitor_thread_func. The start_monitoring method doesn't really do anything other than create a thread to run the _monitor_thread_func function defined inside it. Placing the decorator on start_monitoring therefore meant that the decorator executed only once, when start_monitoring was called, so it had no effect on forcing the thread running _monitor_thread_func to release the lock.
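The decorator placement problem can be reproduced in isolation. In this sketch, counting_accessor and DemoNode are hypothetical; the wrapper counts how often it runs, to show that decorating the outer method does not wrap the loop running in the thread.

```python
import functools
import threading


def counting_accessor(func):
    """Hypothetical lock-taking decorator that also counts its own invocations."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        self.wrapper_calls += 1
        with self.resource_lock:
            return func(self, *args, **kwargs)
    return wrapper


class DemoNode:
    def __init__(self):
        self.resource_lock = threading.RLock()
        self.wrapper_calls = 0
        self.iterations = 0

    @counting_accessor
    def start_monitoring(self):
        def _monitor_thread_func():
            # The loop body never passes through the decorator's wrapper.
            for _ in range(5):
                self.iterations += 1

        thread = threading.Thread(target=_monitor_thread_func)
        thread.start()
        return thread


node = DemoNode()
node.start_monitoring().join()
print(node.wrapper_calls, node.iterations)  # prints "1 5"
```

The wrapper runs once, for the call that creates the thread, while the loop runs five times without it. Any sleep or lock handling added to the decorator is therefore invisible to the loop.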

mdo6180 commented 1 month ago

See commit that resolved the issue here

mdo6180 commented 1 month ago

Although the issue is resolved, I think it's still important to think about a way to abstract _monitor_thread_func away from the user, i.e., perhaps create the loop for them, so that we can call sleep automatically and also start the thread for them.
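One possible shape for that abstraction, as a sketch only: a base class owns the loop, the lock, the sleep, and the thread, and the user overrides a single non-looping hook. PollingMonitor, check_resource, poll_interval, and CountingMonitor are all hypothetical names, not part of Anacostia.

```python
import threading
import time


class PollingMonitor:
    """Hypothetical base class that runs the monitoring loop on the user's behalf."""

    def __init__(self, poll_interval=0.1):
        self.poll_interval = poll_interval
        self.resource_lock = threading.RLock()
        self._stop_event = threading.Event()
        self._thread = None

    def check_resource(self):
        """Override with one pass of the check; no loop, no sleep, no locking."""
        raise NotImplementedError

    def _run(self):
        while not self._stop_event.is_set():
            with self.resource_lock:
                self.check_resource()
            # Waiting on the event acts as the sleep and lets stop_monitoring
            # interrupt it promptly; the lock is not held while waiting.
            self._stop_event.wait(self.poll_interval)

    def start_monitoring(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop_monitoring(self):
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()


class CountingMonitor(PollingMonitor):
    """Example subclass: the user only writes the body of one check."""

    def __init__(self):
        super().__init__(poll_interval=0.01)
        self.checks = 0

    def check_resource(self):
        self.checks += 1
```

With this design the user cannot forget the sleep or hold the lock across iterations, since both are handled in _run.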

mdo6180 commented 1 month ago