Open fanpu opened 3 years ago
Hey Team! I'm interested in working on this issue and I'd like to clarify my understanding of the code: (These might or might not be related to the actual issue raised above)
allocVM:
Firstly, is there a reason why the lock is released in the case where it was not acquired? This may cause concurrency issues in the future.
Secondly, it seems it is possible that None is returned from this method. However, this might cause an exception to be thrown here. This might not be intentional, because it seems like you handle the case where a VM is not preallocated in the worker itself; however, because of this logging, the worker thread doesn't even get to run.
I'm also a little confused about what is supposed to happen if there were no such logging exception and a worker was successfully created -- it seems like you create a new VM with the job id here. Could I get some clarification on this? It would be really helpful!
getNextPendingJobReuse: The only use of this method is in the jobManager when an id is passed in. It seems that a lot of redundant work is being done, since it first creates the entire list of liveJobs and then loops through this list to get the target_id. Besides, in the previous case, I think it is also possible to return the job (and the id) from the call to getNextPendingJob here. This might help entirely remove the need to repeatedly take the lock on the jobQueue to look up the job object. I might not be understanding this correctly, but is there a reason why the code is designed this way?
Besides, it seems to be implied here that there might be multiple consumers of the jobQueue. I'm a little confused whether this is the case, or whether this is just handling the case where there are no pending jobs to be done?
I think it might be possible to add a TangoQueue that keeps track of a FIFO queue of unassigned jobs so you can easily pop them off -- currently, it seems like we loop through liveJobs and find one job that is not marked as assigned, which might cause quite a long wait time if the unassigned jobs are at the end of the hash dictionary (or maybe even starvation, since jobs with ids that hash to a key which puts them at the start of the dictionary might repeatedly get first priority? I'm not sure about either of these points).
If there are multiple consumers of the jobQueue, it might also help to hold the lock until you mark the job as assigned, to prevent the case where other consumers cannot find the job because it has been completed by another consumer (implied by the comment here). I might not be fully aware of the context here, so some advice would be very helpful.
Thank you sm in advance! :-)
Hey Team! I'm interested in working on this issue and I'd like to clarify my understanding of the code: (These might or might not be related to the actual issue raised above)
- allocVM: Firstly, is there a reason why the lock is released in the case where it was not acquired? This may cause concurrency issues in the future.
Yes, I agree that this looks like a bug. Per the Python docs, releasing an unlocked lock raises a RuntimeError.
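A minimal sketch of the safer pattern (the lock here is illustrative, not Tango's actual code):

```python
import threading

lock = threading.Lock()

# Releasing a lock that was never acquired raises:
#   RuntimeError: release unlocked lock

# Safer pattern: only release the lock if the acquire actually succeeded.
acquired = lock.acquire(blocking=False)
try:
    if acquired:
        pass  # critical section would go here
finally:
    if acquired:
        lock.release()
```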
Secondly, it seems it is possible that None is returned from this method. However, this might cause an exception to be thrown here. This might not be intentional, because it seems like you handle the case where a VM is not preallocated in the worker itself; however, because of this logging, the worker thread doesn't even get to run.
This is a good catch. The logging statement should check whether preVM is None before trying to use its value.
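A sketch of the guarded check (the helper name is hypothetical, not Tango's actual code), so the log line itself cannot raise and kill the worker thread:

```python
def describe_prevm(preVM):
    # Guard against preVM being None before touching its attributes.
    if preVM is None:
        return "no preallocated VM; will allocate one on demand"
    return "using preallocated VM %s" % preVM.id
```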
I'm also a little confused about what is supposed to happen if there were no such logging exception and a worker was successfully created -- it seems like you create a new VM with the job id here. Could I get some clarification on this? It would be really helpful!
It seems like the preceding if condition is first checking whether this is a job that should be run on AWS EC2 instances; accessKeyId is only used for EC2. Otherwise, it tries to find a VM, either by taking a preallocated VM from the pool or by spinning up a new instance on demand.
I agree that the default name preVM is also confusing. I believe it stands for "preallocated VM", and I suspect the code was originally written to only support preallocated VM pools, with the ability to spin up instances on demand added afterwards. This naming should be updated to reflect what it actually is.
- getNextPendingJobReuse: The only use of this method is in the jobManager when an id is passed in. It seems that a lot of redundant work is being done, since it first creates the entire list of liveJobs and then loops through this list to get the target_id. Besides, in the previous case, I think it is also possible to return the job (and the id) from the call to getNextPendingJob here. This might help entirely remove the need to repeatedly take the lock on the jobQueue to look up the job object. I might not be understanding this correctly, but is there a reason why the code is designed this way?
I am not the author of the code so I can't speak to the original design, but I agree that there are a lot of redundant lookups. We could change getNextPendingJob to return both the id and the job, to avoid having to make another Redis query to recover the id in getNextPendingJobReuse.
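A rough sketch of that change, using a plain dict and lock as stand-ins for the real Redis-backed structures:

```python
import threading

class JobQueue:
    """Simplified stand-in: live_jobs is a plain dict instead of Redis."""

    def __init__(self):
        self.queue_lock = threading.Lock()
        self.live_jobs = {}  # job id -> job record

    def get_next_pending_job(self):
        """Return (id, job) for the first unassigned job, or (None, None).

        Returning both values means callers like getNextPendingJobReuse
        no longer need a second lookup to recover the id.
        """
        with self.queue_lock:
            for job_id, job in self.live_jobs.items():
                if not job["assigned"]:
                    job["assigned"] = True  # mark while still holding the lock
                    return job_id, job
        return None, None
```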
- Besides, it seems to be implied here that there might be multiple consumers of the jobQueue. I'm a little confused whether this is the case, or whether this is just handling the case where there are no pending jobs to be done?
Based on this supervisord config from the old one-click installation, it appears that there can be multiple server.py instances running at once, each able to add jobs to the queue. So instead of multiple consumers we actually have multiple producers. In all honesty, though, we could just require that there be a single producer, because this will never be the bottleneck.
- I think it might be possible to add a TangoQueue that keeps track of a FIFO queue of unassigned jobs so you can easily pop them off -- currently, it seems like we loop through liveJobs and find one job that is not marked as assigned, which might cause quite a long wait time if the unassigned jobs are at the end of the hash dictionary (or maybe even starvation, since jobs with ids that hash to a key which puts them at the start of the dictionary might repeatedly get first priority? I'm not sure about either of these points).
This is a good idea, and within each queue we can enforce data structure consistency as well.
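A minimal illustration of the FIFO idea with a deque (the real TangoQueue is Redis-backed; this only shows the O(1) pop versus scanning liveJobs):

```python
from collections import deque

class UnassignedJobQueue:
    """FIFO of unassigned job ids: pop in O(1) instead of scanning liveJobs."""

    def __init__(self):
        self._fifo = deque()

    def add(self, job_id):
        self._fifo.append(job_id)

    def pop_next(self):
        # Strict arrival order also removes the starvation worry above.
        return self._fifo.popleft() if self._fifo else None
```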
- If there are multiple consumers of the jobQueue, it might also help to hold the lock until you mark the job as assigned, to prevent the case where other consumers cannot find the job because it has been completed by another consumer (implied by the comment here). I might not be fully aware of the context here, so some advice would be very helpful.
Thank you sm in advance! :-)
Do let me know if you have more questions!
https://github.com/autolab/Tango/pull/148 is also a great place to look at for things which have been fixed by PDL that we could also use
Hi Fan, thank you for the explanation! I still have some doubts and if possible it would be nice to have them clarified:
In getNextPendingJobReuse, I'd like some advice on how the fact that a job is assigned relates to whether a new VM is created for it in allocVM. It seems like this method is always called before the job is marked as assigned in the jobManager loop (link), and it is also not used anywhere else. Just a wild guess: it seems like there is an assumption here that if a job has already been assigned, it will have a VM allocated to it. Is this assumption valid?
As an extension to my previous question, I was confused about how the VM id relates to the job id here. As you explained above, this id is used in initializeVM when EC2 instances are being used. Consider the case where a particular VM instance was created in the past (say VM id = 5). Now a new job has id 5 and is about to run, but no VMs are available. If no exceptions were raised prematurely (by the logging statement mentioned above), it seems like it would fall into this case where a new VM is created with the job id. Will this conflict with the previously created VM? I may have understood this wrongly too. Also, since this new VM is not added to the TangoQueue that keeps track of the free VM ids for each VM name, is this VM ever destroyed?
Most importantly, it seems like the queueLock in the JobQueue protects the queue from race conditions due to the multi-threading between Workers and the JobQueue. However, it seems like there isn't anything to prevent possible race conditions between the producers (server) and consumers (jobManager) -- at least from the documentation provided, these run in separate processes. Is there a possible race condition with the additions and deletions to the TangoDictionaries, for liveJobs for example?
I was also wondering: when does a job get removed from the dead jobs TangoDictionary? It seems like things are only removed when delJob is called (and based on the comments, this is to be used with caution). I was just wondering whether dead jobs ever get reset.
Thanks a lot!
Hi Fan, thank you for the explanation! I still have some doubts and if possible it would be nice to have them clarified:
- In getNextPendingJobReuse, I'd like some advice on how the fact that a job is assigned relates to whether a new VM is created for it in allocVM. It seems like this method is always called before the job is marked as assigned in the jobManager loop (link), and it is also not used anywhere else. Just a wild guess: it seems like there is an assumption here that if a job has already been assigned, it will have a VM allocated to it. Is this assumption valid?
If a job is already assigned to a VM, you do not need to find another VM for it. You also do not necessarily need to create a new VM for it in allocVM if REUSE_VMS is set to true.
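A hypothetical sketch of that decision (the function names and the free-pool shape are illustrative assumptions, not Tango's actual API):

```python
def create_new_vm(job_id):
    # Stand-in for the real VMMS call that spins up a fresh instance.
    return {"id": job_id, "fresh": True}

def alloc_vm(job_id, free_pool, reuse_vms=True):
    """Reuse a free VM when REUSE_VMS is on; otherwise create one on demand."""
    if reuse_vms and free_pool:
        return free_pool.pop(0)  # take an existing preallocated/free VM
    return create_new_vm(job_id)
```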
- As an extension to my previous question, I was confused about how the VM id relates to the job id here. As you explained above, this id is used in initializeVM when EC2 instances are being used.
Sorry if I was unclear previously, but I meant to say that the id attribute of vm refers to the job ID that the VM is servicing. This was poorly named and should have been called job_id instead of id.
Consider the case where a particular VM instance was created in the past (say VM id = 5). Now a new job has id 5 and is about to run, but no VMs are available. If no exceptions were raised prematurely (by the logging statement mentioned above), it seems like it would fall into this case where a new VM is created with the job id. Will this conflict with the previously created VM? I may have understood this wrongly too.
This only happens if MAX_JOBID is reached and the counter wraps back to 1 while the previous VM is still running the old job. Possible, but extremely unlikely. You can also make MAX_JOBID very large to be safe, though right now it is already 1000.
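The wraparound can be seen in the id arithmetic (the MAX_JOBID value is taken from the discussion above; the helper name is hypothetical):

```python
MAX_JOBID = 1000

def next_job_id(current_id):
    # After MAX_JOBID the counter wraps to 1, so a new job can reuse an
    # id (e.g. 5) while an old VM tagged with job id 5 is still alive.
    return 1 if current_id >= MAX_JOBID else current_id + 1
```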
Also, since this new VM is not added to the TangoQueue that keeps track of the free VM ids for each VM name, is this VM ever destroyed?
Could you point me to which part of the codebase you are referring to?
- Most importantly, it seems like the queueLock in the JobQueue protects the queue from race conditions due to the multi-threading between Workers and the JobQueue. However, it seems like there isn't anything to prevent possible race conditions between the producers (server) and consumers (jobManager) -- at least from the documentation provided, these run in separate processes. Is there a possible race condition with the additions and deletions to the TangoDictionaries, for liveJobs for example?
Unfortunately, I believe so, but there is no real reason to run multiple servers, since this will never be the bottleneck.
- I was also wondering: when does a job get removed from the dead jobs TangoDictionary? It seems like things are only removed when delJob is called (and based on the comments, this is to be used with caution). I was just wondering whether dead jobs ever get reset.
What do you mean by reset? If you are asking if they are ever restarted, the answer is no.
Thanks a lot!
Definitely and good luck!
There are potential concurrency issues in parts of the codebase that result in execution traces which cause Tango to behave anomalously.
See: https://github.com/autolab/Tango/issues/138 (was replicable until very recently, but we used a workaround to fix it instead of resolving the root issue of why it was happening)