crossing of jobInstances when being deployed

DAMPEEU / DmpWorkflow

Workflow framework for DAMPE remote computing & accounting

The following is a well-known bug that has been present in our application of the workflow. On the computing farms, agents pull new job instances based on unique identifiers; these instances are subsequently dispatched as "HPC jobs". The bug occurs when at least two different jobs are actively running (e.g. MC simulation campaigns with different configurations): occasionally their instances get mixed up. For example:

- Task1: protons, 1 TeV
- Task2: electrons, 100 GeV

When both of these tasks have active instances, the instanceIds appear to get crossed, i.e. an instanceId ends up associated with the wrong taskId. As a result, the jobs execute without error but contain the wrong content. This may be related to #101, but it is difficult to track down. It can happen up to 10% of the time. It is also possible, though unlikely, that the mix-up happens at the DB level, so pinning down the exact cause is difficult and time-consuming.

Within the team, this is known as the "PdgId / Energy Range bug".
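To make the symptom concrete, here is a minimal sketch of a check that flags instances whose content does not match the task they are filed under. Everything in it is an assumption for illustration: the record layout, the field names `pdg_id` and `energy_gev`, and the helper `find_crossed_instances` are hypothetical and are not taken from the DmpWorkflow schema. The PDG codes (2212 for protons, 11 for electrons) are the standard ones.

```python
# Hypothetical task definitions mirroring the example above.
# Field names are assumptions, not the actual DmpWorkflow schema.
TASKS = {
    "Task1": {"pdg_id": 2212, "energy_gev": 1000.0},  # protons, 1 TeV
    "Task2": {"pdg_id": 11, "energy_gev": 100.0},     # electrons, 100 GeV
}


def find_crossed_instances(instances, tasks=TASKS):
    """Return (instance_id, task_id, mismatched_fields) for every instance
    whose content disagrees with the configuration of its task."""
    crossed = []
    for inst in instances:
        expected = tasks[inst["task_id"]]
        mismatched = sorted(
            key for key, value in expected.items()
            if inst["content"].get(key) != value
        )
        if mismatched:
            crossed.append((inst["instance_id"], inst["task_id"], mismatched))
    return crossed
```

Run over the job scripts written by the daemon (or over rows pulled from the DB), a check of this kind could help tell whether the crossing already exists at the DB level or only appears at dispatch time.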

A less harmful way to test this setup is:

- Deploy a 'junk DB': use the same server, but point it at the DEVEL database (it already exists; contact me for information. The web server that serves the DEVEL DB is dampevm8).
- Set up one VM as endpoint and one VM as server (we can use one of our VMs for this). This will act as the "CNAF gateway": scripts will be written but not submitted (since the VM has no access).
- Once the daemon has dispatched a few jobs, the problem should become apparent. Batch errors are not caught, so the system will assume that the batch jobs were submitted. What is important is that multiple tasks have active instances.

FIXME: a random number needs to be assigned at each call; this can be done simply by using a mock bsub command:
```
#!/usr/bin/env python
# Mock bsub: pretends to submit a job and prints a random job number.
from random import randint

rn = randint(0, 1000)  # randint requires both bounds (inclusive)
print("submitted job {0}".format(rn))
```
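One possible way to put the mock in place on the test VM is to write it into a scratch directory and put that directory first on `PATH`, so that any call to `bsub` by name resolves to the mock. The sketch below assumes the daemon invokes `bsub` through `PATH` (not confirmed here); the helpers `install_mock_bsub` and `submit_with_mock` are hypothetical names, and the shebang uses the current interpreter so the script runs wherever the test runs.

```python
import os
import stat
import subprocess
import sys

# Body of the mock; "{python}" is filled in with the running interpreter.
MOCK_BSUB = """#!{python}
from random import randint
print("submitted job {{0}}".format(randint(0, 1000)))
"""


def install_mock_bsub(directory):
    """Write an executable mock 'bsub' into directory and return its path."""
    path = os.path.join(directory, "bsub")
    with open(path, "w") as handle:
        handle.write(MOCK_BSUB.format(python=sys.executable))
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def submit_with_mock(script_name, mock_dir):
    """Call 'bsub' with mock_dir first on PATH and return what it printed."""
    env = dict(os.environ)
    env["PATH"] = mock_dir + os.pathsep + env.get("PATH", "")
    output = subprocess.check_output(["bsub", script_name], env=env)
    return output.decode().strip()
```

For example, after `install_mock_bsub(scratch_dir)`, calling `submit_with_mock("job.sh", scratch_dir)` returns a line of the form `submitted job N`, with a different `N` on most calls, which gives each dispatched instance its own batch number without touching a real queue.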

crossing of jobInstances when being deployed #102