DAMPEEU / DmpWorkflow

Workflow framework for DAMPE remote computing & accounting

crossing of jobInstances when being deployed #102

Open zimmerst opened 7 years ago

zimmerst commented 7 years ago

The following is a 'well known bug' that has been present in our application of the workflow. On the computing farms, agents pull new job instances, based on unique identifiers, which are subsequently dispatched as "HPC jobs". The bug occurs when at least 2 different jobs are actively running (i.e. MC simulation campaigns of different configurations): occasionally their instances get mixed up. As an example:

Task1: protons, 1 TeV

Task2: electrons, 100 GeV

When both of these tasks have active instances, there appears to be a crossing of 'instanceIds', which end up associated with the wrong taskId. The affected jobs run to completion without errors but contain the wrong content. This may be related to #101 but is difficult to track down. It may happen up to 10% of the time. It may also happen at the DB level, although this is unlikely, so an exact determination is difficult and time-consuming.

Within the team, this is known as the "PdgId / Energy Range bug".
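
To illustrate the symptom, here is a minimal sketch of a consistency check that flags crossed instances. It is not DmpWorkflow code: `TaskConfig`, `DispatchedInstance` and `find_crossed_instances` are hypothetical names, and it assumes that the particle type and energy carried by a dispatched job can be read back and compared against the configuration of the task it claims to belong to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical stand-in for a task definition (not the DmpWorkflow model)."""
    task_id: str
    particle: str
    energy: str


@dataclass(frozen=True)
class DispatchedInstance:
    """Hypothetical record of what an agent actually dispatched."""
    instance_id: int
    task_id: str   # task the instance is associated with
    particle: str  # content found in the job payload
    energy: str


def find_crossed_instances(tasks, instances):
    """Return instances whose payload does not match their claimed task."""
    by_id = {t.task_id: t for t in tasks}
    crossed = []
    for inst in instances:
        expected = by_id.get(inst.task_id)
        # An unknown task, or a payload differing from the task config, is a crossing.
        if expected is None or (inst.particle, inst.energy) != (
            expected.particle,
            expected.energy,
        ):
            crossed.append(inst)
    return crossed


tasks = [
    TaskConfig("Task1", "proton", "1 TeV"),
    TaskConfig("Task2", "electron", "100 GeV"),
]
instances = [
    DispatchedInstance(1, "Task1", "proton", "1 TeV"),
    DispatchedInstance(2, "Task2", "proton", "1 TeV"),  # crossed: Task1 content under Task2
    DispatchedInstance(3, "Task2", "electron", "100 GeV"),
]
crossed = find_crossed_instances(tasks, instances)
print([inst.instance_id for inst in crossed])  # prints [2]
```

A check of this kind only detects the mix-up after the fact; it does not tell whether the crossing happened in the agent or at the DB level.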

zimmerst commented 7 years ago

One way to test this setup in a less harmful way is: