pdeperio opened this issue 7 years ago
So, I did a first test on the datamanager, adding the correction tasks alongside the checksum, but they take too much time: each task loops over all the runs, and in the meantime the runs sit waiting to be verified. I'll try to start a separate cax session in parallel and check whether it finishes in a reasonable time.
The test was definitely negative! Even with a separate cax process, AddElectronLifetime and AddGains take a huge amount of time to run over all the runs. On a single run they are fast, but running them with `cax --once --config ...` takes a very long time.
I think the bottleneck is that each task runs over all runs before moving on to the next task. It might be more efficient to do the opposite: for each run, do all tasks (see the sketch below).
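As an illustration of the two iteration orders (not the actual cax code; `Task.process_run`, `task_major`, and `run_major` are hypothetical names):

```python
class Task:
    """Stand-in for a cax task such as AddChecksum or AddElectronLifetime."""
    def process_run(self, run):
        pass  # real tasks do the per-run work here

def task_major(tasks, runs):
    # Current behaviour: each task sweeps over every run before the next
    # task starts, so a run waits on all earlier sweeps to be verified.
    for task in tasks:
        for run in runs:
            task.process_run(run)

def run_major(tasks, runs):
    # Proposed inversion: complete all tasks for one run before moving on,
    # so each run is fully checked and corrected as early as possible.
    for run in runs:
        for task in tasks:
            task.process_run(run)
```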
In any case we saturate the RAM on the datamanager machine (values in MB):
```
             total       used       free     shared    buffers     cached
Mem:         20424      20423          0       2649          0      17792
-/+ buffers/cache:       2631      17792
Swap:         1024       1023          0
```
I think this is related to https://github.com/XENON1T/cax/issues/108 and https://github.com/XENON1T/cax/pull/114, which we never understood.
Please review that issue and PR.
Ciao, I may have found a way to stop massive-cax from submitting thousands of useless jobs.
Basically, I check the variables present in the RunDB "processor" field and verify that all entries of "correction_versions" are there. Only if that check passes does the code generate the script to submit the jobs.
Of course, another check is whether the processed and minitree files are already in the local directories on Midway, and on which host the code is running. I still have to complete the code, but from the first test it works (a rough sketch of the check is below).
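For concreteness, a minimal sketch of such a pre-submission check, following the literal reading of the comment above (both checks must pass before a job is generated). The field names ("processor", "correction_versions") come from the description, but the document layout, file-name conventions, paths, and the function itself (`should_submit`) are assumptions, not the actual cax code:

```python
import os
import socket

def should_submit(run_doc, required_corrections, processed_dir, minitree_dir):
    """Return True only if this run is worth a massive-cax job submission.

    run_doc: the run's document from the runs DB (hypothetical layout).
    required_corrections: keys expected under processor['correction_versions'].
    """
    versions = run_doc.get('processor', {}).get('correction_versions', {})

    # 1. All expected correction versions must be recorded in the RunDB.
    if not all(name in versions for name in required_corrections):
        return False

    # 2. We must be on a Midway host, with the processed file and minitrees
    #    already present in the local directories (file names are guesses).
    if 'midway' not in socket.gethostname():
        return False
    name = run_doc['name']
    processed_ok = os.path.exists(os.path.join(processed_dir, name + '.root'))
    minitree_ok = os.path.exists(os.path.join(minitree_dir, name + '_Basics.root'))
    return processed_ok and minitree_ok
```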
@lucrlom I think we can raise the memory of the xe1t-datamanager virtual machine on xetransfer if it helps. In any case I run two cax-like sessions (massive-cax and massive-ruciax) under the user xe1ttransfer, and each process needs ~12 GB of memory. I haven't yet understood why these processes need so much memory (it seems like a lot to me).
We need to reduce the number of short jobs being submitted on Midway. Some possible solutions that may or may not be combined:
1. Bundling runs (which should be fine since we're not running very long pax processing anymore) so each job runs longer.
2. Using job arrays to reduce the number of jobs the scheduler has to handle (I think); a sketch combining this with 1 follows the list.
3. Running Corrections locally (this seems to be fast now after the previous hax improvements) and implementing local checks for intensive processes (e.g. AddChecksum, ProcessBatchQueueHax) before submitting jobs that actually run those tasks.
4. Adding minitrees to RunsDB, to facilitate the local checking in 3.
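As a rough sketch of how items 1 and 2 could combine, assuming SLURM on Midway; the partition name, bundle size, script file name, and the per-run `process_run` command are placeholders, not an existing cax interface:

```python
def write_bundled_array_script(run_names, runs_per_job=20,
                               script_path='cax_array.sbatch'):
    """Bundle runs into chunks and drive them from one SLURM job array,
    so the scheduler sees a single array instead of thousands of short jobs."""
    chunks = [run_names[i:i + runs_per_job]
              for i in range(0, len(run_names), runs_per_job)]
    with open(script_path, 'w') as f:
        f.write('#!/bin/bash\n')
        f.write('#SBATCH --partition=xenon1t\n')            # placeholder partition
        f.write('#SBATCH --array=0-%d\n' % (len(chunks) - 1))
        f.write('case $SLURM_ARRAY_TASK_ID in\n')
        for i, chunk in enumerate(chunks):
            # Each array task works through one bundle of runs sequentially.
            cmds = '; '.join('process_run %s' % run for run in chunk)
            f.write('  %d) %s ;;\n' % (i, cmds))
        f.write('esac\n')
    return script_path
```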