ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Error with latest Minigraph-Cactus version 2.6.5 #1112

Open brettChapman opened 1 year ago

brettChapman commented 1 year ago

Hi

I've recently updated to the latest version (v2.6.5) so I could use ODGI and the built-in visualisations, and I'm now getting errors complaining about system memory that I didn't get before. The same job completed fine with v2.5.4.

Thanks.

[2023-07-27T05:52:04+0000] [MainThread] [I] [toil.realtimeLogger] Stopping real-time logging server.
[2023-07-27T05:52:05+0000] [MainThread] [I] [toil.realtimeLogger] Joining real-time logging server thread.
[2023-07-27T05:52:05+0000] [MainThread] [I] [toil.common] Successfully deleted the job store: FileJobStore(/cactus/jobStore)
Traceback (most recent call last):
  File "/home/cactus/cactus_env/bin/cactus-align", line 8, in <module>
    sys.exit(main())
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/setup/cactus_align.py", line 172, in main
    results_dict = toil.start(Job.wrapJobFn(batch_align_jobs, align_jobs))
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1064, in start
    return self._runMainLoop(rootJobDescription)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1516, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 251, in run
    self.innerLoop()
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 741, in innerLoop
    self._processReadyJobs()
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 636, in _processReadyJobs
    self._processReadyJob(message.job_id, message.result_status)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 552, in _processReadyJob
    self._runJobSuccessors(job_id)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 442, in _runJobSuccessors
    self.issueJobs(successors)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 919, in issueJobs
    self.issueJob(job)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 896, in issueJob
    jobBatchSystemID = self.batchSystem.issueBatchJob(jobNode, job_environment=job_environment)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/singleMachine.py", line 755, in issueBatchJob
    self.check_resource_request(scaled_desc)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/singleMachine.py", line 506, in check_resource_request
    raise e
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/singleMachine.py", line 502, in check_resource_request
    super().check_resource_request(requirer)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/abstractBatchSystem.py", line 344, in check_resource_request
    raise e
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/abstractBatchSystem.py", line 337, in check_resource_request
    raise InsufficientSystemResources(requirer, resource, available)
toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job 'cactus_cons' kind-cactus_cons/instance-ydclnhes v1 is requesting 135084490752 bytes of memory, more than the maximum of 126000000000 bytes of>
srun: error: node-12: task 0: Exited with exit code 1
brettChapman commented 1 year ago

I saw this in the wiki: https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md#running-on-a-cluster

Adding --consMemory 126G didn't work; I get an error saying --consMemory is not recognised.

brettChapman commented 1 year ago

Disregard, I've realised --consMemory is for cactus-align only. Will try again and update.
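For reference, the command I'm going to retry looks roughly like the following (the jobStore path, input files, and output name here are just placeholders for my actual ones, following the documented cactus-align usage):

# cap the cactus_consolidated memory request so it fits on the node
cactus-align ./jobStore ./seqFile.txt ./chr1.paf ./chr1.hal --pangenome --consMemory 126G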

glennhickey commented 1 year ago

Yeah, the --consMemory option will fix it.

The issue was that cactus often used to request less memory for its Toil jobs than it actually used. This was fine most of the time, though it could certainly result in crashes when you ran out of memory.

But for slurm (at least on our cluster), going over the requested memory means instant eviction. This meant that I had to go into each job and make its memory estimate much more conservative. For jobs that don't use much memory, or whose usage is a simple function of the input size, this wasn't a big deal. But cactus_consolidated is really hard to predict, and for now it errs on the side of being too conservative. I do hope to improve it going forward.
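If you'd rather pick the value by hand than rely on the estimate, one rough approach is to check how much memory Slurm thinks your nodes have and set --consMemory somewhat below that, for example:

# show the configured memory (in MB) per node; choose --consMemory below the smallest node you plan to use
sinfo -N -o "%N %m"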

On a semi-related note, you can add memory usage to your cactus logs by setting export CACTUS_LOG_MEMORY=1
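For example, in the script you launch with srun/sbatch (the variable just needs to be set in the environment cactus runs in):

# have cactus report memory usage of its commands in the logs
export CACTUS_LOG_MEMORY=1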