DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

About unstable error #4373

Open manabuishii opened 1 year ago

manabuishii commented 1 year ago

I run Toil on Lustre with Univa Grid Engine.

The workflow sometimes succeeds and sometimes fails.

When I rerun it, it again sometimes succeeds and sometimes fails.

I saw the same behavior on Lustre with Slurm.

I set --retryCount 3, but I think it has no effect.

I ran it several times with --restart, and it finally succeeded.

[2023-02-07T22:07:06+0900] [MainThread] [I] [toil.leader] 0 jobs are running, 9 jobs are issued and waiting to run
Following jobs do not exist or permissions are not sufficient: 
1304598
[2023-02-07T22:08:09+0900] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'CWLJob' dfast-filelist.cwl.12.dfast.cwl kind-CWLJob/instance-8m44uilm v3
Exit reason: None
[2023-02-07T22:08:09+0900] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'CWLJob' dfast-filelist.cwl.12.dfast.cwl kind-CWLJob/instance-8m44uilm v4
[2023-02-07T22:08:09+0900] [MainThread] [W] [toil.leader] Log from job "kind-CWLJob/instance-8m44uilm" follows:
=========>
        [2023-02-07T22:07:18+0900] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
        [2023-02-07T22:07:18+0900] [MainThread] [I] [toil] Running Toil version 5.9.0-8155e0a981f4d728762a7cbc920b7ed544fe4ae7 on host it050.
        [2023-02-07T22:07:18+0900] [MainThread] [I] [toil.worker] Working on job 'CWLJob' dfast-filelist.cwl.12.dfast.cwl kind-CWLJob/instance-8m44uilm v3
        [2023-02-07T22:07:19+0900] [MainThread] [I] [toil.worker] Loaded body Job('CWLJob' dfast-filelist.cwl.12.dfast.cwl kind-CWLJob/instance-8m44uilm v3) from description 'CWLJob' dfast-filelist.cwl.12.dfast.cwl kind-CWLJob/instance-8m44uilm v3
        Traceback (most recent call last):
          File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/worker.py", line 390, in workerScript
            with fileStore.open(job):
          File "/opt/pkg/intel/oneapi/intelpython/latest/lib/python3.9/contextlib.py", line 119, in __enter__
            return next(self.gen)
          File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/fileStores/nonCachingFileStore.py", line 66, in open
            self._removeDeadJobs(self.coordination_dir)
          File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/fileStores/nonCachingFileStore.py", line 197, in _removeDeadJobs
            if not process_name_exists(coordination_dir, jobState['jobProcessName']):
          File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/lib/threading.py", line 318, in process_name_exists
            nameFD = os.open(nameFileName, os.O_RDONLY)
        FileNotFoundError: [Errno 2] No such file or directory: '/lustre7/home/manabu/work/MAG/cwl/MAGoutput/103/workdir/5191a5338c755b57ae125a56fc3d594d/tmpxsl0ac8x'
        [2023-02-07T22:07:19+0900] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host it050
<=========
Traceback (most recent call last):
  File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/cwl/cwltoil.py", line 3572, in main
    outobj = toil.restart()
  File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/common.py", line 1065, in restart
    return self._runMainLoop(rootJobDescription)
  File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/common.py", line 1468, in _runMainLoop
    return Leader(config=self.config,
  File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/leader.py", line 292, in run
    self.innerLoop()
  File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/leader.py", line 789, in innerLoop
    self._gatherUpdatedJobs(updatedJobTuple)
  File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/leader.py", line 747, in _gatherUpdatedJobs
    self.process_finished_job(bsID, exitStatus, wall_time=wallTime, exit_reason=exitReason)
  File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/leader.py", line 1188, in process_finished_job
    return self.process_finished_job_description(issued_job, result_status, wall_time, exit_reason, batch_system_id)
  File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/leader.py", line 1243, in process_finished_job_description
    StatsAndLogging.writeLogFiles(replacement_job.chainedJobs, log_stream, self.config, failed=True)
  File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/statsAndLogging.py", line 122, in writeLogFiles
    with writeFn(fullName, 'wb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'MAGoutput/103/writeLogs/failed_CWLJob_dfast--filelist.cwl.12.dfast.cwl_kind--CWLJob-instance--8m44uilm_v3000.log'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lustre7/home/manabu/work/MAG/cwl/venv/bin/toil-cwl-runner", line 8, in <module>
    sys.exit(main())
  File "/lustre7/home/manabu/work/MAG/cwl/venv/lib/python3.9/site-packages/toil/cwl/cwltoil.py", line 3576, in main
    if getattr(err, "exit_code") == CWL_UNSUPPORTED_REQUIREMENT_EXIT_CODE:
AttributeError: 'FileNotFoundError' object has no attribute 'exit_code'
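
The last traceback shows a secondary problem: `getattr(err, "exit_code")` is called without a default, so any exception that lacks an `exit_code` attribute (here a `FileNotFoundError`) raises `AttributeError` and hides the original error. A minimal sketch of a safer check follows. The helper name `is_unsupported_requirement`, the class `FakeCwlError`, and the constant's value are assumptions for illustration, not Toil's actual code.

```python
# Assumed placeholder value; the real constant is defined inside Toil.
CWL_UNSUPPORTED_REQUIREMENT_EXIT_CODE = 33


def is_unsupported_requirement(err: BaseException) -> bool:
    """Hypothetical helper: compare exit_code without assuming it exists."""
    # With a default, getattr returns None for exceptions such as
    # FileNotFoundError instead of raising AttributeError.
    return getattr(err, "exit_code", None) == CWL_UNSUPPORTED_REQUIREMENT_EXIT_CODE


class FakeCwlError(Exception):
    """Hypothetical exception type that carries an exit_code attribute."""

    exit_code = CWL_UNSUPPORTED_REQUIREMENT_EXIT_CODE


if __name__ == "__main__":
    print(is_unsupported_requirement(FileNotFoundError(2, "No such file or directory")))
    print(is_unsupported_requirement(FakeCwlError("unsupported requirement")))
```

With a check like this, the original `FileNotFoundError` would propagate unchanged instead of being replaced by the `AttributeError`.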

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-1287
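
The worker traceback above fails in `os.open(nameFileName, os.O_RDONLY)` with `FileNotFoundError`. One plausible reading is a race on the shared filesystem: the name file is removed by another worker between the moment it is discovered and the moment it is opened. The sketch below shows one way such a check could tolerate a missing file. The helper `name_file_exists` is hypothetical and is not Toil's `process_name_exists`; whether the actual fix takes this approach is not stated in this thread.

```python
import os
import tempfile


def name_file_exists(name_file_path: str) -> bool:
    """Hypothetical helper: report a vanished name file as 'not running'."""
    try:
        fd = os.open(name_file_path, os.O_RDONLY)
    except FileNotFoundError:
        # The file disappeared before we could open it, for example because
        # another worker already cleaned it up. Treat that as "no process".
        return False
    os.close(fd)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as coordination_dir:
        present = os.path.join(coordination_dir, "present")
        open(present, "w").close()
        print(name_file_exists(present))
        print(name_file_exists(os.path.join(coordination_dir, "missing")))
```

This only covers the missing-file case; a real implementation would also need whatever locking the coordination directory relies on.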

manabuishii commented 1 year ago

#4375 (I tested only the first commit) works fine in our environment.

I'll test the entire change next.

I think --retryCount also works fine now.

Issued job 'CWLJob' dfast-filelist.cwl.28.dfast.cwl kind-CWLJob/instance-92yuu2x8 v1 with job batch system ID: 101 and disk: 1.0 Gi, memory: 13.0 Gi, cores: 1.0, accelerators: [], preemptible: False
Issued job 'CWLJob' dfast-filelist.cwl.9.dfast.cwl kind-CWLJob/instance-pbihqqjg v1 with job batch system ID: 102 and disk: 1.0 Gi, memory: 13.0 Gi, cores: 1.0, accelerators: [], preemptible: False
Job failed with exit value 1: 'CWLJob' dfast-filelist.cwl.16.dfast.cwl kind-CWLJob/instance-40zznj18 v1
Exit reason: None
No log file is present, despite job failing: 'CWLJob' dfast-filelist.cwl.16.dfast.cwl kind-CWLJob/instance-40zznj18 v1
Due to failure we are reducing the remaining try count of job 'CWLJob' dfast-filelist.cwl.16.dfast.cwl kind-CWLJob/instance-40zznj18 v1 with ID kind-CWLJob/instance-40zznj18 to 3
Issued job 'CWLJob' dfast-filelist.cwl.16.dfast.cwl kind-CWLJob/instance-40zznj18 v2 with job batch system ID: 103 and disk: 1.0 Gi, memory: 13.0 Gi, cores: 1.0, accelerators: [], preemptible: False
Job failed with exit value 1: 'CWLJob' dfast-filelist.cwl.25.dfast.cwl kind-CWLJob/instance-lhuoiika v1
Exit reason: None
No log file is present, despite job failing: 'CWLJob' dfast-filelist.cwl.25.dfast.cwl kind-CWLJob/instance-lhuoiika v1
Due to failure we are reducing the remaining try count of job 'CWLJob' dfast-filelist.cwl.25.dfast.cwl kind-CWLJob/instance-lhuoiika v1 with ID kind-CWLJob/instance-lhuoiika to 3
Issued job 'CWLJob' dfast-filelist.cwl.25.dfast.cwl kind-CWLJob/instance-lhuoiika v2 with job batch system ID: 104 and disk: 1.0 Gi, memory: 13.0 Gi, cores: 1.0, accelerators: [], preemptible: False
Job failed with exit value 1: 'CWLJob' dfast-filelist.cwl.48.dfast.cwl kind-CWLJob/instance-k8lixfj7 v1
Exit reason: None
No log file is present, despite job failing: 'CWLJob' dfast-filelist.cwl.48.dfast.cwl kind-CWLJob/instance-k8lixfj7 v1
Due to failure we are reducing the remaining try count of job 'CWLJob' dfast-filelist.cwl.48.dfast.cwl kind-CWLJob/instance-k8lixfj7 v1 with ID kind-CWLJob/instance-k8lixfj7 to 3
Issued job 'CWLJob' dfast-filelist.cwl.48.dfast.cwl kind-CWLJob/instance-k8lixfj7 v2 with job batch system ID: 105 and disk: 1.0 Gi, memory: 13.0 Gi, cores: 1.0, accelerators: [], preemptible: False
Job failed with exit value 1: 'CWLJob' dfast-filelist.cwl.32.dfast.cwl kind-CWLJob/instance-vnt7r2bv v1
Exit reason: None
No log file is present, despite job failing: 'CWLJob' dfast-filelist.cwl.32.dfast.cwl kind-CWLJob/instance-vnt7r2bv v1
Due to failure we are reducing the remaining try count of job 'CWLJob' dfast-filelist.cwl.32.dfast.cwl kind-CWLJob/instance-vnt7r2bv v1 with ID kind-CWLJob/instance-vnt7r2bv to 3
Issued job 'CWLJob' dfast-filelist.cwl.32.dfast.cwl kind-CWLJob/instance-vnt7r2bv v2 with job batch system ID: 106 and disk: 1.0 Gi, memory: 13.0 Gi, cores: 1.0, accelerators: [], preemptible: False
Job failed with exit value 1: 'CWLJob' dfast-filelist.cwl.37.dfast.cwl kind-CWLJob/instance-wb2zftb7 v1
Exit reason: None
No log file is present, despite job failing: 'CWLJob' dfast-filelist.cwl.37.dfast.cwl kind-CWLJob/instance-wb2zftb7 v1
Due to failure we are reducing the remaining try count of job 'CWLJob' dfast-filelist.cwl.37.dfast.cwl kind-CWLJob/instance-wb2zftb7 v1 with ID kind-CWLJob/instance-wb2zftb7 to 3
Issued job 'CWLJob' dfast-filelist.cwl.37.dfast.cwl kind-CWLJob/instance-wb2zftb7 v2 with job batch system ID: 107 and disk: 1.0 Gi, memory: 13.0 Gi, cores: 1.0, accelerators: [], preemptible: False
Job failed with exit value 1: 'CWLJob' dfast-filelist.cwl.15.dfast.cwl kind-CWLJob/instance-xq28p1nm v1
Exit reason: None
No log file is present, despite job failing: 'CWLJob' dfast-filelist.cwl.15.dfast.cwl kind-CWLJob/instance-xq28p1nm v1
Due to failure we are reducing the remaining try count of job 'CWLJob' dfast-filelist.cwl.15.dfast.cwl kind-CWLJob/instance-xq28p1nm v1 with ID kind-CWLJob/instance-xq28p1nm to 3
Issued job 'CWLJob' dfast-filelist.cwl.15.dfast.cwl kind-CWLJob/instance-xq28p1nm v2 with job batch system ID: 108 and disk: 1.0 Gi, memory: 13.0 Gi, cores: 1.0, accelerators: [], preemptible: False
Finished toil run successfully.
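
In the log above, each failed job has its remaining try count reduced to 3 and is then reissued as a new version (v2). If the run still used --retryCount 3, that would be consistent with a job starting with retry count + 1 attempts; this is inferred from the log, not confirmed from Toil's source. A minimal sketch of that bookkeeping, with the hypothetical names `run_with_retries` and `job`:

```python
from typing import Callable


def run_with_retries(job: Callable[[], int], retry_count: int) -> bool:
    """Hypothetical sketch: run job until it exits 0 or tries run out."""
    # Assumption: one initial attempt plus retry_count retries.
    remaining_tries = retry_count + 1
    while remaining_tries > 0:
        if job() == 0:
            return True
        # A failure consumes one try; the job is reissued if any remain.
        remaining_tries -= 1
    return False


if __name__ == "__main__":
    exit_codes = iter([1, 0])  # fail once, then succeed
    print(run_with_retries(lambda: next(exit_codes), retry_count=3))
```

A job that fails once and then succeeds, as in the log, finishes with tries to spare; a job that always fails stops after retry count + 1 attempts.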