Open lyhyl opened 1 year ago
Update:
When I debug in VS Code, I get an error inside fastr.
Exception has occurred: SystemExit
1
File "C:\Users\user\anaconda3\envs\worc\Lib\site-packages\fastr\execution\executionscript.py", line 138, in execute_job
sys.exit(1) # Signal that the job failed
File "C:\Users\user\anaconda3\envs\worc\Lib\site-packages\fastr\execution\executionscript.py", line 182, in main
execute_job(joblist)
File "C:\Users\user\anaconda3\envs\worc\Lib\site-packages\fastr\execution\executionscript.py", line 187, in <module>
main()
SystemExit: 1
Callstack:
Debug console, print(job):
<Job
id=WORC_BCMS_SY___fingerprinter_MRI_0___all
tool=worc/Fingerprinter:1.0 1.0
tmpdir=vfs://home/tmp/fingerprinter_MRI_0/all/>
Console output stops at:
...
[INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P744961 with status JobState.finished
[INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P744962 with status JobState.finished
[INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P745979 with status JobState.finished
[INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P80292 with status JobState.finished
[INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P84221 with status JobState.finished
[INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P85705 with status JobState.finished
[INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P8632 with status JobState.finished
[INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___convert_seg_train_MRI_0___P92423 with status JobState.finished
[INFO] networkrun:0668 >> Waiting for 1976 jobs:
[INFO] networkrun:0676 >> WORC_BCMS_SY___fingerprinter_classification___all: JobState.running
[INFO] networkrun:0676 >> WORC_BCMS_SY___fingerprinter_MRI_0___all: JobState.queued
[INFO] networkrun:0676 >> WORC_BCMS_SY___config_classification_sink___all___0: JobState.hold
[INFO] networkrun:0676 >> WORC_BCMS_SY___config_MRI_0_sink___all___0: JobState.hold
[INFO] networkrun:0676 >> WORC_BCMS_SY___preprocessing_train_MRI_0___P108851: JobState.hold
[INFO] networkrun:0677 >> ---- 1966 JOBS HIDDEN ----
[INFO] networkrun:0679 >> WORC_BCMS_SY___features_train_MRI_0_predict___P8632___0: JobState.hold
[INFO] networkrun:0679 >> WORC_BCMS_SY___features_train_MRI_0_predict___P92423___0: JobState.hold
[INFO] networkrun:0679 >> WORC_BCMS_SY___plot_Estimator___all: JobState.hold
[INFO] networkrun:0679 >> WORC_BCMS_SY___classification___all___0: JobState.hold
[INFO] networkrun:0679 >> WORC_BCMS_SY___performance___all___0: JobState.hold
[INFO] networkrun:0806 >> Finished job WORC_BCMS_SY___fingerprinter_classification___all with status JobState.finished
No more info about what's going on.
Update 2:
Finished job WORC_SY___fingerprinter_MRI_0___all with status JobState.failed
E:\WORC\Tmp\fingerprinter_MRI_0\all__fastr_stderr__.txt:
Traceback (most recent call last):
File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\core\target.py", line 191, in call_subprocess
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
File "C:\Users\user\anaconda3\envs\worc\lib\subprocess.py", line 800, in __init__
restore_signals, start_new_session)
File "C:\Users\user\anaconda3\envs\worc\lib\subprocess.py", line 1207, in _execute_child
startupinfo)
FileNotFoundError: [WinError 206] The filename or extension is too long
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\execution\executionscript.py", line 89, in execute_job
job.execute()
File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\execution\job.py", line 798, in execute
result = tool.execute(payload)
File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\core\tool.py", line 398, in execute
result = self.interface.execute(target, payload)
File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\resources\plugins\interfaceplugins\fastrinterface.py", line 471, in execute
target_result = target.run_command(command)
File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\resources\plugins\targetplugins\localbinarytarget.py", line 278, in run_command
return self.call_subprocess(command)
File "C:\Users\user\anaconda3\envs\worc\lib\site-packages\fastr\core\target.py", line 194, in call_subprocess
raise exceptions.FastrExecutableNotFoundError(command[0])
fastr.exceptions.FastrExecutableNotFoundError: Could not find executable "python" on PATH:
My data path is flat and shallow.
E:/Data/000001.nrrd
E:/Data/seg-000001.nrrd
E:/Data/000002.nrrd
E:/Data/seg-000002.nrrd
E:/Data/000003.nrrd
...
Solved the case:
When processing fingerprinter, too many files will make WORC crash on Windows:
len(command)
668
len(" ".join(command))
38655
On Windows, subprocess.Popen uses the CreateProcess() function (ref).
CreateProcess(lpApplicationName, lpCommandLine, ...) limits the command-line string to a maximum of 32,767 characters (ref), and the 38,655-character command above exceeds that.
Thus, file names should not be passed as command-line arguments. It is better to save them to a file and then pass that list file as the input.
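The check and the list-file idea can be sketched as follows. This is only an illustration under stated assumptions, not how fastr or WORC build their commands: the script name fingerprinter.py, the flags --images and --image-list, and the helper write_list_file are all hypothetical. subprocess.list2cmdline is the standard-library function that turns an argument list into the single string handed to CreateProcess.

```python
import subprocess
import tempfile
from pathlib import Path

# Maximum length of the CreateProcess command-line string, as quoted above.
WINDOWS_CMDLINE_LIMIT = 32767


def command_line_length(command):
    """Length of the single string Popen builds from an argument list on Windows."""
    return len(subprocess.list2cmdline(command))


def write_list_file(paths, directory):
    """Hypothetical helper: write one path per line and return the list file's path."""
    list_file = Path(directory) / "inputs.txt"
    list_file.write_text("\n".join(str(p) for p in paths), encoding="utf-8")
    return list_file


# Hypothetical inputs, modelled on the flat data layout shown earlier.
images = [f"E:/Data/{i:06d}.nrrd" for i in range(1, 2001)]

# Every file name as its own argument: the joined string grows with the data set.
long_command = ["python", "fingerprinter.py", "--images", *images]

with tempfile.TemporaryDirectory() as tmp:
    # Only the list file's path goes on the command line.
    list_file = write_list_file(images, tmp)
    short_command = ["python", "fingerprinter.py", "--image-list", str(list_file)]

    too_long = command_line_length(long_command) > WINDOWS_CMDLINE_LIMIT
    fits = command_line_length(short_command) <= WINDOWS_CMDLINE_LIMIT
    # The receiving tool would read the paths back like this.
    recovered = list_file.read_text(encoding="utf-8").splitlines()

print(too_long, fits, recovered == images)
```

With a list file, the command length no longer depends on the number of samples, so the same call works for 10 images or 10,000.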
Glad you found the issue. If you run into issues again, always use fastr trace to trace the error back to a specific sample in a specific sink, see https://fastr.readthedocs.io/en/stable/static/user_manual.html#debugging-a-network-run-with-errors.
Regarding the error, it's difficult to change the command-line execution to pass such arguments as lists, as WORC relies on the fastr package for this and does not do it itself. I will ask the fastr developers whether they can change this. For now, you could either execute the command manually now that you have found it, or, probably easier, reduce the number of images used for fingerprinting. The default number of images for fingerprinting is 100, see config['Fingerprinting']['max_num_image'] in the config (https://worc.readthedocs.io/en/latest/static/configuration.html#fingerprinting). I set that pretty high just to be sure it's enough, but 10 - 20 should also be enough. If you're using SimpleWORC or BasicWORC, just change this using the add_config_overrides function of those objects.
Hope that helps.
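For reference, such an override might look like the sketch below. It assumes the overrides are a nested {section: {option: value}} dictionary with string values, as in the WORC tutorial examples; the variable name experiment stands for an already created SimpleWORC or BasicWORC object and is an assumption here, so the call itself is left as a comment.

```python
# Nested as {section: {option: value}}; string values are an assumption
# based on the examples in the WORC tutorials.
overrides = {
    "Fingerprinting": {
        "max_num_image": "20",  # 10 - 20 should be enough according to the reply above
    },
}

# With an existing SimpleWORC or BasicWORC object (assumed name: experiment):
# experiment.add_config_overrides(overrides)
print(overrides["Fingerprinting"]["max_num_image"])
```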
Thank you for your framework and reply. But I'm afraid this option won't help with this issue: looking back at the previous errors, len(command) equals 668, far greater than the default value of config['Fingerprinting']['max_num_image'], i.e., 100.
The option config['Fingerprinting']['max_num_image'] is used internally in:
https://github.com/MStarmans91/WORC/blob/101642bb7e42c9cdc453b778855fbbf3c1290654/WORC/tools/fingerprinting.py#L117-L123
However, the failure mentioned above occurs when fastr creates the subprocess (i.e., when starting a queued job). At that point, the fingerprinting process does not yet exist. It is more a limitation of fastr on the Windows platform. It seems necessary to refactor the input/output form of Fingerprinting and related parts. Perhaps I can help.
On the other hand, I tested my code on Linux. It seems to run very well, except for a minor issue: the calcFeat jobs fail with pyradiomics=3.1.0, which was released 3 weeks ago. A workaround is to install pyradiomics=3.0.1 first and then install WORC. I have submitted an issue: https://github.com/AIM-Harvard/pyradiomics/issues/831.
BTW, does WORC have any pause-and-resume mechanism? If not, could such functionality be added? Debugging with one's own data is indeed time-consuming: catch an error, run again from scratch, catch an error, run again from scratch, ... 😢
Thanks for the detailed reply! I was hoping you wouldn't hit the limit this way, but you're right, you still do, and this is a general limit on Windows. I've raised an issue at the fastr package, which, as I mentioned, performs the execution and is thus responsible for this limitation, see https://gitlab.com/radiology/infrastructure/fastr/-/issues/1. For small experiments like the tutorial everything works fine, but for larger experiments with more data this issue persists.
In the meantime, glad everything ran smoothly on Linux; I hope pyradiomics fixes the bug soon.
WORC does have a pause-and-resume mechanism built in. Again, this falls back on fastr, which performs the execution, see also https://fastr.readthedocs.io/en/stable/static/user_manual.html#continuing-a-network. In summary, fastr saves all temporary output in a folder named after the experiment, and if you run an experiment with the same name, it checks which jobs previously completed successfully and only reruns the rest. Hence, as long as you keep the experiment name the same in WORC, e.g., https://github.com/MStarmans91/WORCTutorial/blob/master/WORCTutorialBasic.py#L86 of the WORCTutorialBasic, WORC will automatically resume from where it ended previously. Note that it will look like all jobs are still running and nothing is skipped, but these jobs only check whether the previous instances ran successfully and the output is valid, so this should be very quick.
I'll leave this issue open until there is a fix in fastr.
I followed the tutorial and fed it my own data. The network execution finished, but most of the jobs/classification tasks failed.
I ran
fastr trace E:/WORC/Tmp\__sink_data__.json --sinks
and got:
Running on Windows, Python 3.7, installed via pip.
WORC_config.py:
How can I debug this or find out whether anything is misconfigured? BTW, does WORC have any pause-and-resume mechanism?