Open ghost opened 9 years ago
Dear Peter, This is very strange and rather worrying: The test runs perfectly on python3.2 on my box and in Travis-CI.
Would you mind rerunning the tests but increasing the verbosity to 6 for the last two calls to pipeline_run() to show what is happening? These are on lines 221 and 227. Change:
pipeline_run(forcedtorun_tasks = [make_start], multiprocess = 10, verbose=0, pipeline= "main")
...
pipeline_run(multiprocess = 10, verbose=0, pipeline= "main")
to
pipeline_run(forcedtorun_tasks = [make_start], multiprocess = 10, verbose=6, pipeline= "main")
...
pipeline_run(multiprocess = 10, verbose=6, pipeline= "main")
Hi Leo,
sure, below is the (lengthy) output. Is it possible that this is caused by something in the "environment"; e.g., I was thinking about the file system and added a time.sleep(1)
after each file creation, but the test is still failing.
Best
Peter
Running test_split_subdivide_checkpointing.py
Run pipeline normally...
_______________________________________
Tasks which will be run:
Task enters queue = 'test::test_split_subdivide_checkpointing.make_start'
-> start
Job = [None -> test_split_subdivide_checkpointing/start] ...
Missing file [test_split_subdivide_checkpointing/start]
Job = [None -> test_split_subdivide_checkpointing/start] completed
Completed Task = 'test::test_split_subdivide_checkpointing.make_start'
Task enters queue = 'test::test_split_subdivide_checkpointing.split_start'
-> XXX.split where XXX = 0 .. N, N = previous N + 1
Job = [[test_split_subdivide_checkpointing/start] -> test_split_subdivide_checkpointing/*.split] Missing output file
Job = [[test_split_subdivide_checkpointing/start] -> test_split_subdivide_checkpointing/*.split] completed
Completed Task = 'test::test_split_subdivide_checkpointing.split_start'
Task enters queue = 'test::test_split_subdivide_checkpointing.subdivide_start'
Job = [test_split_subdivide_checkpointing/0.split -> test_split_subdivide_checkpointing/0_*.subdivided, test_split_subdivide_checkpointing/0] Missing output file
Job = [test_split_subdivide_checkpointing/0.split -> test_split_subdivide_checkpointing/0_*.subdivided, test_split_subdivide_checkpointing/0] completed
Completed Task = 'test::test_split_subdivide_checkpointing.subdivide_start'
Check that running again does nothing. (All up to date).
________________________________________
Tasks which are up-to-date:
Task = 'test::test_split_subdivide_checkpointing.make_start'
"-> start"
Job = [None
-> test_split_subdivide_checkpointing/start]
Task = 'test::test_split_subdivide_checkpointing.split_start'
"-> XXX.split where XXX = 0 .. N, N = previous N + 1"
Job = [[test_split_subdivide_checkpointing/start]
-> test_split_subdivide_checkpointing/*.split]
Task = 'test::test_split_subdivide_checkpointing.subdivide_start'
Job = [test_split_subdivide_checkpointing/0.split
-> test_split_subdivide_checkpointing/0_*.subdivided
, test_split_subdivide_checkpointing/0]
________________________________________
________________________________________
Tasks which will be run:
Running again with forced tasks to generate more files...
________________________________________
Tasks which will be run:
Task enters queue = 'test::test_split_subdivide_checkpointing.make_start' : Forced to rerun
-> start
force task Job = [None -> test_split_subdivide_checkpointing/start] to rerun
Job = [None -> test_split_subdivide_checkpointing/start] completed
Completed Task = 'test::test_split_subdivide_checkpointing.make_start'
Task enters queue = 'test::test_split_subdivide_checkpointing.split_start'
-> XXX.split where XXX = 0 .. N, N = previous N + 1
Job = [[test_split_subdivide_checkpointing/start] -> test_split_subdivide_checkpointing/*.split] unnecessary: already up to date
Uptodate Task = 'test::test_split_subdivide_checkpointing.split_start'
Task enters queue = 'test::test_split_subdivide_checkpointing.subdivide_start'
Job = [test_split_subdivide_checkpointing/0.split -> test_split_subdivide_checkpointing/0_*.subdivided, test_split_subdivide_checkpointing/0] unnecessary: already up to date
Uptodate Task = 'test::test_split_subdivide_checkpointing.subdivide_start'
test_split_subdivide_checkpointing/
E Run pipeline normally...
________________________________________
Tasks which will be run:
Task enters queue = 'test_split_subdivide_checkpointing.make_start'
-> start
Job = [None -> test_split_subdivide_checkpointing/start] ...
Missing file [test_split_subdivide_checkpointing/start]
Job = [None -> test_split_subdivide_checkpointing/start] completed
Completed Task = 'test_split_subdivide_checkpointing.make_start'
Task enters queue = 'test_split_subdivide_checkpointing.split_start'
-> XXX.split where XXX = 0 .. N, N = previous N + 1
Job = [[test_split_subdivide_checkpointing/start] -> test_split_subdivide_checkpointing/*.split] Missing output file
Job = [[test_split_subdivide_checkpointing/start] -> test_split_subdivide_checkpointing/*.split] completed
Completed Task = 'test_split_subdivide_checkpointing.split_start'
Task enters queue = 'test_split_subdivide_checkpointing.subdivide_start'
Job = [test_split_subdivide_checkpointing/0.split -> test_split_subdivide_checkpointing/0_*.subdivided, test_split_subdivide_checkpointing/0] Missing output file
Job = [test_split_subdivide_checkpointing/0.split -> test_split_subdivide_checkpointing/0_*.subdivided, test_split_subdivide_checkpointing/0] completed
Completed Task = 'test_split_subdivide_checkpointing.subdivide_start'
Check that running again does nothing. (All up to date).
________________________________________
Tasks which are up-to-date:
Task = 'test_split_subdivide_checkpointing.make_start'
"-> start"
Job = [None
-> test_split_subdivide_checkpointing/start]
Task = 'test_split_subdivide_checkpointing.split_start'
"-> XXX.split where XXX = 0 .. N, N = previous N + 1"
Job = [[test_split_subdivide_checkpointing/start]
-> test_split_subdivide_checkpointing/*.split]
Task = 'test_split_subdivide_checkpointing.subdivide_start'
Job = [test_split_subdivide_checkpointing/0.split
-> test_split_subdivide_checkpointing/0_*.subdivided
, test_split_subdivide_checkpointing/0]
____________________________________
________________________________________
Tasks which will be run:
Running again with forced tasks to generate more files...
________________________________________
Tasks which will be run:
Task enters queue = 'test_split_subdivide_checkpointing.make_start' : Forced to rerun
-> start
force task Job = [None -> test_split_subdivide_checkpointing/start] to rerun
Job = [None -> test_split_subdivide_checkpointing/start] completed
Completed Task = 'test_split_subdivide_checkpointing.make_start'
Task enters queue = 'test_split_subdivide_checkpointing.split_start'
-> XXX.split where XXX = 0 .. N, N = previous N + 1
Job = [[test_split_subdivide_checkpointing/start] -> test_split_subdivide_checkpointing/*.split] unnecessary: already up to date
Uptodate Task = 'test_split_subdivide_checkpointing.split_start'
Task enters queue = 'test_split_subdivide_checkpointing.subdivide_start'
Job = [test_split_subdivide_checkpointing/0.split -> test_split_subdivide_checkpointing/0_*.subdivided, test_split_subdivide_checkpointing/0] unnecessary: already up to date
Uptodate Task = 'test_split_subdivide_checkpointing.subdivide_start'
test_split_subdivide_checkpointing/
E
======================================================================
ERROR: test_newstyle_ruffus (test_split_subdivide_checkpointing.Test_ruffus)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_split_subdivide_checkpointing.py", line 171, in test_newstyle_ruffus
expected_files_after_3_runs)
File "test_split_subdivide_checkpointing.py", line 140, in check_file_exists_or_not_as_expected
raise Exception("Expected file %s" % (tempdir + ee))
Exception: Expected file test_split_subdivide_checkpointing/1.split
======================================================================
ERROR: test_ruffus (test_split_subdivide_checkpointing.Test_ruffus)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_split_subdivide_checkpointing.py", line 215, in test_ruffus
expected_files_after_3_runs)
File "test_split_subdivide_checkpointing.py", line 140, in check_file_exists_or_not_as_expected
raise Exception("Expected file %s" % (tempdir + ee))
Exception: Expected file test_split_subdivide_checkpointing/1.split
----------------------------------------------------------------------
Ran 2 tests in 10.950s
FAILED (errors=2)
Something very strange seems to be happening. For some strange reason make_start()
is not updating its output file times. This may be just because of a combination of interactions between timestamping and poor file system time granularities.
I have updated Ruffus to add (even more) status messages for up to date tasks.
Could you do a git pull and run the unittest with an extra "-v" on the command line and send me the output?
python test/test_split_subdivide_checkpointing.py -v
Thanks
Leo
Sorry that it took me so long to re-run the tests - busy days... I ran the checkout version 1202 of the test on three different file systems (I can send you the output via mail if you want, it's a bit too long to post it here). Bottom line:
So, right now, it appears to be related to high latency of the file system; however, I added again a time.sleep(1)
after each file creation step and re-tested on our cluster file system, it still fails (there is no load on the file servers as I write these lines). This was surprising, so I increased from 1 second to 3 seconds, yet the tests are still failing. That leaves me puzzled; is adding time.sleep
not the right way of going after this? Or should I talk to our IT guys, because the latency of our cluster file system is expected to be lower than 1s when there is no high load?
So I guess you can close this as "not a Ruffus issue", but I would still like to hear your opinion on that. Thanks
Peter
Dear Peter, It is obviously a ruffus issue if Ruffus is not working properly :-) The thing that we have to work out is what the latency means 1) The created files are not appearing immediately 2) The created files do not have the correct file times 3) The timestamps have extremely poor resolution 4) Your cluster head node system time is out of sync with that of the disk servers. Could you send me the full output via mail please? My email address is ruffus at llew.org.uk
Thanks
Leo
I just ran the unit tests for 2.6.3 (using Python 3.2) and the following two unit tests fail:
After following this "hint" ;-) in the code...
I can confirm that some of the files are created as expected (well, obviously):
However, right now I fail to see where the whole thing breaks... Please let me know if you need more information. Best Peter