ReproNim / reproman

ReproMan (AKA NICEMAN, AKA ReproNim TRD3)
https://reproman.readthedocs.io

run an HTCondor cluster on AWS #490

Closed mjtravers closed 4 years ago

mjtravers commented 4 years ago

Fixes #392

Basic functionality is in place. Still a work in progress.

codecov[bot] commented 4 years ago

Codecov Report

Merging #490 into master will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #490   +/-   ##
=======================================
  Coverage   89.35%   89.35%           
=======================================
  Files         149      149           
  Lines       12267    12267           
=======================================
  Hits        10961    10961           
  Misses       1306     1306


chaselgrove commented 4 years ago

This looks fine to me and works with both datalad-pair-run and plain. The Travis tests time out, but the tests pass for me.

What else does this need? Anything else to test?

kyleam commented 4 years ago

The Travis tests time out

I'll take a look tomorrow to see if I can figure out what's stalling.

kyleam commented 4 years ago

The Travis tests time out

I'll take a look tomorrow to see if I can figure out what's stalling.

Looks like the stalling test is test_orchestrators.py::test_orc_datalad_concurrent[sub:condor-orc:pair-run]. Without the changes in this PR, the Travis job doesn't stall. The likely culprit from this PR is this change to the condor template:

diff --git a/reproman/support/jobs/job_templates/submission/condor.template b/reproman/support/jobs/job_templates/submission/condor.template
index bdaf040ad..c125fe507 100644
--- a/reproman/support/jobs/job_templates/submission/condor.template
+++ b/reproman/support/jobs/job_templates/submission/condor.template
@@ -9,6 +9,8 @@ environment  = ""
 Output  = {{ _meta_directory }}/stdout.$(Process)
 Error   = {{ _meta_directory }}/stderr.$(Process)
 Log     = {{ _meta_directory }}/log.$(Process)
+should_transfer_files   = Yes
+when_to_transfer_output = ON_EXIT

 {#
   TODO: Need to check spec form compatibility between different batch

The changes in this PR do not make any of the tests stall for me locally, so perhaps we're looking at git-annex-related ssh stalling that we've been dealing with on DataLad's end. That was specific to Xenial, so I've triggered a job with Bionic to see if the stall still happens there.

In my local runs, the above change leads to a failure in test_orc_datalad_run[sub:condor-orc:pair]. Interestingly, that seems to have passed in the stalled job on Travis. I need to look into it more.
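For anyone trying to tell a stall apart from an ordinary failure in a local run, here is a minimal sketch that runs a command under a time limit and reports which of the three outcomes occurred. The helper name `run_with_timeout` and the time limit are hypothetical and not part of ReproMan; the pytest invocation in the comment is only an assumption about how the test might be selected.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd and return 'passed', 'failed', or 'stalled'.

    Hypothetical helper for illustration; not part of ReproMan.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if it has not finished within the time limit.
        result = subprocess.run(cmd, timeout=seconds)
    except subprocess.TimeoutExpired:
        return "stalled"
    return "passed" if result.returncode == 0 else "failed"


# Assumed usage (test selection via pytest's -k option):
# run_with_timeout(
#     [sys.executable, "-m", "pytest", "-k", "test_orc_datalad_concurrent"],
#     seconds=600,
# )
```

A "stalled" result here only says the command exceeded the limit; it does not identify whether ssh, git-annex, or condor is the cause.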

kyleam commented 4 years ago

I just pushed a commit (ec48e8461) that removes those condor settings. As I mentioned in that commit, those settings are from trying to get things working without NFS, and I don't think they make sense as of 3d93c7e0a.
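To illustrate why those settings only matter without a shared filesystem, here is a minimal sketch of a submit description that adds the file-transfer lines conditionally. This is not ReproMan's template code: the function `render_submit` and the `shared_filesystem` flag are hypothetical, while the Output/Error/Log lines and the two transfer settings are copied from the diff above.

```python
def render_submit(meta_directory, shared_filesystem=True):
    """Return HTCondor submit-description lines as one string.

    Hypothetical helper for illustration; not part of ReproMan.
    """
    lines = [
        f"Output  = {meta_directory}/stdout.$(Process)",
        f"Error   = {meta_directory}/stderr.$(Process)",
        f"Log     = {meta_directory}/log.$(Process)",
    ]
    if not shared_filesystem:
        # Without a shared filesystem such as NFS, ask HTCondor to
        # transfer files itself and to send output back when the job exits.
        lines.append("should_transfer_files   = Yes")
        lines.append("when_to_transfer_output = ON_EXIT")
    return "\n".join(lines) + "\n"


# With a shared filesystem (the default here), no transfer lines are emitted.
print(render_submit("/path/to/meta"))
```

With the default `shared_filesystem=True`, the output matches the template as it stands after the settings were removed.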

chaselgrove commented 4 years ago

Works on macOS.

kyleam commented 4 years ago

Works on macOS.

Great, thanks for checking.

The Travis job no longer stalls, and the job with condor enabled passes. Another job fails, but it's in the make -C ... phase after the tests. I haven't seen that in recent Travis runs, I don't see it locally, and it seems unlikely to be related to the PR, so I'd say we should wait to worry about it until it pops up again.

chaselgrove commented 4 years ago

@kyleam I think we overlapped earlier; I started the macOS test on 929119e, before your message. Do your changes need another test?

kyleam commented 4 years ago

@kyleam I think we overlapped earlier; I started the macOS test on 929119e, before your message.

Ah, thought that was fast :)

Do your changes need another test?

If you don't mind, it'd be good to confirm. Thanks.

chaselgrove commented 4 years ago

@kyleam ec48e84 works on macOS.