ReproNim / reproman

ReproMan (AKA NICEMAN, AKA ReproNim TRD3)
https://reproman.readthedocs.io

Can't fetch datalad-pair job results #561

Closed chaselgrove closed 3 years ago

chaselgrove commented 3 years ago

Fetching the job results for a datalad-pair orchestrated job on an aws-ec2 instance fails with:

2020-11-24 16:40:14,942 [ERROR ] ConnectionOpenFailedError: 'ssh -fN -o ControlMaster=auto -o ControlPersist=15m -o ControlPath=/home/ch/.cache/datalad/sockets/bc54bd86 ubuntu@3.238.172.53' failed with exitcode 255 [Failed to open SSH connection (could not start ControlMaster process)] [sshconnector.py:open:541] (ConnectionOpenFailedError)

To reproduce:

reproman create -t aws-ec2 -b image=ami-0fe4bc00534545c58 -b instance_type=t3.large -b key_filename=$HOME/.ssh/aws -b key_name=<key name> -b user=ubuntu nitrc-ce
datalad create data
cd data
reproman run -r nitrc-ce --orchestrator datalad-pair pwd
reproman jobs <job id>

This AMI is CE-LITE; the same happens on ami-07ae3592c03add705 (CE standard).

Using the plain orchestrator works as expected.

kyleam commented 3 years ago

Hmm, I don't see the identity file in the ConnectionOpenFailedError, but there's a dance behind the scenes to try to set that up.

https://github.com/ReproNim/reproman/blob/7af2e407fb60d782dc049e62082744600eff0574/reproman/support/jobs/orchestrators.py#L773-L779

Are you using DataLad v0.13.5 or later? There was a rework of some SSH-related things, and I wonder if somehow the identity file argument got lost in this spot.
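To make "I don't see the identity file" concrete, here is a minimal sketch that rebuilds the command from the error message with and without an `-i` option. The helper `control_master_cmd` is hypothetical (it is not DataLad's or ReproMan's code), and the key path `/home/ch/.ssh/aws` is an assumption inferred from `$HOME/.ssh/aws` in the report and the home directory visible in the ControlPath.

```python
import shlex


def control_master_cmd(host, socket_path, identity_file=None):
    """Build an ssh ControlMaster command line (hypothetical helper)."""
    cmd = [
        "ssh", "-fN",
        "-o", "ControlMaster=auto",
        "-o", "ControlPersist=15m",
        "-o", f"ControlPath={socket_path}",
    ]
    if identity_file is not None:
        # -i is the standard ssh option for selecting a private key.
        cmd += ["-i", identity_file]
    cmd.append(host)
    return cmd


host = "ubuntu@3.238.172.53"
socket_path = "/home/ch/.cache/datalad/sockets/bc54bd86"

# Same shape as the command in the error: no key is passed.
without_key = control_master_cmd(host, socket_path)

# What one would expect if the custom identity file had been set up.
# The key path is an assumption, see the note above.
with_key = control_master_cmd(host, socket_path,
                              identity_file="/home/ch/.ssh/aws")

print(shlex.join(without_key))
print(shlex.join(with_key))
```

Without `-i`, ssh falls back to its default keys, so a connection to an instance that only accepts the AWS key pair would be expected to fail.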

kyleam commented 3 years ago

I said:

There was a rework of some SSH-related things, and I wonder if somehow the identity file argument got lost in this spot.

No, that doesn't look to be the issue (and it wasn't a good guess, because that would have prevented any setup of the remote in the first place). I think the issue is that when we resurrect the orchestrator to fetch a job, it doesn't trigger the code linked above that sets up a custom identity file.
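A toy model of that suspicion, as a sketch only: every class and method name here is hypothetical and none of it is ReproMan's actual code. It assumes the identity file is configured as a side effect of preparing the remote, so an orchestrator object rebuilt later just to fetch results never goes through that step.

```python
class ToyOrchestrator:
    """Hypothetical stand-in for an orchestrator; not ReproMan's class."""

    def __init__(self, host, key_filename=None):
        self.host = host
        self.key_filename = key_filename
        self.identity_configured = False

    def _configure_identity(self):
        # Stand-in for the step that tells the SSH layer which key to use.
        if self.key_filename:
            self.identity_configured = True

    def prepare_remote(self):
        # Only the submission path configures the identity file.
        self._configure_identity()

    def fetch(self):
        if self.key_filename and not self.identity_configured:
            raise ConnectionError("could not start ControlMaster process")
        return "fetched"


class PatchedOrchestrator(ToyOrchestrator):
    """One possible direction: configure the identity before fetching too."""

    def fetch(self):
        self._configure_identity()
        return super().fetch()


# Submission: the remote is prepared first, so a fetch on this object works.
fresh = ToyOrchestrator("ubuntu@host", key_filename="/path/to/key")
fresh.prepare_remote()
print(fresh.fetch())

# Resurrection: a new object is built just to fetch, so the setup is skipped.
resurrected = ToyOrchestrator("ubuntu@host", key_filename="/path/to/key")
try:
    resurrected.fetch()
except ConnectionError as exc:
    print("fetch failed:", exc)

# With the setup moved into the fetch path, the resurrected case works.
print(PatchedOrchestrator("ubuntu@host", key_filename="/path/to/key").fetch())
```

This would also be consistent with the plain orchestrator working, if its fetch path doesn't depend on DataLad's SSH connection setup, though that is a guess rather than something confirmed here.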