cea-hpc / clustershell

Scalable cluster administration Python framework — Manage node sets, node groups and execute commands on cluster nodes in parallel.
https://clustershell.readthedocs.io/

TreeMode error, copy a file from local to remote nodes. #549

Open luxiaoyong opened 9 months ago

luxiaoyong commented 9 months ago

ClusterShell is excellent! Thanks for sharing this great project. When using tree mode to copy a file to remote nodes and the destination is an existing directory such as /tmp, I run into a problem. My command:

clush -d -o-q -w compute -b -S --copy /home/tt.txt --dest /tmp

This command causes an error. I can avoid it with the following command:

clush -d -o-q -w compute -b -S --copy /home/tt.txt --dest /tmp/

In more serious cases, the directory on the remote node is replaced by the copied file, and no error is reported at all.
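As a side note, the same copy can also be expressed through the Python Task API; here is a minimal sketch (assuming the documented Task.copy() call and working SSH access to the compute node), again relying on the trailing / on the destination:

```python
# Minimal sketch: copy a local file to a remote directory with the Task API.
# Keeping the trailing "/" on dest makes it explicit that /tmp/ is a directory.
from ClusterShell.Task import task_self

task = task_self()
task.copy("/home/tt.txt", "/tmp/", nodes="compute")  # note the trailing slash
task.resume()

for rc, nodes in task.iter_retcodes():
    print("%s: exited with %d" % (",".join(nodes), rc))
```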

The debug log is below:

$ clush -d -o-q -w compute  -b -S --copy /home/tt.txt --dest /tmp
DEBUG:root:clush: STARTING DEBUG
Changing max open files soft limit from 65535 to 8192
User interaction: True
Create STDIN worker: False
clush: enabling tree topology (2 gateways)
clush: nodeset=compute fanout=15 [timeout conn=15.0 cmd=0.0] copy sources=['/home/tt.txt'] dest=/tmp

control
|- control1
|  `- compute
`- control2
   `- compute2

DEBUG:ClusterShell.Worker.Tree:stderr=True
DEBUG:ClusterShell.Worker.Tree:TreeWorker._launch on compute (fanout=15)
DEBUG:ClusterShell.Worker.Tree:copy source=/home/tt.txt, dest=/tmp
DEBUG:ClusterShell.Worker.Tree:copy arcname=tmp destdir=/
DEBUG:ClusterShell.Worker.Tree:next_hops=[('control1', 'compute')]
DEBUG:ClusterShell.Worker.Tree:trying gateway control1 to reach compute
DEBUG:ClusterShell.Worker.Tree:_copy_remote gateway=control1 source=/home/tt.txt dest=/ reverse=False
DEBUG:ClusterShell.Worker.Tree:_copy_remote: tar cmd: tar -xf - -C '/'
DEBUG:ClusterShell.Task:pchannel: creating new channel <ClusterShell.Propagation.PropagationChannel object at 0x7f6a0bd32760>
SSHCLIENT: ssh -q -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes control1 CLUSTERSHELL_GW_PYTHON_EXECUTABLE=/home/itool/inspector_agent/env/bin/python /home/itool/inspector_agent/env/bin/python -m ClusterShell.Gateway -Bu
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f6a0c857ee0> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f6a0c857ee0> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f6a0c857ee0> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f6a0c857ee0> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f6a0c857ee0> not registered
DEBUG:ClusterShell.Propagation:shell nodes=compute timeout=-1 worker=140093441719552 remote=True
DEBUG:ClusterShell.Propagation:send_queued: 0
DEBUG:ClusterShell.Propagation:write buflen=10240
DEBUG:ClusterShell.Propagation:send_queued: 1
DEBUG:ClusterShell.Worker.Tree:TreeWorker: _check_ini (0, 0)
control1: b'<?xml version="1.0" encoding="utf-8"?>'
control1: b'<channel version="1.9.1"><message type="ACK" msgid="2" ack="0"></message>'
DEBUG:ClusterShell.Propagation:recv: Message CHA (type: CHA, msgid: 3)
DEBUG:ClusterShell.Propagation:channel started (version 1.9.1 on remote gateway)
DEBUG:ClusterShell.Propagation:recv: Message ACK (type: ACK, msgid: 2, ack: 0)
DEBUG:ClusterShell.Propagation:recv_cfg
DEBUG:ClusterShell.Propagation:CTL - connection with gateway fully established
DEBUG:ClusterShell.Propagation:dequeuing sendq: Message CTL (type: CTL, msgid: 1, srcid: 140093441719552, action: shell, target: compute)
control1: b'<message type="ACK" msgid="4" ack="1"></message>'
DEBUG:ClusterShell.Propagation:recv: Message ACK (type: ACK, msgid: 4, ack: 1)
DEBUG:ClusterShell.Propagation:got ack (ACK)
DEBUG:ClusterShell.Propagation:dequeuing sendq: Message CTL (type: CTL, msgid: 2, srcid: 140093441719552, action: write, target: compute)
control1: b'<message type="ACK" msgid="6" ack="2"></message>'
DEBUG:ClusterShell.Propagation:recv: Message ACK (type: ACK, msgid: 6, ack: 2)
DEBUG:ClusterShell.Propagation:got ack (ACK)
control1: b'<message type="SER" msgid="7" srcid="140093441719552" nodes="compute">gASVXgAAAAAAAABDWnRhcjogdG1wOiBDYW5ub3Qgb3BlbjogRmlsZSBleGlzdHMKdGFyOiBFeGl0aW5nIHdpdGggZmFpbHVyZSBzdGF0dXMgZHVlIHRvIHByZXZpb3VzIGVycm9yc5Qu</message>'
DEBUG:ClusterShell.Propagation:recv: Message SER (type: SER, msgid: 7, srcid: 140093441719552, nodes: compute)
control1: b'<message type="RET" msgid="8" srcid="140093441719552" retcode="2" nodes="compute"></message>'
compute: tar: tmp: Cannot open: File exists
compute: tar: Exiting with failure status due to previous errors
DEBUG:ClusterShell.Propagation:recv: Message RET (type: RET, msgid: 8, srcid: 140093441719552, retcode: 2, nodes: compute)
clush: compute: exited with exit code 2
DEBUG:ClusterShell.Worker.Tree:_on_remote_node_close compute 0 via gw control1
DEBUG:ClusterShell.Worker.Tree:check_fini 1 1
DEBUG:ClusterShell.Worker.Tree:TreeWorker._check_fini <ClusterShell.Worker.Tree.TreeWorker object at 0x7f6a0bd43d00> call pchannel_release for gw control1
DEBUG:ClusterShell.Task:pchannel_release control1 <ClusterShell.Worker.Tree.TreeWorker object at 0x7f6a0bd43d00>
DEBUG:ClusterShell.Task:pchannel_release: destroying channel <ClusterShell.Propagation.PropagationChannel object at 0x7f6a0bd32760>
DEBUG:ClusterShell.Propagation:ev_close gateway=control1 <ClusterShell.Propagation.PropagationChannel object at 0x7f6a0bd32760>
DEBUG:ClusterShell.Propagation:ev_close rc=None
DEBUG:ClusterShell.Propagation:error on gateway control1 (setup=True)
DEBUG:ClusterShell.Propagation:gateway control1 now set as unreachable
DEBUG:ClusterShell.Worker.EngineClient:<EnginePort at 0x140093453336000 (streams=(7, 8))>: dropped msg: (<function Task._abort at 0x7f6a0c844280>, (False,), {})
degremont commented 9 months ago

Thanks. Could you just confirm the version you are using?

luxiaoyong commented 6 months ago

I appreciate your attention to this problem. My version is 1.9.1, and we have temporarily worked around the problem by adding a trailing /.

degremont commented 6 months ago

This is a known limitation. The code says:

The only case that we don't support is when source is a file and dest is a dir without a finishing / (in that case we cannot determine remotely whether it is a file or a directory).

The code cannot know, before initiating the transfer, whether /tmp is an existing directory on the remote nodes. There is no handshake where this could be negotiated in two steps.

The recommendation is to add a trailing / when the destination is a remote directory.
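To illustrate why the trailing / matters, here is a simplified sketch of the packing rule (not the actual ClusterShell.Worker.Tree code): without the trailing /, the source file is archived under the name tmp and extracted in /, which matches the arcname=tmp destdir=/ lines in the debug log above and collides with the existing /tmp directory.

```python
# Simplified sketch of the packing rule, NOT the actual ClusterShell.Worker.Tree
# code. With dest="/tmp" (no trailing "/"), the file is stored in the tar stream
# as "tmp" and extracted under "/", colliding with the existing /tmp directory;
# with dest="/tmp/", it is stored as "tt.txt" and extracted under "/tmp/".
import io
import os.path
import tarfile

def pack(source, dest):
    if dest.endswith("/"):                   # "/tmp/" -> dest is clearly a directory
        arcname = os.path.basename(source)   # "tt.txt"
        destdir = dest
    else:                                    # "/tmp" -> dest taken as the target name
        arcname = os.path.basename(dest)     # "tmp"
        destdir = os.path.dirname(dest)      # "/"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add(source, arcname=arcname)
    # the gateway side then runs: tar -xf - -C '<destdir>'
    return destdir, buf.getvalue()
```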

luxiaoyong commented 4 months ago

Thank you. I see.