deepmodeling / dpdispatcher

generate HPC scheduler systems jobs input scripts and submit these scripts to HPC systems and poke until they finish
https://docs.deepmodeling.com/projects/dpdispatcher/
GNU Lesser General Public License v3.0

[BUG] rsync receive data from remote platform failed #434

Open pxlxingliang opened 7 months ago

pxlxingliang commented 7 months ago

Bug summary

I use dpgen to submit a job that runs the fp step on the Sugon platform. The fp section of machine.json looks like this:

    "fp": [
        {
            "command": "OMP_NUM_THREADS=1 mpirun -np 4 $abacus | tee out.log",
            "machine": {
                "batch_type": "Slurm",
                "context_type": "SSHContext",
                "local_root": "./",
                "remote_root": "/public/home/abacus/tmp",
                "remote_profile": {
                    "key_filename": "sugon",
                    "hostname": "cancon.hpccube.com",
                    "username": "abacus",
                    "port": 65023
                }
            },
            "resources": {
                "batch_type": "Slurm",
                "number_node": 1,
                "cpu_per_node": 32,
                "group_size": 1,
                "queue_name": "kshdnormal",
                "custom_flags": [
                    "#SBATCH --gres=dcu:4"
                ],
                "source_list": [
                    "/public/home/abacus/run_dcu.sh"
                ]
            }
        }
    ]

The fp job is submitted to Sugon and ABACUS runs successfully, but dpgen reports the following error when it downloads the results:

2024-01-23 13:53:23,653 - ERROR : Failed to run ['rsync', '-az', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', 'abacus@cancon.hpccube.com:/public/home/abacus/tmp/695809f93a5474bde7743bddb46cbd857e2906c6/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz', '/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz']: b'rsync: chown "/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/.695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz.sKchjf" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 273, in try_download_result
    self.download_jobs()
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 501, in download_jobs
    self.machine.context.download(self)
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 675, in download
    self._get_files(
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 905, in _get_files
    self.ssh_session.get(from_f, to_f)
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 376, in get
    return rsync(
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/utils.py", line 136, in rsync
    raise RuntimeError(f"Failed to run {cmd}: {err}")
RuntimeError: Failed to run ['rsync', '-az', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', 'abacus@cancon.hpccube.com:/public/home/abacus/tmp/695809f93a5474bde7743bddb46cbd857e2906c6/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz', '/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz']: b'rsync: chown "/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/.695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz.sKchjf" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'
2024-01-23 13:53:23,655 - INFO : Retrying in 1 minute...

It seems that rsync tries to run chown on the downloaded file, and that operation fails.

DP-GEN Version

0.11.1.dev51+gbea559b

Platform, Python Version, Remote Platform, etc

Platform: bohrium

Python: 3.8.8

Remote Platform: Sugon

Input Files, Running Commands, Error Log, etc.

dpgen.zip (an additional Sugon secret file named "sugon" is required). Command: dpgen init_bulk init.json machine.json

Steps to Reproduce

  1. Download the Sugon secret file and name it "sugon".
  2. Modify the fp section in machine.json.
  3. Submit the job: dpgen init_bulk init.json machine.json

Further Information, Files, and Links

No response

njzjz commented 7 months ago

It's not related to the remote machine; it seems you don't have permission to run chown on the local machine.

njzjz commented 7 months ago

Could you try adding the --no-perms flag to rsync?

pxlxingliang commented 7 months ago

> Could you try adding the --no-perms flag to rsync?

I tried adding this flag, but it did not work:

^CTraceback (most recent call last):
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 273, in try_download_result
    self.download_jobs()
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 501, in download_jobs
    self.machine.context.download(self)
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 675, in download
    self._get_files(
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 905, in _get_files
    self.ssh_session.get(from_f, to_f)
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 376, in get
    return rsync(
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/utils.py", line 137, in rsync
    raise RuntimeError(f"Failed to run {cmd}: {err}")
RuntimeError: Failed to run ['rsync', '-az', '--no-perms', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', 'abacus@cancon.hpccube.com:/public/home/abacus/tmp/013b6a211b33560666b55f011a60f9771da63b60/013b6a211b33560666b55f011a60f9771da63b60.tar.gz', '/personal/test/init_and_run2/Al.STRU.02x01x01/00.place_ele/013b6a211b33560666b55f011a60f9771da63b60.tar.gz']: b'rsync: chown "/personal/test/init_and_run2/Al.STRU.02x01x01/00.place_ele/.013b6a211b33560666b55f011a60f9771da63b60.tar.gz.JIoelN" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'

This issue may be related to the permissions of the Bohrium directory "/personal". When I run the same test in other paths, it works.
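One way to check this hypothesis outside of dpgen is to test whether chown is allowed on a file in the target directory. The following is a minimal sketch for Unix-like systems, not part of dpdispatcher; the helper name `can_chown` and its parameters are assumptions made for illustration. Note that rsync, when run as root, tries to set the remote owner on the local copy, so passing a uid and gid other than your own is the closer reproduction.

```python
import os
import tempfile


def can_chown(directory, uid=None, gid=None):
    """Return True if chown succeeds on a temporary file in `directory`.

    Hypothetical diagnostic helper (Unix only). By default it uses the
    current uid/gid; pass other ids to mimic rsync preserving a remote owner.
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as tmp:
            os.chown(tmp.name, uid, gid)
        return True
    except OSError:
        # Covers PermissionError ("Operation not permitted") as well as a
        # missing or unwritable directory.
        return False
```

Running `can_chown("/personal/test")` and comparing the result with a path where the download works would show whether the directory itself rejects chown.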

njzjz commented 7 months ago

Try --no-o. I guess --no-g may also be required. As the rsync help text below shows, -a implies -o and -g, which is why rsync attempts chown even with --no-perms:

    -r, --recursive             recurse into directories
    -l, --links                 copy symlinks as symlinks
    -p, --perms                 preserve permissions
    -t, --times                 preserve modification times
    -o, --owner                 preserve owner (super-user only)
    -g, --group                 preserve group
    -D                          same as --devices --specials
        --devices               preserve device files (super-user only)
        --specials              preserve special files

-a is equivalent to -rltpgoD
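As a rough sketch of what this suggestion means for the command shown in the traceback, the function below assembles the same rsync argument list with --no-o and --no-g placed after -a. It is not dpdispatcher's actual code: `build_rsync_cmd`, its parameters, and the example file names are assumptions made for illustration.

```python
from typing import List, Optional


def build_rsync_cmd(
    from_file: str,
    to_file: str,
    port: int = 22,
    key_filename: Optional[str] = None,
    preserve_owner_group: bool = False,
) -> List[str]:
    """Build an rsync argument list similar to the one in the error log.

    Hypothetical helper: by default it disables owner/group preservation,
    which -a would otherwise turn on via -o and -g.
    """
    ssh_cmd = [
        "ssh",
        "-o", "ConnectTimeout=10",
        "-o", "BatchMode=yes",
        "-o", "StrictHostKeyChecking=no",
        "-p", str(port),
        "-q",
    ]
    if key_filename is not None:
        ssh_cmd += ["-i", key_filename]

    cmd = ["rsync", "-az"]
    if not preserve_owner_group:
        # Must come after -a so that they override the implied -o and -g.
        cmd += ["--no-o", "--no-g"]
    cmd += ["-e", " ".join(ssh_cmd), "-q", from_file, to_file]
    return cmd


# Example with placeholder file names (not the real job archive).
example_cmd = build_rsync_cmd(
    "abacus@cancon.hpccube.com:/public/home/abacus/tmp/job.tar.gz",
    "./job.tar.gz",
    port=65023,
    key_filename="sugon",
)
```

With these flags rsync still preserves permissions and modification times but no longer tries to set the owner or group of the local copy, so no chown call should be needed.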

njzjz commented 7 months ago

I transferred this issue to dpdispatcher, as it is more relevant there.