GoogleCloudPlatform / gsutil

A command line tool for interacting with cloud storage services.
Apache License 2.0

gsutil rsync hangs on file transfer #497

Open srcc-chekh opened 6 years ago

srcc-chekh commented 6 years ago

I am using gsutil rsync to sync directories from a regular POSIX fileserver over to GCP buckets.

On one particular directory, gsutil rsync hangs. I'd like to figure out what is making it hang. I tried running it under strace, and also with the "-D" debug mode.

example command: /root/google-cloud-sdk/bin/gsutil -D -m rsync -C -e -P -r -x '(.gnupg|.cache|.*Icon\r$)' /home01/whatever gs://bucket-name

Here is the output when it gets stuck:

crc32c: u'n3lv8A=='
generation: 1515639588224581
md5Hash: u'8Ci6mkl3FqcYhC2GMbfzvQ=='
metadata: <MetadataValue additionalProperties: [<AdditionalProperty key: 'goog-reserved-file-mtime' value: u'1455726049'>, <AdditionalProperty key: 'goog-reserved-posix-gid' value: u'11623'>, <AdditionalProperty key: 'goog-reserved-posix-uid' value: u'11623'>, <AdditionalProperty key: 'goog-reserved-file-atime' value: u'1515540088'>, <AdditionalProperty key: 'goog-reserved-posix-mode' value: u'664'>]>
name: u'test.txt'
size: 25
timeCreated: datetime.datetime(2018, 1, 11, 2, 59, 48, 172000, tzinfo=<apitools.base.protorpclite.util.TimeZoneOffset object at 0x7f9948fe1a90>)>] prefixes: []>
At source listing 300000...
Starting synchronization
process count: 20
thread count: 5
/ [0/1 files][ 0.0 B/ 0.0 B]

So there is one particular file somewhere that it gets stuck on. How can I troubleshoot further? I tried stracing my command and looking for all open and access calls, hoping to hit upon the problematic file.

At some point I noticed a file that had mode 000, so I thought that was the problem, but I was not able to reproduce it: making a new file with mode 000 uploads just fine.

OK, I tried the same thing without "-m" but with "-D" and I get:

...
acl: []
crc32c: u'n3lv8A=='
generation: 1515639588224581
md5Hash: u'8Ci6mkl3FqcYhC2GMbfzvQ=='
metadata: <MetadataValue additionalProperties: [<AdditionalProperty key: 'goog-reserved-file-mtime' value: u'1455726049'>, <AdditionalProperty key: 'goog-reserved-posix-gid' value: u'11623'>, <AdditionalProperty key: 'goog-reserved-posix-uid' value: u'11623'>, <AdditionalProperty key: 'goog-reserved-file-atime' value: u'1515540088'>, <AdditionalProperty key: 'goog-reserved-posix-mode' value: u'664'>]>
name: u'test.txt'
size: 25
timeCreated: datetime.datetime(2018, 1, 11, 2, 59, 48, 172000, tzinfo=<apitools.base.protorpclite.util.TimeZoneOffset object at 0x7f3a375ad890>)>] prefixes: []>
Starting synchronization
process count: 1
thread count: 1

and then the rsync hangs there indefinitely.

srcc-chekh commented 6 years ago

This is only a problem because I run this out of cron, and the cron jobs never complete; they just sit there at the end, and then I have to go clean them up. It would be nice to figure out what makes the rsync get "stuck".

houglum commented 6 years ago

I've seen this same issue mentioned a couple of times recently, and I've run into it once myself -- it happened on a VM where I was running a long test in a loop. To "fix" the hang, we rebooted the VM, which also reinstalled gsutil upon restart... this worked, but I understand the whole turn-it-off-and-back-on-again approach shouldn't be considered a proper fix :)

Given that this doesn't seem to be an isolated occurrence, I'd like to ask a few initial troubleshooting questions:

srcc-chekh commented 6 years ago

Thanks for your comment!

1) I don't seem to have anything in my .gsutil:

[root@io1 ~]# pwd; ls -alhtrR .gsutil/
/root
.gsutil/:
total 12K
dr-xr-x--- 13 root root 4.0K Feb 1 16:35 ..
-rw-r--r-- 1 root root 0 Feb 16 13:44 metrics.log
drwxr-xr-x 2 root root 4.0K Feb 16 13:45 .
-rw------- 1 root root 2.5K Feb 16 13:46 credstore

I will try your second suggestion.

For the third suggestion: this is a small but primary file server, so I'm not sure when the next downtime window will be in which I can reboot it.

srcc-chekh commented 6 years ago

Adding "-o 'GSUtil:prefer_api=xml' " makes the debug output look different, but the program still does not exit at the end.

houglum commented 6 years ago

Since this issue was filed (against v4.28, assuming you were using the most recent version), we've made several updates to our dependencies to fix deadlock issues. Most notably, we updated oauth2client -- version 4.X has different file locking functionality, which we use for synchronizing credential reads/writes. Have you seen this happen with any version >= 4.29?

srcc-chekh commented 6 years ago

some names obfuscated:

[root@io1 ~]# gsutil --version
gsutil version: 4.31

[root@io1 ~]# /root/google-cloud-sdk/bin/gsutil -m rsync -C -e -P -r -x '(.gnupg|.cache|.*Icon\r$)' /home01/user gs://our-bucket
Building synchronization state...
At destination listing 10000...
At source listing 10000...
At destination listing 20000...
At source listing 20000...
At destination listing 30000...
At destination listing 40000...
At source listing 30000...
At destination listing 50000...
At destination listing 60000...
At source listing 40000...
At destination listing 70000...
At destination listing 80000...
At destination listing 90000...
At source listing 50000...
At destination listing 100000...
At destination listing 110000...
At source listing 60000...
At destination listing 120000...
At destination listing 130000...
At source listing 70000...
At destination listing 140000...
At source listing 80000...
At destination listing 150000...
At destination listing 160000...
At source listing 90000...
At destination listing 170000...
At destination listing 180000...
At destination listing 190000...
At source listing 100000...
At destination listing 200000...
At source listing 110000...
At destination listing 210000...
At destination listing 220000...
At source listing 120000...
At destination listing 230000...
At destination listing 240000...
At source listing 130000...
At destination listing 250000...
At destination listing 260000...
At source listing 140000...
...
At source listing 290000...
At source listing 300000...
Starting synchronization...
Copying file:///home01/user/.bash_history [Content-Type=application/octet-stream]...
| [1/2 files][530.1 KiB/530.1 KiB] 99% Done

It's been stuck for hours. How do I figure out which file it is stuck on? There don't seem to be any open file handles.

houglum commented 6 years ago

Removing the -m flag will put you in single-threaded mode and will result in linear, easy-to-follow debug output if you use the -D option. Assuming you're mostly synced, you should only have a small set of files to sync, so one thread should be fine; if you replace -m with -D, you can get an idea of where the rsync operation is getting stuck. (I'd also be interested to see whether this deadlock still happens in single-threaded mode -- if not, it's a good indicator that our parallelism framework might have a deadlock issue somewhere.) Things to look for in the debug output might be:

Note that if you find anything useful in these logs and want to post them, they will contain sensitive information (local file paths, OAuth 2.0 tokens in requests and their headers, etc.) that you should redact and double-check before posting.
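
For example, taking the command from earlier in this thread, dropping -m, adding -D, and capturing everything to a file for later inspection would look something like this (the log filename is just a placeholder):

/root/google-cloud-sdk/bin/gsutil -D rsync -C -e -P -r -x '(.gnupg|.cache|.*Icon\r$)' /home01/whatever gs://bucket-name 2>&1 | tee rsync-debug.log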

srcc-chekh commented 6 years ago

I ran it without -m, with -D, and with -DD, and I'm not sure I can see anything useful in the logs. From the situations you described, I seem to have a "reply:" line for each "send:" line, so maybe it matches the third case?

It gets stuck after these last lines:

Starting synchronization...
process count: 1
thread count: 1

and sits there indefinitely. After I hit Ctrl-C, it prints this before it exits:

DEBUG 0503 15:35:09.889770 metrics.py] [Metric(endpoint='https://ssl.google-analytics.com/collect', method='POST', body='cd1=rsync&cd10=1525386909880&cd2=D%2CD%2Co&cd3=C%2CP%2Ce%2Cr%2Cx&cd4=content_language%3Aen%2Cdebug%3A0%2Cdefault_api_version%3A2%2Chttps_validate_certificates%3ATrue&cd5=rsync&cd6=ControlCException&cid=dbc3d9314c6845da8ddb4eba0bf3bff2&cm1=666014&cm2=0&cm3=0&ea=rsync&ec=Command&el=4.31&ev=0&t=event&tid=UA-36037335-16&v=1', user_agent='Linux/3.10.0-693.11.1.el7.x86_64'),
Metric(endpoint='https://ssl.google-analytics.com/collect', method='POST', body='cd1=rsync&cd10=1525386909880&cd2=D%2CD%2Co&cd3=C%2CP%2Ce%2Cr%2Cx&cd4=content_language%3Aen%2Cdebug%3A0%2Cdefault_api_version%3A2%2Chttps_validate_certificates%3ATrue&cd5=rsync&cd6=ControlCException&cid=dbc3d9314c6845da8ddb4eba0bf3bff2&cm1=666014&cm3=0&ea=ControlCException&ec=FatalError&el=4.31&ev=0&t=event&tid=UA-36037335-16&v=1', user_agent='Linux/3.10.0-693.11.1.el7.x86_64'),
Metric(endpoint='https://ssl.google-analytics.com/collect', method='POST', body='cd1=rsync&cd10=1525386909888&cd2=D%2CD%2Co&cd3=C%2CP%2Ce%2Cr%2Cx&cd4=content_language%3Aen%2Cdebug%3A0%2Cdefault_api_version%3A2%2Chttps_validate_certificates%3ATrue&cd5=rsync&cd6=ControlCException&cd7=none&cd8=file&cd9=file%2Cgs&cid=dbc3d9314c6845da8ddb4eba0bf3bff2&cm1=1525386909888&cm10=0&cm11=0.986301763124&cm14=755924&cm3=0&cm4=20&cm5=5&cm6=0&cm9=0&ea=FileToCloud&ec=PerformanceSummary&el=4.31&ev=0&t=event&tid=UA-36037335-16&v=1', user_agent='Linux/3.10.0-693.11.1.el7.x86_64')]
DEBUG 0503 15:35:09.895597 metrics.py] Metrics reporting process started...

houglum commented 6 years ago

So it gets stuck at some point after here: https://github.com/GoogleCloudPlatform/gsutil/blob/adb27ce76d020704dfcd285f337328f443c7b9db/gslib/commands/rsync.py#L1450

In single-threaded mode, you may want to try throwing in a breakpoint statement after that line -- import pdb; pdb.set_trace() -- or throw in some print statements into _RsyncFunc() or command.py's Apply() and see how far the invocation gets.

I'd like to be more helpful than recommending breakpoints or printf-style debugging, but I can't do much myself since I'm unable to trigger this with any data set I've tried.

srcc-chekh commented 6 years ago

I see it also prints this sometimes before the final metrics message:

...
  File "/root/google-cloud-sdk/platform/gsutil/gslib/commands/rsync.py", line 1199, in _RsyncFunc
    # non-dry-run mode.
  File "/root/google-cloud-sdk/platform/gsutil/gslib/sig_handling.py", line 89, in _SignalHandler
    _final_signal_handlers[signal_num](signal_num, cur_stack_frame)
  File "/root/google-cloud-sdk/platform/gsutil/gslib/sig_handling.py", line 138, in MultithreadedMainSignalHandler
    stack_trace = ''.join(traceback.format_list(traceback.extract_stack()))
...

I will try some print statements and get back to you.

srcc-chekh commented 6 years ago

OK, I think I got it:

The output of my print statements:

Starting synchronization...
alex: inside try
process count: 1
thread count: 1
alex: inside rsyncfunc
alex: file:///home01/user/cellranger-2.1.0/tiny/SC_RNA_COUNTER_CS/SC_RNA_COUNTER/_BASIC_SC_RNA_COUNTER/ALIGN_READS/fork0/chnk0-u44f725d7ad/files/_STARtmp/tmp.fifo.read1

[root@io1 ~]# ls -alhtr /home01/user/cellranger-2.1.0/tiny/SC_RNA_COUNTER_CS/SC_RNA_COUNTER/_BASIC_SC_RNA_COUNTER/ALIGN_READS/fork0/chnk0-u44f725d7ad/files/_STARtmp/tmp.fifo.read1
prw------- 1 11623 10513 0 Dec 4 15:18 /home01/user/cellranger-2.1.0/tiny/SC_RNA_COUNTER_CS/SC_RNA_COUNTER/_BASIC_SC_RNA_COUNTER/ALIGN_READS/fork0/chnk0-u44f725d7ad/files/_STARtmp/tmp.fifo.read1

[root@io1 ~]# file /home01/user/cellranger-2.1.0/tiny/SC_RNA_COUNTER_CS/SC_RNA_COUNTER/_BASIC_SC_RNA_COUNTER/ALIGN_READS/fork0/chnk0-u44f725d7ad/files/_STARtmp/tmp.fifo.read1
/home01/user/cellranger-2.1.0/tiny/SC_RNA_COUNTER_CS/SC_RNA_COUNTER/_BASIC_SC_RNA_COUNTER/ALIGN_READS/fork0/chnk0-u44f725d7ad/files/_STARtmp/tmp.fifo.read1: fifo (named pipe)

The other users whose home directories this happens for (gsutil rsync hangs) are also users of the cellranger package (from 10x Genomics), so that's an orphaned file from some failed cellranger run.

I guess maybe gsutil should be modified to skip named pipes?

houglum commented 6 years ago
  1. Yeah, we should probably skip those on rsync operations. I thought we did, but it seems not.
  2. I never would have guessed the cause was a named pipe... neat. And I'm really glad this didn't end up being a deadlock bug in our parallelism framework :)
  3. At what line in the source did you add the "alex: <file path>" output line? Seems like something that would have been helpful to have in this situation (printing the file we're about to read, if we're in debug mode), so it's probably worth adding for ease of future debugging.
srcc-chekh commented 6 years ago

My line numbers don't seem to exactly match the repo master.

I put it just inside the line elif diff_to_apply.diff_action == DiffAction.COPY:

because I knew/guessed it was the "copy" case in that conditional.

1193     elif diff_to_apply.diff_action == DiffAction.COPY:
1194       src_url_str = diff_to_apply.src_url_str
1195       src_url = StorageUrlFromString(src_url_str)
1196       print("alex: ", src_url)
1197       cls.logger.info('alex: %s', src_url)
1198       if cls.dryrun:
1199         if src_url.IsFileUrl():
1200           # Try to open the local file to detect errors that would occur in
1201           # non-dry-run mode.
1202           try:
1203             with open(src_url.object_name, 'rb') as _:
1204               pass
1205           except Exception, e:  # pylint: disable=broad-except
1206             cls.logger.info('Could not open %s' % src_url.object_name)
1207             raise
1208         cls.logger.info('Would copy %s to %s', src_url, dst_url)
1209       else:

I guess because it opens the file and gets stuck just waiting for output from the pipe, it never gets to that "Would copy %s" line, so maybe you can just put that line at the top of its if statement instead of at the end.

Thinking about it more, I'm not sure; it's going to succeed on the open() and then hang forever and that is not a bug but a feature of the pipe :) Maybe you can put a check for whether the file is a pipe before doing the open().
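
For reference, opening a FIFO for reading when no writer has it open normally blocks inside the open() call itself, so any such check would need to happen before the open. A minimal standard-library sketch of that kind of check -- the helper name here is illustrative and not gsutil's actual API -- might look like:

import os
import stat


def is_named_pipe(path):
  """Return True if the local path refers to a FIFO (named pipe)."""
  try:
    mode = os.lstat(path).st_mode  # lstat so a symlink to a FIFO isn't followed
  except OSError:
    return False
  return stat.S_ISFIFO(mode)


if __name__ == '__main__':
  # Example: check a couple of (hypothetical) paths before opening them
  # for a blocking read.
  for candidate in ('/tmp/tmp.fifo.read1', '/etc/hostname'):
    if is_named_pipe(candidate):
      print('skipping named pipe: %s' % candidate)
    else:
      print('ok to open (or missing): %s' % candidate)

In gsutil itself, the equivalent check would presumably log a warning and skip the entry rather than attempt the copy.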


jurgenhaas commented 5 years ago

I'm seeing the same issue with named pipes. Rsync hangs indefinitely. When I add them to the ignore regex, my rsync works just fine.
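
For example (the bucket name, paths, and pipe-matching pattern below are placeholders; the exact regex depends on where the pipes live in your tree), the exclusion can be passed with -x, which matches Python regular expressions against paths relative to the source directory:

gsutil -m rsync -r -x 'path/with/pipes/.*|.*\.fifo$' /local/dir gs://my-bucket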

houglum commented 5 years ago

I'm not available to work on this at the moment, but if anyone (on this thread or otherwise) wants to take a swing at sending a PR, I think this is the relevant part of the code where we should check if the file is a FIFO and skip it if so:

https://github.com/GoogleCloudPlatform/gsutil/blob/d7281accc526df971d0c4c66327776b5dd1299f9/gslib/commands/rsync.py#L720

We probably want to employ some logic similar to the stuff I added to storage_url to check whether something is a fifo:

https://github.com/GoogleCloudPlatform/gsutil/blob/d7281accc526df971d0c4c66327776b5dd1299f9/gslib/storage_url.py#L419

yuanzhou commented 5 years ago

I'm using gsutil rsync to back up files from a Filestore NFS instance to a bucket and encountered the same issue recently. It's not the first time, though.

gsutil -m rsync -r -U -e -c /home gs://my-hpc-backup

I'm skipping the symbolic links and at one point it kept showing something like this:

At destination listing 20000...
At source listing 10000...
At destination listing 30000...
At destination listing 40000...
At source listing 20000...
At destination listing 50000...
At destination listing 60000...
At source listing 30000...
At destination listing 70000...

Is this still a bug, or am I not using the flags correctly?

srcc-chekh commented 5 years ago

Try something like: find /home/ -type p

yuanzhou commented 5 years ago

@srcc-chekh it didn't output anything.

adityaponnada commented 3 years ago

It looks like this issue still persists with gsutil. I am syncing data from Firebase to a discovery cluster and it is stuck on "At destination listing ..." and just takes way too long, even when using -m.

Ara4Sh commented 2 years ago

I have the same problem syncing data to AWS S3. The rsync hangs when the cron job tries to transfer files to S3. -DD only shows information about listing objects, I think.

Command to run:

/snap/google-cloud-sdk/232/platform/gsutil/gsutil -o GSUtil:default_project_id=my-backup-project -D -o GSUtil:prefer_api=xml rsync -e -d -r gs://my-gcs-backups/ s3://my-gcs-backups/

the debug env:

gsutil version: 5.8
checksum: 12d7709aebfab92624aada4a1fc20e2c (OK)
boto version: 2.49.0
python version: 3.8.11 (default, Dec 23 2021, 04:09:42) [Clang 12.0.1 ]
OS: Linux 5.13.0-1023-gcp
multiprocessing available: True
using cloud sdk: True
pass cloud sdk credentials to gsutil: True
config path(s): /etc/boto.cfg, /root/.config/gcloud/legacy_credentials/backup@my-backup.iam.gserviceaccount.com/.boto, /root/.aws/credentials
gsutil path: /snap/google-cloud-sdk/232/bin/gsutil
compiled crcmod: True
installed via package manager: False
editable install: False
Command being run: /snap/google-cloud-sdk/232/platform/gsutil/gsutil -o GSUtil:default_project_id=my-backup-project -D -o GSUtil:prefer_api=xml rsync -e -d -r gs://my-gcs-backups/ s3://my-gcs-backups/
config_file_list: ['/etc/boto.cfg', '/root/.config/gcloud/legacy_credentials/backup@my-backup.iam.gserviceaccount.com/.boto', '/root/.aws/credentials', '/root/.aws/credentials']
config: [('working_dir', '/mnt/pyami'), ('debug', '0'), ('https_validate_certificates', 'true'), ('working_dir', '/mnt/pyami'), ('debug', '0'), ('default_project_id', 'my-backup'), ('default_api_version', '2'), ('prefer_api', 'xml')]
Building synchronization state...
process count: 2
thread count: 5

UPDATE-1: I found two files in /tmp:

gsutil-rsync-dst-hn8ts9cc  
gsutil-rsync-src-_1dl049h

I removed those files and used the -m option to make the process faster. For now, everything works correctly.