Azure / azure-storage-azcopy

The new Azure Storage data transfer utility - AzCopy v10

azcopy copy when piping to stdin is not skipping empty space #1119

Open colemickens opened 4 years ago

colemickens commented 4 years ago

Which version of the AzCopy was used?

Note: The version is visible when running AzCopy without any argument

I changed how I'm uploading. Azure disks are a bit slow to resize, so I'm uploading pre-sized images. However, those build artifacts are huge, so I store them zstd-compressed to save a massive amount of space.

So, I'd like to be able to upload like this:

  sasurl="$(az storage blob generate-sas \
    --permissions acdrw \
    --expiry "$(date -u -d "1 hour" '+%Y-%m-%dT%H:%MZ')" \
    --account-name "${image_strg_acct}" \
    --container-name "vhd" \
    --name "${image_filename}" \
    --full-uri -o tsv)"

  zstdcat "${image_vhd}" \
    | azcopy copy "${sasurl}" --blob-type PageBlob

However, when I do this, my upload duration goes through the roof. It seems like maybe azcopy is no longer intelligently skipping over the empty chunks.
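
To put a number on "empty", here is a rough check of how much of the decompressed stream is real data (a sketch assuming GNU coreutils; counting non-NUL bytes only approximates what a sparse-aware upload would actually need to send):

  zstdcat "${image_vhd}" | wc -c                 # full logical size azcopy writes today
  zstdcat "${image_vhd}" | tr -d '\0' | wc -c    # bytes that are actually non-zero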

Which platform are you using? (ex: Windows, Mac, Linux)

Linux

What command did you run?

Note: Please remove the SAS to avoid exposing your credentials. If you cannot remember the exact command, please retrieve it from the beginning of the log file.
+ azcopy copy 'https://job17994.blob.core.windows.net/vhd/20.09.20200729.d3ff247.vhd?se=2020-07-30T09%3A23Z&sp=racwd&sv=2018-11-09&sr=b&sig=redacted%3D' --blob-type PageBlob

What problem was encountered?

When the source is piped in over stdin, azcopy appears to upload the full logical size of the VHD instead of skipping the large zeroed regions, so the upload takes far longer than expected.

How can we reproduce the problem in the simplest way?

Pipe a huge VHD with huge amounts of blank space into azcopy copy.
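
For example, roughly (a sketch; assumes a writable SAS in ${sasurl} and GNU truncate):

  # Make a 10 GiB file that is almost entirely zeros (it reads back as NULs).
  truncate -s 10G sparse.vhd

  # Piping it over stdin is the slow case: the whole 10 GiB gets written out.
  cat sparse.vhd | azcopy copy "${sasurl}" --blob-type PageBlob

  # Uploading the same file directly from disk is the fast case this issue
  # contrasts against, since the zeroed ranges get skipped there.
  azcopy copy sparse.vhd "${sasurl}" --blob-type PageBlob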

Have you found a mitigation/solution?

Extracting to disk first and then uploading works, but I'd prefer not to do that.
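
Concretely, something like this (needs enough free disk for the full-size image):

  zstdcat "${image_vhd}" > disk.vhd
  azcopy copy disk.vhd "${sasurl}" --blob-type PageBlob
  rm disk.vhd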

colemickens commented 4 years ago

Also, just to be sure: does the empty-range skipping work with page blobs at all?

colemickens commented 4 years ago

It seems like the stdin path only works for block blobs: it ignores the page blob type argument and hands off to the SDK, which just uploads everything without trying to skip empty sections: https://github.com/Azure/azure-storage-azcopy/blob/25635976913d156222cffec8ca3693fe6a0afb65/cmd/copy.go#L982
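
One way to sanity-check part of that (assumes the Azure CLI; ${sas_token} here is a hypothetical variable holding just the SAS token rather than the full URL): after a piped upload, look at what blob type was actually created.

  az storage blob show \
    --account-name "${image_strg_acct}" \
    --container-name "vhd" \
    --name "${image_filename}" \
    --sas-token "${sas_token}" \
    --query properties.blobType -o tsv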

It would be insanely useful for this feature to work correctly with page blobs...

colemickens commented 4 years ago

This is the scenario. So far, blobxfer is the only tool I have found that can do it:

+ zstdcat /tmp/nix-shell.LL3PRa/tmp.5NK8mbUwCW/disk.vhd.zstd
+ blobxfer upload --storage-url 'https://job13211.blob.core.windows.net/vhd/20.09.20200729.d3ff247.vhd?se=2020-07-30T10%3A48Z&sp=racwd&sv=2018-11-09&sr=b&sig=REDACTED%3D' --local-path -
2020-07-30 02:48:51.467 DEBUG - credential: account=job13211 endpoint=core.windows.net is_sas=True can_create_containers=False can_list_container_objects=False can_read_object=True can_write_object=True
2020-07-30 02:48:51.469 INFO - 
============================================
         Azure blobxfer parameters
============================================
         blobxfer version: 1.9.4
                 platform: Linux-5.7.10-x86_64-with-glibc2.2.5
               components: CPython=3.8.3-64bit azstor.blob=2.1.0 azstor.file=2.1.0 crypt=2.9.2 req=2.23.0
       transfer direction: local -> Azure
                  workers: disk=16 xfer=32 md5=0 crypto=0
                 log file: None
                  dry run: False
              resume file: None
                  timeout: connect=10 read=200 max_retries=1000
                     mode: StorageModes.Auto
                  skip on: fs_match=False lmt_ge=False md5=False
                   delete: extraneous=False only=False
                overwrite: True
                recursive: True
            rename single: False
         strip components: 0
              access tier: None
         chunk size bytes: 0
           one shot bytes: 0
         store properties: attr=False cc='' ct=<mime> md5=False
           rsa public key: None
       local source paths: -
============================================
2020-07-30 02:48:51.469 INFO - blobxfer start time: 2020-07-30 02:48:51.469250-07:00
2020-07-30 02:48:51.469 DEBUG - spawning 16 disk threads
2020-07-30 02:48:51.481 DEBUG - spawning 32 transfer threads
2020-07-30 02:48:51.489 DEBUG - 0 files 0.0000 MiB filesize, lmt_ge, or no overwrite skipped
2020-07-30 02:48:51.489 DEBUG - 1 local files processed, waiting for upload completion of approx. 0.0000 MiB
2020-07-30 02:49:45.815 INFO - elapsed upload + verify time and throughput of 48.8281 GiB: 54.329 sec, 7362.6139 Mbps (920.327 MiB/s)
2020-07-30 02:49:45.815 INFO - blobxfer end time: 2020-07-30 02:49:45.815782-07:00 (elapsed: 54.347 sec)

This lets me handle a 100GB image that never has to touch the disk at its full size, upload it quickly (about a minute, because the blank sections are skipped), and do it from stdin straight to a page blob.

However, as so often happens to me with Python, I hit a packaging issue with blobxfer, so I'd love to have this functionality available in azcopy. Thanks so much!!
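
(For what it's worth, installing blobxfer into an isolated environment, e.g. with pipx, is one possible workaround for the Python packaging side, though it obviously doesn't help with azcopy itself:)

  pipx install blobxfer
  zstdcat "${image_vhd}" | blobxfer upload --storage-url "${sasurl}" --local-path -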

kf6kjg commented 1 year ago

Had the same problem, though my compressor was the lowly gzip. The packaging issue with blobxfer seems to have been fixed, but I then ran into https://github.com/Azure/blobxfer/issues/144 when trying to use it. :/

colemickens commented 1 year ago

Gotta love Azure.