Azure / azure-storage-azcopy

The new Azure Storage data transfer utility - AzCopy v10
MIT License

Monotonic memory growth bug in `azcopy jobs show <jobID>` for large job, significantly worse with `--with-status` flag #2642

Open jidicula opened 5 months ago

jidicula commented 5 months ago

Which version of the AzCopy was used?

Note: The version is visible when running AzCopy without any argument

What command did you run?

Note: Please remove the SAS to avoid exposing your credentials. If you cannot remember the exact command, please retrieve it from the beginning of the log file.

What problem was encountered?

Out of memory kill from the OS

How can we reproduce the problem in the simplest way?

Run the above commands with any of the above AzCopy versions on an Ubuntu VM against a large completed job (my scenario included 225 million files).

Have you found a mitigation/solution?

The only workaround I have for inspecting errors is to grep the job's logs for COPYFAILED and redirect the matches to a separate file for further examination:

grep COPYFAILED ~/.azcopy/<jobID>* > logged-failures.txt

I noticed that when running `azure-storage-azcopy jobs show <jobID> --with-status=Failed` for a large job (~370 TB over 225 million files), the command exits with code 137 and prints `Killed` to stderr. This appears to be the kernel's out-of-memory killer sending SIGKILL (signal 9) to the azcopy process.
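For anyone checking the same symptom: shells conventionally report 128 + N when a process is terminated by signal N, so 137 maps to signal 9. Below is a minimal sketch of that decoding; the function name `describe_exit_status` is only illustrative and is not part of AzCopy.

```shell
# Decode a shell exit status. Values above 128 conventionally mean
# "terminated by signal (status - 128)".
describe_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # `kill -l N` prints the name of signal N (KILL for 9).
        echo "killed by signal $sig (SIG$(kill -l "$sig"))"
    else
        echo "exited with status $status"
    fi
}

describe_exit_status 137   # prints: killed by signal 9 (SIGKILL)
```

A SIGKILL alone does not prove the OOM killer was responsible; the kernel log (for example via `dmesg`) should contain an "Out of memory: Killed process ..." entry if it was.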

Is this a known bug?

Some data

I captured some really crude logs with `free` on an Ubuntu 22.04 ARM64 VM in Azure running nothing but `azure-storage-azcopy jobs show <jobID> --with-status=Failed` in a tmux session. System RAM usage grows monotonically until the OS kills azcopy. (I haven't fully correlated the samples with azcopy's invocation, but azcopy definitely gets killed before my memory sample collection is complete.)
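A sampling loop along these lines can reproduce this kind of measurement. It is a rough sketch, not the exact script behind the attached logs: it reads `/proc/meminfo` directly (the same source `free` uses) and approximates used memory as MemTotal minus MemAvailable. The function names and the output file name are assumptions.

```shell
# Print one sample as "<unix-timestamp> <used-MiB>".
# Used memory is approximated as MemTotal - MemAvailable (both in kB).
memory_sample() {
    used_mib=$(awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
                    END {printf "%d", (t - a) / 1024}' /proc/meminfo)
    echo "$(date +%s) $used_mib"
}

# Emit one sample per interval (default 1 s) while the given PID is alive.
sample_while_running() {
    pid=$1
    interval=${2:-1}
    while kill -0 "$pid" 2>/dev/null; do
        memory_sample
        sleep "$interval"
    done
}

# Example usage (hypothetical file name; run next to the command under test):
#   azure-storage-azcopy jobs show <jobID> --with-status=Failed > /dev/null &
#   sample_while_running $! 1 > memprofile.log
```

Because the loop stops as soon as the process disappears, the last line of the log gives an approximate timestamp for when the process was killed.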

I've reproduced this with various combinations of Go and AzCopy versions:

| | Go 1.18.1 | Go 1.22.2 |
|---|---|---|
| azcopy 10.23.0 | azcopy-10.23.0-go1.18.1-linux-arm64-memprofile.log | azcopy-10.23.0-go1.22.2-linux-arm64-memprofile.log |
| azcopy 10.24.0 | azcopy-10.24.0-go1.18.1-linux-arm64-memprofile.log | azcopy-10.24.0-go1.22.2-linux-arm64-memprofile.log |
| azcopy 10.25.0-Preview-1 | didn't test | azcopy-10.25.0-Preview-1-go1.22.2-linux-arm64-memprofile.log |

I also captured a single `free` sampling run with azcopy 10.25.0-Preview-1 and Go 1.22.2 running just `azure-storage-azcopy jobs show <jobID>`. That also shows a monotonic memory increase, but the azcopy command completes before it runs out of memory: azcopy-10.25.0-Preview-1-go1.22.2-linux-arm64-summary-memprofile.log

Here's how the system memory usage for each of these scenarios looks when plotted together:

[Image: plot of system memory usage for each of the scenarios above]

marloeuyjr commented 4 months ago

Was this fixed in 10.25?