abashev / vfs-s3

Amazon S3 driver for Apache commons-vfs (Virtual File System) project
Apache License 2.0
93 stars 50 forks source link

Copying from S3 leaves a temp file #70

Closed haakonn closed 4 years ago

haakonn commented 4 years ago

I copy files from an S3 bucket to the local file system on Linux with copyFrom, and this leaves a temp file in /tmp, named like vfs.17826129911195593952.s3. The file is a copy of the downloaded file, and is managed by S3TempFile.

I would expect this file to be deleted as soon as it's not needed anymore, but it stays indefinitely. Since I transfer a lot of files over time, this eventually causes the disk to fill up.

I immediately thought this was caused by me not closing the file object, but doing that didn't change anything. After experimenting a bit, I found that any read from the InputStream of an S3FileObject leaves this temp file behind.

Even running the bundled samples/s3-copy script leaves this file. This is seen on version 4.3.1. Adjusting the @Grab in that script, I found I could reproduce it as far back as version 4.1.0; earlier versions fail to load with "Error grabbing Grapes", so I haven't gone further back.

So reproducing it is as simple as running something like samples/s3-copy s3://access:secret@s3.eu-central-1.amazonaws.com/s3-tests-2/source.txt file:///tmp/dest.txt

I've seen this behaviour on Ubuntu 20.04 using OpenJDK 11.

haakonn commented 4 years ago

The problem relates to the reference counting in S3TempFile. It's counted up by calls to use(), and down by calls to release(). When a download with FileObject.copyFrom completes, this counter is 2 instead of the 0 required for it to get automatically deleted.

One major problem is the reliance on the garbage collector calling S3TempFile.finalize(), which calls release(). On my systems with OpenJDK 11, this never happens.

Another problem seems to be in S3FileObject.downloadOnce(). Here, the call to tempFile.getFileChannel causes a call to use() which is never matched by a release(). This can be fixed by calling release()in the inner finally block in downloadOnce. Doing this gets the reference counter down to 1.

But, again, it never reaches 0 since the garbage collector never calls S3TempFile.finalize(), and I'm not sure how to address this.

abashev commented 4 years ago

Fixed in 4.3.2