Kong / unirest-java

Unirest in Java: Simplified, lightweight HTTP client library.
http://kong.github.io/unirest-java/
MIT License
2.58k stars 591 forks source link

Allow file download to be interrupted #472

Closed gtaylor1981 closed 1 year ago

gtaylor1981 commented 1 year ago

I'm trying to download a large file (~2GB) using get(...).asFile(path) which runs on a separate thread and can take over 5 minutes to complete.

When my program exits, I try to interrupt the download thread then perform a join, but the download continues, my program does not exit and the join eventually times out.

The reason for this seems to be in FileResponse, where Files.copy(...) is used to transfer the contents of the stream to the file. This runs in an uninterruptible loop.

One suggested option I saw is to use .asFile(path, ExtendedCopyOption.INTERRUPTIBLE) but this appears unsupported by Files.copy() (see source) and throws an exception.

Another option I tried was to use .asFileAsync() and then cancel the returned Future. However, this has a separate issue where the whole file is first downloaded to memory (see https://github.com/Kong/unirest-java/issues/130) which I must avoid.

Do you have any suggestion how to interrupt the copy cleanly? Otherwise, is it something for which support could be added?

ryber commented 1 year ago

you can use void thenConsume(Consumer<RawResponse> consumer) to get direct access to the raw InputStream

gtaylor1981 commented 1 year ago

Thanks, I see what you mean... then I must re-implement the functionality of FileResponse myself, and adding the interrupt handling. That's no trouble, however I am also using a DownloadMonitor and cannot then create a MonitoringInputStream as it is package-private.

ryber commented 1 year ago

I've just published new versions with MonitoringInputStream as public.

Out of curiosity, why would you want to interrupt it anyway?

gtaylor1981 commented 1 year ago

Wow, thanks for the speedy response. It's really helped simplify my code.

As I mentioned, the files I'm downloading are large and can take 5-10 minutes to download. This is happening on a worker thread. When the program closes, the worker thread is interrupted and joined. The join will time out because Files.copy() never stops copying. Of course, eventually the connection closes and the download stops, so it's not so bad. However losing a connection during a download will then appear in the logs as an exception. I was hoping to exit more 'cleanly' using the interrupt.

I'm actually looking at another problem now, that if an exception is thrown in the .thenConsume(...) method, it will get caught in the BaseApacheClient.transformBody() class. The exception handler calls recoverBody() to return the body as a string. However this then appears to then block while it tries to read the whole ~2GB file into a string (which is then used in setParsingException()). Can this be avoided too somehow? If it's a problem that might need fixing, I'll gladly open a new issue for it.

ryber commented 1 year ago

Oh jeesh, thats just a bug, I'll take a look

ryber commented 1 year ago

In 3.14.4 I added two things to try and prevent the recovery part from getting the body:

  1. If the Content-Type of the response is a known binary type (octet-stream, images, pdfs) it skips reading the body
  2. If a UnrecoverableException is thrown it also skips it. This is what is thrown from the FileResponse class when it encounters an error. You could also throw this from the thenConsume method.
gtaylor1981 commented 1 year ago

Fantastic, thanks a lot. I'll give it a try.

gtaylor1981 commented 1 year ago

It works! However, I'm downloading from an AWS S3 Bucket and apparently that sets the Content-Type to binary/octet-stream. This is not in the official list of MIME types.

Unfortunately it therefore also not in the list of pre-defined types in ContentType.

I was able to define it myself (once-off, in a static block) as ContentType.create("binary/octet-stream", true). Maybe this info might help others in the same situation.

ryber commented 1 year ago

I'm of the opinion that if AWS is using it its official enough for me :), we can add it, actually I'll add a little logic that if the MIME type contains "binary" then its binary

gtaylor1981 commented 1 year ago

Sure! Not certain it helps, but another hint that it's a binary file (in this case) is the presence of the Content-Disposition: attachment header, indicating that it's a file to be downloaded (rather than to be parsed/printed etc)

couchoud-t commented 1 year ago

Just a random comment, but I think that while making the class public the non null checks are off. Only content is verified and the rest seems like leftovers of copy/paste. https://github.com/Kong/unirest-java/blob/cc1189427f0c0d4f7a287de20c8c4f35b0374e86/unirest/src/main/java/kong/unirest/core/MonitoringInputStream.java#L46-L49

ryber commented 1 year ago

derp 🤦

ryber commented 1 year ago

fixed the requireNotNull stuff in the last release