deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

The security considerations of automatic library downloads #2632

Open docent opened 1 year ago

docent commented 1 year ago

Hello. I'm quite a seasoned Java developer and new to djl, PyTorch, etc. I was really happy to find djl to use for my experiments with machine learning, since I prefer to program in Kotlin. After setting up and running a simple test project, I was surprised to see the following output:

15:37:55.599 [main] DEBUG ai.djl.engine.Engine -- Registering EngineProvider: PyTorch
(...)
15:37:56.549 [main] INFO ai.djl.pytorch.jni.LibUtils -- Downloading https://publish.djl.ai/pytorch/1.13.1/cpu/win-x86_64/native/lib/torch.dll.gz ...
(...)

In my experience it's very rare in the Java world for a third-party library to start automatic downloads of anything, let alone native code. I can't imagine, say, Spring Framework or any other respected library doing something like that, especially by default. What if the files that you are hosting get compromised somehow? As far as I can see, there is no verification of checksums. Can these native libraries be trusted? These are all valid questions.

In fact, I see no reason why a machine learning library should access the Internet at all. If anything, these files should be downloaded as a Maven/Gradle dependency, or at least by a separate Maven/Gradle task. And definitely not by default.

I'd appreciate hearing your thoughts about these concerns. Thank you.

frankfliu commented 1 year ago

@docent First of all, DJL can run offline perfectly fine. You can use -Doffline=true to enable offline mode and DJL will not try to download native libraries from the network. See: https://docs.djl.ai/docs/demos/development/fatjar/index.html#testing-offline-mode
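
For reference, a minimal Kotlin sketch of what that looks like in code (setting the property programmatically is equivalent to passing -Doffline=true on the command line; the engine-loading lines are only illustrative):

import ai.djl.engine.Engine

fun main() {
    // Must be set before the engine is first touched, otherwise the download may
    // already have started. Equivalent to launching the JVM with -Doffline=true.
    System.setProperty("offline", "true")

    // With offline mode enabled, DJL resolves the native library from the classpath
    // (see the offline native dependencies below) instead of downloading it.
    val engine = Engine.getInstance()
    println("Loaded engine: ${engine.engineName} ${engine.version}")
}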

In order to run in offline mode, you need to include the offline native distribution library in your Maven/Gradle project. Using PyTorch Linux GPU as an example: https://docs.djl.ai/engines/pytorch/pytorch-engine/index.html#linux
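
For the CPU build on win-x86_64 that appears in the log above, the offline dependencies look roughly like this in Gradle Kotlin DSL (the artifact versions here are examples; take the exact versions for your release from the DJL docs):

dependencies {
    implementation("ai.djl:api:0.22.1")
    implementation("ai.djl.pytorch:pytorch-engine:0.22.1")

    // Offline native distribution for one specific platform; with these on the
    // classpath DJL does not need to download torch.dll at runtime.
    runtimeOnly("ai.djl.pytorch:pytorch-native-cpu:1.13.1:win-x86_64")
    runtimeOnly("ai.djl.pytorch:pytorch-jni:1.13.1-0.22.1")
}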

The native library is compiled for a specific target platform, and there are many different combinations:

  1. mac, Linux, Windows
  2. aarch64, x86_64, arm64
  3. different versions of glibc
  4. CUDA version: cu102, cu113, cu117, cu118 ...

If you want your application to run on all platforms, the total size of the offline native Maven jar files is more than 60G. That's why many of our users choose our runtime download option: DJL will detect your platform and download only the library necessary for your application.

docent commented 1 year ago

@frankfliu thanks for the info on the offline mode, I'll take a look. I'm wondering, though, why the system property is named just "offline" as opposed to, say, "ai.djl.offline". It looks like it can easily and unintentionally clash with application properties. It's not infeasible that someone adds a property like that to an application (or a library), causing unintended behavior.
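
To make the concern concrete, a hypothetical sketch of how such a clash could happen (the --offline flag and the sync feature are invented for the example):

fun main(args: Array<String>) {
    // The application uses a generic "offline" property for its own feature toggle,
    // e.g. to skip its own network sync when started with --offline.
    if ("--offline" in args) {
        System.setProperty("offline", "true")
    }

    // DJL reads the same un-namespaced property, so native-library downloads are now
    // disabled as well, and engine initialization will fail unless the offline
    // native jars happen to be on the classpath.
}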