apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Java] JDBC driver too large #40180

Open ryanhamilton opened 9 months ago

ryanhamilton commented 9 months ago

Describe the bug, including details regarding any error messages, version, and platform.

If you add flight-sql-jdbc-driver as a dependency to a Java application, it pulls in 70MB of dependencies, including 8 platform-specific versions of Netty and thousands of files. The H2 database, which includes a JDBC driver and a database engine, is 2.5MB. If you want smaller, here is JDBC implemented as a single file (https://github.com/KxSystems/kdb/blob/master/c/jdbc.java); it is used by hundreds of users globally.

I admire the idea of a common data format, but tying it to a verbose 70MB implementation when someone just wants a Java driver isn't convincing me.

Eclipse, Apache, Google, Gson, FlatBuffers: it's basically every popular Java dependency in the world.


mozilla\public-suffix-list.txt

// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

// Please pull this list from, and only from https://publicsuffix.org/list/public_suffix_list.dat,
// rather than any other VCS sites. Pulling from any other URL is not guaranteed to be supported.

// Instructions on pulling and using this list can be found at https://publicsuffix.org/list/.

// ===BEGIN ICANN DOMAINS===

// ac : http://nic.ac/rules.htm
ac
com.ac
edu.ac
gov.ac
net.ac
mil.ac
org.ac

// ad : https://en.wikipedia.org/wiki/.ad
ad
nom.ad

// ae : https://tdra.gov.ae/en/aeda/ae-policies
ae
co.ae
net.ae
org.ae
sch.ae
ac.ae

Platform-specific Netty takes 13MB.
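One way to reproduce size figures like these is to total the compressed size of a JAR's entries by package prefix. The following is a minimal sketch, not part of the driver or of any Arrow tooling: the class name `JarSizeReport`, the helper `groupOf`, and the choice to group on the first two path segments are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: reports which package prefixes dominate a JAR's size.
public class JarSizeReport {

    // Group key: the first two path segments of an entry (e.g. "io/netty"),
    // or just the first segment when the entry is shallower than that.
    static String groupOf(String entryName) {
        String[] parts = entryName.split("/");
        if (parts.length <= 2) {
            return parts[0];
        }
        return parts[0] + "/" + parts[1];
    }

    // Sum compressed sizes per group for every non-directory entry.
    static Map<String, Long> summarize(String jarPath) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (ZipFile zip = new ZipFile(jarPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // getCompressedSize() returns -1 when unknown; count that as 0.
                long size = Math.max(entry.getCompressedSize(), 0L);
                totals.merge(groupOf(entry.getName()), size, Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JarSizeReport <path-to-jar>");
            return;
        }
        // Print the 20 largest groups, biggest first.
        summarize(args[0]).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(20)
                .forEach(e -> System.out.printf("%8d KiB  %s%n",
                        e.getValue() / 1024, e.getKey()));
    }
}
```

Run against the driver JAR, a report like this should make it easy to see how much of the total comes from Netty, gRPC, or bundled resources such as the public suffix list.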


Component(s)

Java

jduo commented 9 months ago

Would it help to have an unshaded JAR available, @ryanhamilton? At the least, it'd reduce the download size when dependencies are already part of the calling application.

See #37892.

ryanhamilton commented 9 months ago

My interest is that I provide a free SQL IDE, qStudio (https://www.timestored.com/qstudio/). It bundles common, small drivers and automatically downloads larger ones. A user requested support. I don't have a strong need myself and will probably skip supporting InfluxDB automatically for now. Users can download the jar themselves.
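For context, an application that leaves a large driver out of its own bundle can still use a JAR the user downloaded by loading it at runtime. The sketch below shows one common approach using the standard `URLClassLoader` and `ServiceLoader` APIs. It is not how qStudio is known to work; the class name `ExternalDriverLoader` and its methods are hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;
import java.sql.Driver;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical helper: discovers JDBC drivers inside user-supplied JAR files.
public class ExternalDriverLoader {

    // Build a class loader over the given JAR URLs and collect every
    // java.sql.Driver registered through META-INF/services.
    // The loader is deliberately left open: the returned drivers need it
    // to resolve their own classes later.
    static List<Driver> loadDrivers(URL[] jarUrls) {
        URLClassLoader loader = new URLClassLoader(
                jarUrls, ExternalDriverLoader.class.getClassLoader());
        List<Driver> drivers = new ArrayList<>();
        for (Driver driver : ServiceLoader.load(Driver.class, loader)) {
            drivers.add(driver);
        }
        return drivers;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java ExternalDriverLoader <driver.jar> <jdbc-url>");
            return;
        }
        URL jar = Paths.get(args[0]).toUri().toURL();
        for (Driver driver : loadDrivers(new URL[] { jar })) {
            // Report which discovered driver claims to handle the URL.
            if (driver.acceptsURL(args[1])) {
                System.out.println("Driver for URL: " + driver.getClass().getName());
            }
        }
    }
}
```

A driver found this way is typically used by calling `driver.connect(url, properties)` directly, because `DriverManager` only hands out drivers whose classes are visible to the caller's class loader.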

I mostly raised this issue to make you aware that some use cases will care about:

  1. Deployment size
  2. Huge number of dependencies

I wanted to warn you of this early, as it's much harder to reduce size later, once your library has become more popular.

On the huge number of dependencies: I provide qStudio to a number of banks, and they scan the qStudio jar for CVEs. With as many dependencies as you have, bundling your driver would likely trigger a CVE alert somewhere. I would really recommend trimming your dependencies.

Good luck.

alamb commented 8 months ago

I suspect the maintainers would love some help, in the form of PRs, to reduce the number of dependencies in the Arrow JDBC drivers.

laurentgo commented 3 weeks ago

The current driver is around 40MB and comprises roughly 60 dependencies, but those dependencies fall into a few groups (Arrow, Protobuf, gRPC, Netty, Jackson), so in terms of CVEs it's relatively manageable.

As for the size, yes, maybe we can carve things out here and there, but that requires a deep analysis of which classes and resources the driver actually uses. For example, the Mozilla public suffix list is a resource used by some Guava classes to analyze URIs; it probably belongs to code we are not using in the driver, but I'm not 100% sure. As for the native libraries shipped with the driver, they contain a BoringSSL implementation to speed up TLS, and there is one per OS/architecture.
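As a starting point for that kind of analysis, the native libraries inside the JAR can be listed with their uncompressed sizes. This is a rough sketch only: the class name `NativeLibFinder` is hypothetical, and matching on file extension is an assumption about how the bundled natives are named rather than something confirmed in this thread.

```java
import java.io.IOException;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.zip.ZipFile;

// Hypothetical helper: lists native libraries bundled inside a JAR.
public class NativeLibFinder {

    // Assumption: native libraries are recognizable by their file extension.
    static boolean isNativeLibrary(String entryName) {
        String name = entryName.toLowerCase(Locale.ROOT);
        return name.endsWith(".so")
                || name.endsWith(".dll")
                || name.endsWith(".dylib")
                || name.endsWith(".jnilib");
    }

    // One formatted line per native library: uncompressed size, then path.
    // Entries whose size is unknown (-1) are shown as 0 KiB.
    static List<String> describeNatives(String jarPath) throws IOException {
        try (ZipFile zip = new ZipFile(jarPath)) {
            return zip.stream()
                    .filter(e -> !e.isDirectory() && isNativeLibrary(e.getName()))
                    .map(e -> String.format("%8d KiB  %s",
                            Math.max(e.getSize(), 0L) / 1024, e.getName()))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java NativeLibFinder <path-to-jar>");
            return;
        }
        describeNatives(args[0]).forEach(System.out::println);
    }
}
```

The output would show how many OS/architecture variants are shipped and what each one costs, which is useful input when deciding whether per-platform driver artifacts are worth the effort.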