apache / incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
https://xtable.apache.org/
Apache License 2.0
919 stars 147 forks

Reduce size of utilities bundled jar #538

Open vamsikarnika opened 2 months ago

vamsikarnika commented 2 months ago

What is the purpose of the pull request

Brief change log

Verify this pull request

This pull request is a trivial rework / code cleanup without any test coverage.

vamsikarnika commented 2 months ago

With these changes, the bundled xtable-utilities jar comes to around 160 MB. It was around 600 MB before.

vinishjail97 commented 2 months ago

Thanks for the optimizations @vamsikarnika, added some comments. Can you run the new jar with the demos to confirm nothing breaks? I strongly suspect s3/gcs sync will fail without the dependencies for the s3/gcs connectors.

vamsikarnika commented 2 months ago

Hey @vinishjail97, I'm having trouble running the demos locally on my Mac (M2). I get a segmentation fault when I run the command below.

java -jar xtable-utilities/target/xtable-utilities-0.2.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml

Adding the terminal crash report below:

-------------------------------------
Translated Report (Full Report Below)
-------------------------------------

Process:               Terminal [51019]
Path:                  /System/Applications/Utilities/Terminal.app/Contents/MacOS/Terminal
Identifier:            com.apple.Terminal
Version:               2.13 (447)
Build Info:            Terminal-447000000000000~1296
Code Type:             ARM-64 (Native)
Parent Process:        launchd [1]
User ID:               501

Date/Time:             2024-09-19 14:04:33.6076 +0530
OS Version:            macOS 13.4.1 (22F770820d)
Report Version:        12
Anonymous UUID:        C6BC4607-2EAC-FD44-043D-E0ECE9D0D67E

Sleep/Wake UUID:       CE8D2B4E-2C85-4BE8-A588-C203561F81AB

Time Awake Since Boot: 65000 seconds
Time Since Wake:       1707 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_PROTECTION_FAILURE at 0x000000016ebffd00
Exception Codes:       0x0000000000000002, 0x000000016ebffd00

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [51019]

I'm seeing this error during dynamic attach of the jar.

2024-09-19 14:27:03 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:133 - Loading HoodieTableMetaClient from file:/tmp/hudi-dataset/people/.hoodie/metadata
2024-09-19 14:27:03 INFO  org.apache.hudi.common.table.HoodieTableConfig:276 - Loading table properties from file:/tmp/hudi-dataset/people/.hoodie/metadata/.hoodie/hoodie.properties
2024-09-19 14:27:03 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:152 - Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE) from file:/tmp/hudi-dataset/people/.hoodie/metadata
# WARNING: Unable to get Instrumentation. Dynamic Attach failed. You may add this JAR as -javaagent manually, or supply -Djdk.attach.allowAttachSelf

vamsikarnika commented 2 months ago

Yeah, you are right. We need these deps at runtime. mvn dependency:analyze only checks dependencies required at compile time.

I've removed some of the dependencies, such as aws-sdk-bundle, and confirmed sync still works with s3. But after these changes the jar size hasn't dropped by much.
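
One way to see which bundled dependencies still account for most of the jar is to total the compressed entry sizes by package prefix. The following is a minimal sketch, not part of XTable: the helper name size_by_prefix and the default prefix depth of 3 are illustrative assumptions.

```python
import zipfile
from collections import Counter


def size_by_prefix(jar_path, depth=3):
    """Sum compressed entry sizes per leading directory prefix.

    Hypothetical helper for illustration; a jar is read as a plain zip archive.
    """
    totals = Counter()
    with zipfile.ZipFile(jar_path) as jar:
        for info in jar.infolist():
            if info.is_dir():
                continue
            # Keep directory components only; the file name itself is dropped.
            parts = info.filename.split("/")[:-1]
            prefix = "/".join(parts[:depth]) or "(root)"
            totals[prefix] += info.compress_size
    return totals
```

Calling size_by_prefix on the bundled jar (for example the xtable-utilities/target/xtable-utilities-0.2.0-SNAPSHOT-bundled.jar path used in the command earlier in this thread) and then taking .most_common(20) on the result lists the twenty heaviest prefixes in bytes, which may help decide which dependencies are worth excluding or shading next.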