datacontract / datacontract-cli

CLI to manage your datacontract.yaml files
https://cli.datacontract.com

Library Usage Comes With Almost 1 GiB of Dependencies #213

Closed: janhicken closed this 3 weeks ago

janhicken commented 1 month ago

When adding the datacontract-cli package as a dependency to a Python project, a lot of transitive dependencies get added. After adding the dependency, my application's Docker image grew from 330 MiB to 1.2 GiB in size.

My application only uses SodaCL in conjunction with a PostgreSQL database; however, other frameworks like pyspark (340 MB), pyarrow (123 MB) and deltalake (75 MB) are installed as well.

Would it be possible to split the packages per target technology, like Soda does? Alternatively, maybe extras could be used for this.

RobertLD commented 1 month ago

@simonharrer I think it's likely worth splitting a lot of these packages into optional imports aka extras?
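As a rough illustration of what "optional imports" usually means in practice: the heavy dependency is imported only when the feature that needs it is used, and a missing module produces a hint about which extra to install. This is a generic sketch, not the code from this repository; `require_optional`, the package name `example-package`, and the extra name passed in are hypothetical.

```python
import importlib
from types import ModuleType


def require_optional(
    module_name: str, extra: str, package: str = "example-package"
) -> ModuleType:
    """Import an optional dependency lazily, or explain how to install it.

    `package` and `extra` are placeholder names used only to build the hint.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install '{package}[{extra}]'"
        ) from exc


# Usage: call this inside the function that needs the dependency,
# not at module import time, so the base install stays small.
json_module = require_optional("json", extra="demo")
```

With this pattern, users who only need one backend never pay for the others, and users who hit a missing dependency get an actionable message instead of a bare `ModuleNotFoundError`.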

jochenchrist commented 1 month ago

I think this is a fair point now, and we should add extras.

RobertLD commented 1 month ago

> I think this is a fair point now, and we should add extras.

I'm drafting the changes in #234.

RobertLD commented 3 weeks ago

@jochenchrist, a follow-up on this: moving deltalake into an extra cut out 200 MB of the larger dependencies I mentioned in the previous PR. I posted a PR here: ~#240~ #242

(I remade the PR because rebasing is hard haha)

I think 1.5 GB -> 300 MB ought to be enough to close this issue out.

jochenchrist commented 3 weeks ago

I'd agree. At 1/5 of the previous dependency size, the library is much more efficient now :)