dask / fastparquet

Python implementation of the Parquet columnar file format.
Apache License 2.0

Exclude test directory from distributed wheel #516

Open gs11 opened 4 years ago

gs11 commented 4 years ago

While trying to speed up a serverless/Lambda deployment, I found that the fastparquet wheel contains a test folder of ~80 MB.

Could this be excluded from the distribution? I presume it's not needed there.
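
For anyone who wants to check the numbers on their own install, here is a minimal sketch of one way to see which directories take up the space in a wheel. It relies only on the fact that a wheel is a zip archive; the function name size_by_directory and the depth parameter are hypothetical, made up for this illustration, and are not part of fastparquet.

```python
import zipfile
from collections import Counter


def size_by_directory(wheel_path, depth=2):
    """Sum uncompressed member sizes, grouped by the first `depth` directory levels.

    `wheel_path` may be a path to a .whl file or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    totals = Counter()
    with zipfile.ZipFile(wheel_path) as wheel:
        for info in wheel.infolist():
            if info.is_dir():
                continue
            # Drop the file name, keep at most `depth` leading directories.
            directories = info.filename.split("/")[:-1][:depth]
            key = "/".join(directories) or "."
            totals[key] += info.file_size
    return totals


# Example usage (the path below is a placeholder, not a real release file name):
# for directory, size in size_by_directory("path/to/fastparquet.whl").most_common(5):
#     print(f"{size / 1e6:8.2f} MB  {directory}")
```

Grouping by the first two path levels is an arbitrary choice; it is just enough to separate a nested test folder from the rest of the package.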

martindurant commented 4 years ago

Sounds reasonable. Would you like to make this change?

gs11 commented 4 years ago

Sure! I can do that.

gs11 commented 4 years ago

I started looking into it and realized my experience with distributing Python packages is virtually non-existent. When building dists locally, I can't reproduce the issue of the test folder containing the large blobs (e.g. bitbpack, rle, etc.).

martindurant commented 4 years ago

It actually doesn't seem that easy to do if we want to keep a fastparquet[tests] install. I guess we could instead tell people that, to run the tests, they need to clone the repo.

gs11 commented 4 years ago

While I'm not a heavy fastparquet user myself, I'd say that'd be a fair tradeoff.

martindurant commented 4 years ago

One of the points in favour of fastparquet over pyarrow has been the install size, so maybe it's worth someone's time to do this (would involve messing with MANIFEST, I believe). I don't imagine getting to it soon, though.

gs11 commented 4 years ago

Regarding size, I found that in addition to the fastparquet package itself being larger, the total size of fastparquet's dependencies was substantially larger than that of pyarrow's.

Either way, I can't seem to create a distribution with the same contents as the one on PyPI. What does the release process look like today?

martindurant commented 4 years ago

Releases are packaged using python setup.py sdist bdist_wheel, so what gets included depends on the contents of setup.py and MANIFEST.in.

gs11 commented 4 years ago

The release I generate with the above command is less than half a MB, and it does not contain those test blobs. I can't find a way to create a dist that includes those binary files. I can exclude the test folder altogether in MANIFEST.in, but I'm not sure that will have any effect, since the release might be generated differently.
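
To pin down exactly where a local build and the published one diverge, a rough sketch like the following could list the files that appear in only one of two wheels. It assumes both artifacts are wheels (zip archives) that have already been downloaded or built; the helper names archive_names and diff_archives are hypothetical, and the example paths are placeholders.

```python
import zipfile


def archive_names(path):
    """Return the set of file names (directory entries skipped) in a zip/wheel."""
    with zipfile.ZipFile(path) as archive:
        return {name for name in archive.namelist() if not name.endswith("/")}


def diff_archives(local_path, published_path):
    """Return (only_in_local, only_in_published) as sorted lists of file names.

    Hypothetical helper for illustration; accepts paths or binary file objects.
    """
    local = archive_names(local_path)
    published = archive_names(published_path)
    return sorted(local - published), sorted(published - local)


# Example usage (both paths are placeholders):
# only_local, only_published = diff_archives("dist/local.whl", "downloads/published.whl")
# print("\n".join(only_published))
```

If the large blobs show up only in the published list, that would suggest they came from the working directory the release was built in rather than from anything tracked in the repo, though that is a guess on my part.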