Closed emthompson-usgs closed 1 year ago
The source is not available on PyPI and this package is being built using: https://code.usgs.gov/ghsc/esi/groundmotion-processing/-/archive/v1.2.2/groundmotion-processing-v1.2.2.tar.gz
That is a ginormous download BTW, 125 MB! The wheel on PyPI is ~25 MB, which is already quite big but still much smaller than this source. It would be nice if a source distribution were published on PyPI alongside the wheel, and if it were a bit smaller, with just the files required to build the package. With that said, the data files are there, but they are not making it into the final sdist used to build the conda package. If you run:
python -m build --sdist . --outdir dist
as per your pyproject.toml you'll get an empty data directory,
gmprocess-1.2.2/src/gmprocess/data/
gmprocess-1.2.2/src/gmprocess/data/__init__.py
Maybe more files are missing, but I'm not familiar enough with the package to know. You are using setuptools, so you may be able to fix that by following https://setuptools.pypa.io/en/stable/userguide/datafiles.html [1].
Note that you are not building your wheel with the standards in your pyproject.toml! If you were using build and the metadata there, you would get an empty data directory too! (You can try that by downloading the source for that version and running: python -m build --wheel . --outdir dist).
TL;DR it is a problem upstream and you can fix it with [1] and/or adding a MANIFEST.in file.
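One way to check which package data files actually end up in a built artifact is to list the archive directly. The following is only a minimal sketch, not part of gmprocess or the feedstock: the helper name `list_package_data` and the default `gmprocess/data/` marker are assumptions for illustration. It treats `.whl` files as zip archives and everything else as a tarball.

```python
import tarfile
import zipfile
from pathlib import Path


def list_package_data(archive, marker="gmprocess/data/"):
    """Return the files inside `archive` whose path contains `marker`.

    Hypothetical helper: wheels are read as zip files, sdists as tarballs.
    """
    archive = Path(archive)
    if archive.suffix == ".whl":
        with zipfile.ZipFile(archive) as zf:
            # Skip explicit directory entries, keep regular files only.
            names = [n for n in zf.namelist() if not n.endswith("/")]
    else:
        with tarfile.open(archive) as tf:
            names = [m.name for m in tf.getmembers() if m.isfile()]
    return sorted(n for n in names if marker in n)
```

If the result for an sdist contains nothing but `__init__.py`, the data files were dropped at build time, which matches the empty directory described above.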
Thanks for looking at this. The source distribution gets rejected by pypi because of the size. The issue is that it includes the test data, whereas the wheel does not. The only way I can think of to fix the size would be to put the test data somewhere else.
I had been building the wheel with
python -m build
So I don't know why the data contents were included (which led me to think everything was okay in this regard):
$ unzip -l dist/gmprocess-1.2.3.dev0-py3-none-any.whl | grep gmprocess/data
616 08-15-2022 15:45 gmprocess/data/CESMD_NGA_ids.csv
41885 08-15-2022 15:45 gmprocess/data/GDMSstations.json
1122213 08-15-2022 15:45 gmprocess/data/NGA_West2_SiteDatabase_V032.csv
0 08-15-2022 15:45 gmprocess/data/__init__.py
18992 12-22-2022 21:23 gmprocess/data/config_production.yml
25699 11-05-2022 22:57 gmprocess/data/config_test.yml
<snip>
My prior reading of the setuptools page that you linked to made me think that when using pyproject.toml, data files were included by default and didn't require additional specification. It sounds like I'll have to re-read it more carefully.
Quick update: I ran the same command you did to get the source distribution:
python -m build --sdist . --outdir dist
But I don't get an empty data directory:
$ tar -tvf dist/gmprocess-1.2.3.dev0.tar.gz | grep gmprocess/data
drwxr-xr-x 0 emthompson 176539137 0 Dec 23 12:06 gmprocess-1.2.3.dev0/src/gmprocess/data/
-rw-r--r-- 0 emthompson 176539137 616 Aug 15 09:45 gmprocess-1.2.3.dev0/src/gmprocess/data/CESMD_NGA_ids.csv
-rw-r--r-- 0 emthompson 176539137 41885 Aug 15 09:45 gmprocess-1.2.3.dev0/src/gmprocess/data/GDMSstations.json
-rw-r--r-- 0 emthompson 176539137 1122213 Aug 15 09:45 gmprocess-1.2.3.dev0/src/gmprocess/data/NGA_West2_SiteDatabase_V032.csv
-rw-r--r-- 0 emthompson 176539137 0 Aug 15 09:45 gmprocess-1.2.3.dev0/src/gmprocess/data/__init__.py
drwxr-xr-x 0 emthompson 176539137 0 Dec 23 12:06 gmprocess-1.2.3.dev0/src/gmprocess/data/asdf/
<snip>
So I'm wondering why there is this difference in behavior in my install. Could it be the version of setuptools or build? Here's what I have:
build 0.9.0 pypi_0 pypi
setuptools 65.6.3 pyhd8ed1ab_0 conda-forge
Also, I will work on making the source distribution smaller. Some of these data files definitely don't need to be there, and I can also exclude the tests and docs directories which should shave off a ton of space.
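To see where the space in a source distribution goes before deciding what to exclude, the file sizes can be summed per top-level entry. This is a rough sketch under the assumption that the sdist has the usual single `<name>-<version>/` root folder; the function name `size_by_top_level` is made up for illustration.

```python
import tarfile
from collections import Counter


def size_by_top_level(sdist_path):
    """Sum file sizes in bytes per top-level entry below the sdist root."""
    totals = Counter()
    with tarfile.open(sdist_path) as tf:
        for member in tf.getmembers():
            if not member.isfile():
                continue
            parts = member.name.split("/")
            # parts[0] is assumed to be the "<name>-<version>" root folder.
            key = parts[1] if len(parts) > 1 else parts[0]
            totals[key] += member.size
    # Largest entries first.
    return totals.most_common()
```

Running it on the built tarball and printing the pairs gives a quick ranking of directories such as tests or docs by uncompressed size.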
For the latest release (1.2.3) the pypi source and wheel distributions are now much smaller (~5.6 MB) and so the source distribution is not rejected by pypi. It occurs to me now that the source url in the recipe/meta.yml file points to the code.usgs.gov tar.gz of the source, which is still large since that is simply a tar of the repo and not the result of python -m build. So I am thinking I'll change the URL to point to the pypi-hosted source distribution.
> put the test data somewhere else.
Yep, most projects serve the data on GH and have a script to download it at test time [1]. Another strategy, where possible, is to auto-generate the test data.
[1] one pattern I like is to use pooch to fetch it. See https://github.com/Unidata/MetPy/blob/6f62696b9a1bb338a32ad0a8b801087941d5cc43/src/metpy/cbook.py#L32 for an example.
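The download-at-test-time idea can be sketched with the standard library alone. This is not pooch's API and not what MetPy does; it is a simplified stand-in, and the function name `fetch_test_file`, its parameters, and the cache layout are assumptions. pooch handles the same job more robustly (registries, retries, platform cache directories).

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_test_file(url, sha256, cache_dir):
    """Download `url` into `cache_dir` once and verify its SHA-256 digest."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Use the last URL component as the local file name.
    target = cache_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    if digest != sha256:
        # Remove the bad copy so the next call downloads it again.
        target.unlink()
        raise ValueError(f"hash mismatch for {target.name}: got {digest}")
    return target
```

A test would call this with the file's URL and known hash, then open the returned path, so the large files never need to ship in the sdist or wheel.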
> So I am thinking I'll change the URL to point to the pypi-hosted source distribution.
:+1:
Solved with this update. Thanks @ocefpaf!
Solution to issue cannot be found in the documentation.
Issue
Hi @ocefpaf, I got an error report from a user via email and it was clear that the cause was that a file in the data subpackage wasn't being found. Note, we recently switched from primarily using setup.py to pyproject.toml for setup options and when we did this it didn't seem like it was necessary to specify package data, but surely this is where something has gone wrong.
The most minimal way I found to recreate the error is with
which raises an error because it can't find ~/miniconda3/envs/build/lib/python3.10/site-packages/gmprocess/data/config_production.yml.
I was not able to reproduce the error when I install from source or via pip, but I was able to reproduce it when I install from conda. I confirmed that the contents of src/gmprocess/data are missing from the install directory. Here's my code to reproduce:
I'm guessing we need to specify something in pyproject.toml or in the feedstock recipe to indicate that we want the data files in this directory to be included. I'm hoping you can point me in the right direction. Thanks.
Installed packages
Environment info
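A quick way to confirm whether the data files made it into a given environment is to ask `importlib.resources` (Python 3.9+) for them. This is a generic sketch rather than the reproduction code from the report; the helper name `missing_package_data` is an assumption, and it requires the package to be importable in the environment being checked.

```python
from importlib.resources import files


def missing_package_data(package, filenames):
    """Return the names in `filenames` that are not files inside `package`."""
    # files() imports the package and returns a traversable for its directory.
    root = files(package)
    return [name for name in filenames if not root.joinpath(name).is_file()]
```

For example, `missing_package_data("gmprocess.data", ["config_production.yml"])` would be expected to return an empty list on a complete install and `["config_production.yml"]` on one affected by this issue.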