Solution to issue cannot be found in the documentation.


Hi @ocefpaf, I got an error report from a user via email, and it was clear that the cause was a file in the data subpackage not being found. Note that we recently switched to primarily using pyproject.toml for setup options. When we did this, it didn't seem necessary to specify package data, but surely this is where something has gone wrong.

The most minimal way I found to recreate the error is with

$ gmrecords processing_steps

which raises an error because it can't find ~/miniconda3/envs/build/lib/python3.10/site-packages/gmprocess/data/config_production.yml.

I was not able to reproduce the error when installing from source or via pip, but I was able to reproduce it when installing from conda. I confirmed that the contents of src/gmprocess/data are missing from the install directory. Here's my code to reproduce:

$ conda create --name build pip gmprocess python=3.10
$ tree ~/miniconda3/envs/build/lib/python3.10/site-packages/gmprocess/data
└── __pycache__
    └── __init__.cpython-310.pyc
1 directory, 2 files

I'm guessing we need to specify something in pyproject.toml or in the feedstock recipe to indicate that we want the data files in this directory to be included. I'm hoping you can point me in the right direction. Thanks.
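
A quick way to check an environment for this problem, without running gmrecords, is to ask Python whether the file is present in the installed package. This is only a sketch: missing_package_files is a hypothetical helper name, and it assumes Python 3.9 or newer for importlib.resources.files.

```python
from importlib import resources


def missing_package_files(package, filenames):
    """Return the names in `filenames` that are not files inside `package`.

    `package` is an importable dotted name such as "gmprocess.data".
    """
    root = resources.files(package)
    return [name for name in filenames if not root.joinpath(name).is_file()]
```

For example, missing_package_files("gmprocess.data", ["config_production.yml"]) should return an empty list in a working install and list the file name in an affected one.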

ocefpaf commented 1 year ago

The source is not available on PyPI and this package is being built using:

That is a ginormous download BTW, 125 MB! The wheel on PyPI is ~25 MB, which is already quite big but still much smaller than this source. It would be nice if a source distribution were published on PyPI alongside the wheel, and if it were a bit smaller, with just the files required to build the package. With that said, the data files are there, but they are not making it into the final sdist used to build the conda package. If you run:

 python -m build --sdist . --outdir dist

as per your pyproject.toml, you'll get an empty data directory. Maybe more files are missing, but I'm not familiar enough with the package to know. You are using setuptools, so you may be able to fix that by using this [1].

Note that you are not building your wheel with the standards in your pyproject.toml! If you were using build and the metadata there you would get an empty data directory too! (You can try that by downloading that version number source and typing: python -m build --wheel . --outdir dist).

TL;DR it is a problem upstream and you can fix it with [1] and/or adding a file.

emthompson-usgs commented 1 year ago

Thanks for looking at this. The source distribution gets rejected by pypi because of the size. The issue is that it includes the test data, whereas the wheel does not. The only way I can think of to fix the size would be to put the test data somewhere else.

I had been building the wheel with

python -m build

So I don't know why the data contents were included (leading me to think everything was okay in this regard):

$ unzip -l dist/gmprocess-1.2.3.dev0-py3-none-any.whl | grep gmprocess/data
      616  08-15-2022 15:45   gmprocess/data/CESMD_NGA_ids.csv
    41885  08-15-2022 15:45   gmprocess/data/GDMSstations.json
  1122213  08-15-2022 15:45   gmprocess/data/NGA_West2_SiteDatabase_V032.csv
        0  08-15-2022 15:45   gmprocess/data/
    18992  12-22-2022 21:23   gmprocess/data/config_production.yml
    25699  11-05-2022 22:57   gmprocess/data/config_test.yml

My prior reading of the setuptools page that you linked to made me think that when using pyproject.toml, data files were included by default and didn't require additional specification. It sounds like I'll have to re-read it more carefully.
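
Since the question here is which build artifacts actually contain the data files, it may help to run the same check on both the wheel and the sdist. The sketch below does roughly what the unzip -l ... | grep pipeline above does, using only the standard library. The function names archive_members and members_under are made up for illustration.

```python
import tarfile
import zipfile


def archive_members(path):
    """Return member names of a wheel (a zip file) or an sdist (.tar.gz)."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return zf.namelist()
    # Anything that is not a zip is treated as a (possibly compressed) tarball.
    with tarfile.open(path) as tf:
        return tf.getnames()


def members_under(path, fragment):
    """Return the sorted member names that contain `fragment`, like `| grep fragment`."""
    return sorted(name for name in archive_members(path) if fragment in name)
```

Calling members_under on each file in dist/ with "gmprocess/data" as the fragment would show at a glance whether the sdist and the wheel agree.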

emthompson-usgs commented 1 year ago

Quick update: I ran the same command you did to get the source distribution:

python -m build --sdist . --outdir dist

But I don't get an empty data directory:

$ tar -tvf dist/gmprocess-1.2.3.dev0.tar.gz  | grep gmprocess/data
drwxr-xr-x  0 emthompson 176539137      0 Dec 23 12:06 gmprocess-1.2.3.dev0/src/gmprocess/data/
-rw-r--r--  0 emthompson 176539137    616 Aug 15 09:45 gmprocess-1.2.3.dev0/src/gmprocess/data/CESMD_NGA_ids.csv
-rw-r--r--  0 emthompson 176539137  41885 Aug 15 09:45 gmprocess-1.2.3.dev0/src/gmprocess/data/GDMSstations.json
-rw-r--r--  0 emthompson 176539137 1122213 Aug 15 09:45 gmprocess-1.2.3.dev0/src/gmprocess/data/NGA_West2_SiteDatabase_V032.csv
-rw-r--r--  0 emthompson 176539137       0 Aug 15 09:45 gmprocess-1.2.3.dev0/src/gmprocess/data/
drwxr-xr-x  0 emthompson 176539137       0 Dec 23 12:06 gmprocess-1.2.3.dev0/src/gmprocess/data/asdf/

So I'm wondering why there is this difference in behavior in my install. Could it be the version of setuptools or build? Here's what I have:

build                     0.9.0                    pypi_0    pypi
setuptools                65.6.3             pyhd8ed1ab_0    conda-forge

emthompson-usgs commented 1 year ago

Also, I will work on making the source distribution smaller. Some of these data files definitely don't need to be there, and I can also exclude the tests and docs directories which should shave off a ton of space.

emthompson-usgs commented 1 year ago

For the latest release (1.2.3), the pypi source and wheel distributions are now much smaller (~5.6 MB), so the source distribution is no longer rejected by pypi. It occurs to me now that the source url in the recipe/meta.yaml file points to the tar.gz of the source, which is still large since that is simply a tar of the repo and not the result of python -m build. So I am thinking I'll change the URL to point to the pypi-hosted source distribution.

ocefpaf commented 1 year ago

put the test data somewhere else.

Yep, most projects serve the data on GH and have a script to download it at test time [1]. Another strategy, where possible, is to auto-generate the test data.

[1] one pattern I like is to use pooch to fetch it. See for an example.
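
To make the idea concrete, here is a minimal standard-library stand-in for the download-at-test-time pattern. It is not pooch and does not use pooch's API. It only illustrates the two things such a tool does: cache the file locally and verify its checksum. The name fetch and its arguments are hypothetical, and the URL and hash would be whatever the project publishes.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch(url, sha256, cache_dir):
    """Download `url` into `cache_dir` once and verify its SHA-256 checksum."""
    # The cached file is named after the last path component of the URL.
    target = Path(cache_dir) / url.rsplit("/", 1)[-1]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    if digest != sha256:
        target.unlink()  # drop the bad copy so the next call downloads again
        raise ValueError(f"checksum mismatch for {target.name}")
    return target
```

A test would call fetch(...) in a fixture and read the returned path, so the data never has to live in the sdist or the wheel.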

So I am thinking I'll change the URL to point to the pypi-hosted source distribution.


emthompson-usgs commented 1 year ago

Solved with this update. Thanks @ocefpaf!