Unidata / MetPy

MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data.
https://unidata.github.io/MetPy/
BSD 3-Clause "New" or "Revised" License

MetPy's usage of POOCH is frustrating #1320

Open akrherz opened 4 years ago

akrherz commented 4 years ago

My blissful world includes running Python within environments with no internet access, jailed HTTP web services, and Python processes with read-only state. When I install MetPy from conda-forge, I manually edit cbook.py to remove all the pooch usage, since I'd rather not deal with it or the warnings it spews to stdout/stderr in the environments above.

With metpy=0.12, importing metpy.io now auto-loads airport-codes.csv, because metpy/io/__init__.py does a star import that pulls in metpy.io.station_data. For example:

Traceback (most recent call last):
  File "pyWWA/parsers/nexrad3_attr.py", line 8, in <module>
    from metpy.io.nexrad import Level3File
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/metpy/io/__init__.py", line 12, in <module>
    from .metar import *  # noqa: F403
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/metpy/io/metar.py", line 15, in <module>
    from .station_data import station_info
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/metpy/io/station_data.py", line 140, in <module>
    station_info = StationLookup()

All I want is from metpy.io.nexrad import Level3File, but I am forced to download an unrelated station file.

Would you consider just shipping ("batteries included") the files that are required for MetPy functionality, as opposed to those only needed for testing? Having to set the TEST_DATA_DIR environment variable just to get a running MetPy in the environments above is not very intuitive.

So there I ranted, sorry.

dopplershift commented 4 years ago

EEEEK! That's no bueno. That's no rant, that's...super annoying and would definitely cause problems elsewhere in no-internet environments.

My preferred solution, though, would be to make the code that loads the station file more lazy. You shouldn't pay an import price for anything you're not using. Does that sound ok?

akrherz commented 4 years ago

Thanks @dopplershift for the gentle response. I think lazy loading would be the simplest and most direct thing to do, but I still wonder about the usage of get_test_data() and the environment variable TEST_DATA_DIR to control this loading of non-test data. For testing, all staticdata files are already provided, right? So pooch doesn't really fetch anything in the case of testing?

I dunno, this gets more into design and implementation, but why is pooch even needed here, given how it is currently being used?

dopplershift commented 4 years ago

Well, staticdata is also there for examples. I can see a case to be made that for things needed for MetPy features, we should just ship the necessary data files; but then, do we really want to be shipping the state/county shapefiles?

eliteuser26 commented 4 years ago

For me, I am going to use your shapefiles to generate MetPy maps; that is something I want access to. My goal is to get weather data in GIS format from the NOAA web site and display it. As for station information, I was also thinking of getting it from the Aviation Weather web site as a CSV file. Sometimes we just want a subset of METAR reports, not the full dataset, and I find it much quicker to retrieve METAR reports from Aviation Weather, where no parsing is needed; the parsing takes too long to process.

dopplershift commented 4 years ago

@eliteuser26 For the county/state shapefiles we won't make it any less accessible than it already is. The question is whether they should be part of the install package, or only downloaded (and cached) when a user actually tries to use the feature.

eliteuser26 commented 4 years ago

@dopplershift I suppose I can download the shapefiles manually and save them locally. In that case it would only be done once. As for the airport station information, I have been saving it into a MySQL database. The problem with this is that I need to keep it updated. One less thing to download, I guess.

dopplershift commented 4 years ago

Let me clarify: the current situation is that MetPy, through the use of pooch, automatically downloads the shapefiles for you—no manual intervention needed. The only alternative I’m considering is including them in the metpy package files.

Regardless, there will be no manual user intervention needed here. Either MetPy will download the files for you, or MetPy will ship with the files.
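For reference, the download-and-cache behavior described here can be sketched with the standard library alone (a simplification: MetPy actually delegates this to pooch, which also verifies checksums; the URL and filenames below are placeholders):

```python
import urllib.request
from pathlib import Path

def cached_fetch(url, cache_dir):
    """Return a local path for url, downloading only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local = cache_dir / url.rsplit('/', 1)[-1]
    if not local.exists():
        # Only hit the network when the file is not already cached.
        urllib.request.urlretrieve(url, local)
    return local
```

Subsequent calls find the file already in the cache directory and return immediately, which is why the download happens at most once per machine.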

eliteuser26 commented 4 years ago

@dopplershift If it downloads the shapefiles by default then that would be better. At least it will provide the latest version of the shapefiles.

four43 commented 4 years ago

We just ran into this issue (literally as GitHub has been having issues today)

Random runtime errors due to a hidden external service are no good:

ERROR test/....py - requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://raw.githubusercontent.com/Unidata/MetPy/v0.12.0/staticdata/sfstns.tbl
ERROR test/....py - requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://raw.githubusercontent.com/Unidata/MetPy/v0.12.0/staticdata/sfstns.tbl
ERROR test/....py - requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://raw.githubusercontent.com/Unidata/MetPy/v0.12.0/staticdata/sfstns.tbl

I'd much rather have them bundled; storage is cheap. +1 from me.

dopplershift commented 4 years ago

*sigh* Yeah, just encountered it myself, though it seems to be resolved, or at least working somewhat, for me. But yes, we'll have to do something about this so that, regardless, it's not a blocker to using basic MetPy functionality.

MinchinWeb commented 3 years ago

The only alternative I’m considering is including them in the metpy package files.

A third option would be to bundle this data in an optional, data-only package. This way, the data could be updated on a separate timeline from the main package, and explicitly downloaded before moving offline.
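A sketch of how MetPy could prefer such a data-only package and fall back to downloading otherwise (metpy_staticdata is a hypothetical package name, and data_source is illustrative, not a real MetPy function):

```python
import importlib.util

def data_source(package='metpy_staticdata'):
    """Prefer a locally installed data-only package; otherwise download.

    Returns 'local-package' when the optional package is importable,
    'download' when it is absent and the download-and-cache path
    (e.g. pooch) would be used instead.
    """
    if importlib.util.find_spec(package) is not None:
        return 'local-package'
    return 'download'
```

Offline users would then pip-install the data package before losing network access, while everyone else keeps the current on-demand behavior.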

dopplershift commented 3 years ago

So while storage is cheap, we don't want people to have to download a 100M package just for MetPy; packages like that are the ones I swear at when I'm teaching a workshop and stuck on spotty wifi.

@MinchinWeb Yeah, I'm coming around to that as an option. Just want to do this in a way that doesn't add too much testing burden on us.

dopplershift commented 3 years ago

So the data files used by features (not tests/examples) in MetPy:

So we're looking at 35M of data files right now on disk (18M compressed). Package sizes from anaconda.org:

So this would increase our package size to something comparable to downloading SciPy, which seems big (we're currently 720K)...but it would also eliminate some potentially unexpected network requirements. It's probably a mistake that we use a function called get_test_data to implement features in the code base. Does anyone have arguments for bundling the data vs. a separate package? The separate-package approach doesn't seem to be done as frequently on PyPI, but does have some use on Anaconda.

nawendt commented 2 years ago

I recently ran into a similar frustration. To piggyback on what @akrherz originally said, I, too, run in environments that can be a bit more restricted than typical users'. A current use case of mine is using MetPy as part of a Flask application embedded in Apache. I create virtual environments using venv. In normal cases, I think the files are downloaded to a writable share/ directory. In my case, which I think is related to being an Apache-embedded process, I get the following error:

PermissionError: [Errno 13] Permission denied: '/usr/share/httpd/.cache' | Pooch could not create data cache folder '/usr/share/httpd/.cache/metpy/v1.0.1'. Will not be able to download data files.

If it would write to the share/ folder in the virtual environment it would be fine, but it instead tries to write to a locked-down system folder. Ultimately, I do not think I even use the file, but my application chokes because it is not present and cannot be downloaded and written to disk. Setting TEST_DATA_DIR is not really an optimal solution in my case either, though I may have to use it at least to run in the meantime. I actually looked to see if staticdata had been placed in its own package, as is done from time to time.
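For context, the cache-directory selection behind that error can be sketched roughly like this (a simplification: pooch derives the real default from the platform's cache-directory conventions, and resolve_cache_dir is a hypothetical helper, not MetPy's API; TEST_DATA_DIR is the override mentioned above):

```python
import os
from pathlib import Path

def resolve_cache_dir(env=None):
    """Pick a data directory: env-var override first, else under $HOME."""
    env = os.environ if env is None else env
    override = env.get('TEST_DATA_DIR')
    if override:
        return Path(override)
    # Default mirrors the failing case: it lives under the home directory,
    # which for an Apache-embedded process may be an unwritable system path.
    return Path.home() / '.cache' / 'metpy'
```

The failure mode above is exactly the fallback branch: with no override set, the cache lands under whatever $HOME happens to be for the server process.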

dopplershift commented 2 years ago

Just to be clear, on Linux it's trying to write to $HOME/.cache, so the problem there is that $HOME for Apache is set to /usr/share/httpd.

@nawendt What files are you encountering problems with it downloading?

nawendt commented 2 years ago

The file was sfstns.tbl which seems to be brought in because I import from metpy.io._tools import IOBuffer, NamedStruct.

akrherz commented 2 years ago

The file was sfstns.tbl which seems to be brought in

@nawendt you sure that happened with current MetPy? I thought that was fixed with #1840

nawendt commented 2 years ago

Well, dang. I feel silly. I was on an old version because of it being RHEL7. My bad.

dopplershift commented 2 years ago

No worries. I do feel like this is still a problem overall; I'm just not satisfied with any of the solutions so far.

mathomp4 commented 1 month ago

Sorry to revive a zombie issue, but is there a solution for this? I was trying to run a Cartopy/MetPy example on a node here at my center that doesn't have internet access (on a supercomputing cluster). I was able to stage the Cartopy static data using:

CARTOPY_DATA_DIR=/discover/nobackup/projects/gmao/SIteam/CartopyDownloads

(Bad directory name but I can work on that.)

But I can't seem to get MetPy to see staticdata. I cloned the MetPy repo, then moved the staticdata/ directory to /discover/nobackup/projects/gmao/SIteam/metpy-static-data/staticdata.

But it doesn't matter if I set:

TEST_DATA_DIR=/discover/nobackup/projects/gmao/SIteam/metpy-static-data/staticdata

or even:

TEST_DATA_DIR=/discover/nobackup/projects/gmao/SIteam/metpy-static-data

it keeps trying to download data that is already there on disk.

Is there a different environment variable to use?

dopplershift commented 4 weeks ago

That's weird. TEST_DATA_DIR is what we're using in CI (on GitHub Actions) to accomplish the same thing.

I just confirmed on my mac that if I do:

TEST_DATA_DIR=/Users/rmay/repos/metpy/staticdata python NEXRAD_Level_3_File.py

no data are downloaded. (But just doing python NEXRAD_Level_3_File.py produces messages about downloading data.)

How are you setting the value for TEST_DATA_DIR?

mathomp4 commented 4 weeks ago

I was doing:

prepend_path("TEST_DATA_DIR","/discover/nobackup/projects/gmao/SIteam/metpy-static-data/staticdata")

in my modulefile. Maybe I'll try again when I'm back at work...
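One quick sanity check (assuming a POSIX shell) is to confirm the variable actually reaches the interpreter that runs MetPy; a modulefile loaded in a different shell session may not propagate to the process running the script:

```shell
# Set the variable and confirm the Python process that will run MetPy
# actually sees it (the path is the one from above).
export TEST_DATA_DIR=/discover/nobackup/projects/gmao/SIteam/metpy-static-data/staticdata
python -c 'import os; print(os.environ["TEST_DATA_DIR"])'
```

If that prints nothing (or raises KeyError), the problem is in how the modulefile exports the variable, not in MetPy.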