akrherz opened this issue 4 years ago
EEEEK! That's no bueno. That's no rant, that's...super annoying and would definitely cause problems elsewhere in no-internet environments.
My preferred solution, though, would be to make the code that loads the station file more lazy. You shouldn't pay an import price for anything you're not using. Does that sound ok?
Thanks @dopplershift for the gentle response. I think lazy loading would be the simplest and most direct thing to do, but I still wonder about the use of `get_test_data()` and the `TEST_DATA_DIR` environment variable to control the loading of non-test data. For testing, all `staticdata` files are already provided, right? So pooch doesn't actually fetch anything in the testing case? I dunno, this gets more into design and implementation, but why is pooch even needed here for how it is currently being used?
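For what it's worth, the pattern pooch provides is roughly the sketch below (a simplified pure-Python stand-in, not pooch's actual API): the network is only touched when the cached copy is missing or fails its hash check.

```python
import hashlib
from pathlib import Path

def fetch_static_file(name, cache_dir, known_hash, downloader):
    # Simplified stand-in for pooch's fetch logic: serve the cached
    # copy when it exists and matches the expected SHA-256; otherwise
    # call the downloader and cache the result for next time.
    path = Path(cache_dir) / name
    if path.exists():
        if hashlib.sha256(path.read_bytes()).hexdigest() == known_hash:
            return path  # cache hit: no network access at all
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(downloader(name))  # network only on a miss
    return path
```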
Well, `staticdata` is also there for examples. I can see a case to be made that for things needed for MetPy features, we should just ship the data files needed; but then, do we really want to be shipping the state/county shapefiles?
In my case, I am going to use your shapefiles to generate MetPy maps; that is something I want to have access to. My goal is to get weather data in GIS format from the NOAA web site and display it. As for station information, I was also thinking of getting it from the Aviation Weather web site as a CSV file. Sometimes we just want a subset of METAR reports, not the full dataset. I find it much quicker to retrieve METAR reports from Aviation Weather, where no parsing is needed; the parsing takes too long to process.
@eliteuser26 For the county/state shapefiles we won't make it any less accessible than it already is. The question is whether they should be part of the install package, or only downloaded (and cached) when a user actually tries to use the feature.
@dopplershift I suppose I can download the shapefiles manually and save them locally. In that case it would only need to be done once. As for the airport station information, I have been saving it into a MySQL database. The problem with this is that I need to keep it updated. One less thing to download, I guess.
Let me clarify: the current situation is that MetPy, through the use of pooch, automatically downloads the shapefiles for you—no manual intervention needed. The only alternative I’m considering is including them in the metpy package files.
Regardless, there will be no manual user intervention needed here. Either MetPy will download the files for you, or MetPy will ship with the files.
@dopplershift If it downloads the shapefiles by default then that would be better. At least it will provide the latest version of the shapefiles.
We just ran into this issue (literally as GitHub has been having issues today)
Random runtime errors due to a hidden external service are no good:
```
ERROR test/....py - requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://raw.githubusercontent.com/Unidata/MetPy/v0.12.0/staticdata/sfstns.tbl
ERROR test/....py - requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://raw.githubusercontent.com/Unidata/MetPy/v0.12.0/staticdata/sfstns.tbl
ERROR test/....py - requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://raw.githubusercontent.com/Unidata/MetPy/v0.12.0/staticdata/sfstns.tbl
```
I'd much rather have them bundled, storage is cheap. +1 from me.
*sigh* Yeah, just encountered it myself, though it seems to be resolved, or at least working somewhat for me. But yes, we'll have to do something about this so that, regardless, it's not a blocker to using basic MetPy functionality.
> The only alternative I’m considering is including them in the metpy package files.
A third option would be to bundle this data in an optional, data-only package. This way, the data could be updated on a separate timeline from the main package, and explicitly downloaded before moving offline.
So while storage is cheap, we don't want people to have to download a 100M package just for MetPy; packages like that are the ones I swear at when I'm teaching a workshop and stuck on spotty wifi.
@MinchinWeb Yeah, I'm coming around to that as an option. Just want to do this in a way that doesn't add too much testing burden on us.
So the data files used by features (not tests/examples) in MetPy:
So we're looking at 35M of data files right now on disk (18M compressed). Package sizes from anaconda.org:
So this would increase our package size to something comparable with downloading SciPy, which seems big (we're currently 720K)...but it would also eliminate some potentially unexpected network requirements. It's probably a mistake that we use a function called `get_test_data` to implement features in the code base. Anyone have arguments for bundling the data in vs. a separate package? The separate-package approach doesn't seem to be done as frequently on PyPI, but does see some use on Anaconda.
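Either shipped option would make lookup a purely local operation; roughly like the sketch below (illustrated generically, since a `metpy_data` distribution is hypothetical):

```python
from importlib import resources

def bundled_file(package, filename):
    # If static data shipped inside an installed package (say, a
    # hypothetical metpy_data distribution), importlib.resources can
    # locate it with no network access or cache directory involved.
    return resources.files(package).joinpath(filename)
```

In the separate-package scenario this would be something like `bundled_file('metpy_data', 'sfstns.tbl')`.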
I recently ran into a similar frustration. To piggyback on what @akrherz originally said, I, too, run in environments that can be a bit more restricted than typical users'. A current use case of mine includes using MetPy as part of a Flask application embedded in Apache. I create virtual environments using `venv`. I think in normal cases the files are downloaded to a `share/` directory that is writable. In my case, which I think is related to it being an Apache-embedded process, I get the following error:
```
PermissionError: [Errno 13] Permission denied: '/usr/share/httpd/.cache'
Pooch could not create data cache folder '/usr/share/httpd/.cache/metpy/v1.0.1'. Will not be able to download data files.
```
If it would write to the `share/` folder in the virtual environment it would be fine, but it instead tries to write to a locked-down system folder. Ultimately, I do not think I even use the file, but my application chokes because it is not present and cannot be downloaded and written to disk. Setting `TEST_DATA_DIR` is not really an optimal solution in my case either, though I may have to use it at least to get running in the meantime. I actually looked to see if the staticdata had been placed in its own package, as is done from time to time.
Just to be clear, on Linux it's trying to write to `$HOME/.cache`, so the problem there is that `$HOME` for Apache is set to `/usr/share/httpd`.
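The Linux behavior is roughly the sketch below (a simplified stand-in for the platformdirs logic pooch relies on; exact precedence may differ), which is also why exporting `XDG_CACHE_HOME` to a writable location is one possible workaround:

```python
import os
from pathlib import Path

def cache_dir(app='metpy', version='v1.0.1'):
    # Simplified: XDG_CACHE_HOME wins if set; otherwise fall back to
    # ~/.cache, which expands through $HOME. Under Apache,
    # HOME=/usr/share/httpd, hence the unwritable path in the error.
    base = os.environ.get('XDG_CACHE_HOME') or os.path.join(
        os.path.expanduser('~'), '.cache')
    return Path(base) / app / version
```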
@nawendt Which files are you having trouble downloading?
The file was `sfstns.tbl`, which seems to be brought in because I do `from metpy.io._tools import IOBuffer, NamedStruct`.
> The file was `sfstns.tbl`, which seems to be brought in
@nawendt you sure that happened with current MetPy? I thought that was fixed with #1840
Well, dang. I feel silly. I was on an old version because of being on RHEL7. My bad.
No worries. I do feel like this is still a problem overall, I'm just not satisfied with any of the solutions so far.
Sorry to revive a zombie issue, but is there a solution for this? I was trying to run a Cartopy/MetPy example on a node at my center that doesn't have internet access (on a supercomputing cluster). I was able to stage the Cartopy static data using:
CARTOPY_DATA_DIR=/discover/nobackup/projects/gmao/SIteam/CartopyDownloads
(Bad directory name but I can work on that.)
But I can't seem to get MetPy to see staticdata. I cloned the MetPy repo, then moved the `staticdata/` directory to `/discover/nobackup/projects/gmao/SIteam/metpy-static-data/staticdata`.
But it doesn't matter if I set:
TEST_DATA_DIR=/discover/nobackup/projects/gmao/SIteam/metpy-static-data/staticdata
or even:
TEST_DATA_DIR=/discover/nobackup/projects/gmao/SIteam/metpy-static-data
it keeps trying to download data that is already there on disk.
Is there a different environment variable to use?
That's weird. `TEST_DATA_DIR` is what we're using in CI (on GitHub Actions) to accomplish the same thing.
I just confirmed on my Mac that if I do:
`TEST_DATA_DIR=/Users/rmay/repos/metpy/staticdata python NEXRAD_Level_3_File.py`
no data are downloaded. (But just doing `python NEXRAD_Level_3_File.py` produces messages about downloading data.)
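The check is essentially the sketch below (simplified, not MetPy's exact code): if `TEST_DATA_DIR` is set and the file exists there, it's used directly and pooch is never consulted.

```python
import os
from pathlib import Path

def get_static_file(filename, fetcher=None):
    # Simplified sketch of the TEST_DATA_DIR escape hatch:
    # a staged local copy short-circuits any pooch download.
    local_dir = os.environ.get('TEST_DATA_DIR')
    if local_dir:
        path = Path(local_dir) / filename
        if path.exists():
            return path
    if fetcher is None:
        raise RuntimeError(f'{filename} not staged locally and no fetcher available')
    return fetcher(filename)
```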
How are you setting the value for `TEST_DATA_DIR`?
I was doing:
`prepend_path("TEST_DATA_DIR","/discover/nobackup/projects/gmao/SIteam/metpy-static-data/staticdata")`
in my modulefile. Maybe I'll try again when I'm back at work...
My blissful world includes running Python in environments with no internet access, jailed HTTP web services, and Python processes with read-only state. When I install MetPy from conda-forge, I manually edit `cbook.py` to remove all the pooch usage, as I lamely do not want to deal with it and the warnings it will spew to stdout/stderr in the environments above.

With metpy=0.12, it now auto-loads `airport-codes.csv` as part of the `metpy.io.station_data` import, since `metpy.io` imports `*`. For example:

So all I want is `from metpy.io.nexrad import Level3File`, but I am having to download an unrelated station file.

Would you consider just batteries-including the files that are required for MetPy functionality (as opposed to testing)? Having to set the `TEST_DATA_DIR` environment variable is not very intuitive just to get a running MetPy in the environments above.

So there, I ranted. Sorry.