AstarVienna / ScopeSim

A telescope observation simulator for Python.
GNU General Public License v3.0
14 stars 10 forks source link

Use `pooch` for `download_example_data` #446

Closed teutoburg closed 2 months ago

teutoburg commented 2 months ago

~TLDR~ Summary

The astropy-based download and cache functionality previously used here led to many failed runs in ScopeSim_Data. This should hopefully improve now with pooch, which among other applications is specifically advertised for use on "example data". It doesn't really add an additional dependency, because we're already using pooch as part of spextra, and (at least in testing) also through SciPy (the raccoon face image).

On the download_example_data function

There is an argument to be made about removing the optional url argument, because we can change the server URL through the cmds anyway (hopefully we don't have to...) and it's unlikely anyone would ever use this function to download different data from somewhere else. But I'm also fine with keeping it.

I changed the default cache location to ~.astar/scopesim/, in line with e.g. skycalc_ipy. Caching should also work better now with pooch, see below.

On hashes and caches

Paraphrasing from their documentation, pooch uses hashes to determine if the requested file on the server is different from the one stored in the cache and thus whether it needs to be downloaded. In all setups that are not "quick and dirty script mode", it refuses to download files without a known hash provided. In theory, pooch also supports (and encourages the use of) versioning, but I think that only works when the data is stored on a remote repo, e.g. GitHub, which we're explicitly avoiding because of file sizes involved.

Storing the hashes in a separate file is described in their docs, although I suppose we're still well below having a "large number of files", so we might as well put the hashes in a dict. I don't think it'll make a practical difference. If we modify or add files in the future, we can simply change or add the corresponding hash and everything should work just fine.

Now I'm not sure how the caching behavior affects ScopeSim_Data, but we'll see...

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 18.75000% with 13 lines in your changes missing coverage. Please review.

Project coverage is 75.58%. Comparing base (6b2a07c) to head (e5fc68c).

Files Patch % Lines
scopesim/server/example_data_utils.py 18.75% 13 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #446 +/- ## ========================================== + Coverage 75.50% 75.58% +0.08% ========================================== Files 64 64 Lines 7833 7824 -9 ========================================== Hits 5914 5914 + Misses 1919 1910 -9 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

hugobuddel commented 2 months ago

I suppose the idea behind pooch is to put the hash in the code (and thereby in the git commit hash etc.), so you know what file you are going to get for reproducibility (and perhaps security).

If we don't care about reproducibility or security (and perhaps we don't) but only data integrity, then we don't need this extra hash but can instead rely on the checksums that could/should be in the header of the FITS files (e.g. DATAMD5,CHECKSUM,DATASUM).

So I don't see a good reason for having the hashes in a separate file. It doesn't really cause any problems either, so whatever.

hugobuddel commented 2 months ago

But using pooch is obviously better than what we had, since it actually works, so don't pay too much attention to my comment about the hashes.

teutoburg commented 2 months ago

It seems the download function isn't actually covered by our tests. We could add a test file on the server and cache it to a temp dir in a test (indeed a webtest), that way it would be downloaded each time...

teutoburg commented 2 months ago

Should we nevertheless add pooch to the dependencies in the pyproject.toml file?

teutoburg commented 2 months ago

I should read before I type

Goes for me as well it seems. Pooch already was in our pyproject.toml dependencies, probably for the raccoon face thing. Oops.

Anyway, after reading through their changelog, I think it's worth upgrading to the latest version, so I did.

hugobuddel commented 2 months ago

Oops, now the METIS notebooks cannot find the downloaded files anymore: https://github.com/AstarVienna/ScopeSim_Data/actions/runs/10233670841/job/28312246596

FileNotFoundError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 input_hdul = fits.open("HL_Tau_prep_for_Scopesim.fits")

Apparently the notebooks expects the downloaded file to be present in the current working directory.

Should the notebooks be fixed and use the returned paths, or should download function place copies/symlinks in the current working directory?

I think we should update the notebooks.

hugobuddel commented 2 months ago

I should read before I type

Goes for me as well it seems. Pooch already was in our pyproject.toml dependencies, probably for the raccoon face thing. Oops.

Anyway, after reading through their changelog, I think it's worth upgrading to the latest version, so I did.

Did this perhaps break the windows update-to-latest-dependencies test? https://github.com/AstarVienna/ScopeSim/actions/runs/10233695289/job/28316098713

teutoburg commented 2 months ago

Did this perhaps break the windows update-to-latest-dependencies test? https://github.com/AstarVienna/ScopeSim/actions/runs/10233695289/job/28316098713

Running poetry update locally on Win11 Py3.11 works fine (from current main).

However, on my Win10 machine that I recently upgraded from Py3.9 to 3.12, I get weird errors:

teutoburg commented 1 month ago

However, on my Win10 machine that I recently upgraded from Py3.9 to 3.12, I get weird errors:

  • Simple poetry install (from lock file) fails around pyerfa. Both on main and on 6b2a07c, before the recent changes. Seems to be a problem on that machine caused by the 3.9->3.12 update...
  • Running poetry update solves the pyerfa, but I get an error from matplotlib instead. Attempting to just update this pyerfa, it now fails on synphot. I think I broke my installation here 😑

This should be solved by #450