astropy / pyvo

An Astropy affiliated package providing access to remote data and services of the Virtual Observatory (VO) using Python.
https://pyvo.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
75 stars 52 forks source link

Datalink getdataset too eager #554

Open msdemlei opened 3 months ago

msdemlei commented 3 months ago

Consider this script:

import pyvo

ACCESS_URL = "http://dc.g-vo.org/lswscans/res/positions/siap/siap.xml?"
svc = pyvo.sia.SIAService(ACCESS_URL)
images = svc.search((11,35), size=(0.1, 0.1))
images[0].cachedataset()

Warning: This will download at least 100 Megabytes with current pyVO.

It shouldn't, though.

This is one of the SIA1 services that return cutouts as access_urls. However, there is a datalink service on the result, too. That service only uses the full data ids, not any cutout instructions. That is, #this can be a FITS of 1 GB or so, whereas the accref can point to just a few epsilons.

Now, you can argue that if I can't make #this exactly the dataset that is described in the SIA row then I shouldn't make the datalink in the first place; but I claim this arrangement is useful in practice (e.g., people can adjust their cutouts, or pull the full image if that's what they want) and at least not totally wrong.

On the other hand, pyVO's current behaviour, that is, to prefer datalink's #this over the access reference given in the SIAP response, is weird (to say the least). I'm not even sure why we're doing it. It seems the original code was written by @funbaker, who I think has dropped out of pyVO development, so I don't think we can ask him.

Anyway, I'd suggest to prefer SIAP's access reference and only fall back to datalink if that's not going anywhere. Technically, I'd simply turn around the logic in adhoc.DatalinkRecordMixin.getdataset: try super() first and only if that exceptions out try datalink's #this.

Does anyone feel that would be grossly wrong?

andamian commented 3 months ago

Can the intersect argument be used to infer user's intentions? ENCLOSED probably means cutout whereas COVERS and CENTER does not. Not sure about OVERLAPS. But even then, there's the question of what is the default. But at least we could make it clear what it defaults to (right now is None)

msdemlei commented 3 months ago

On Thu, Jun 13, 2024 at 04:30:23PM -0700, Adrian wrote:

Can the intersect argument be used to infer user's intentions? ENCLOSED probably means cutout whereas COVERS and CENTER does not. Not sure about OVERLAPS.

Interesting idea, but what makes me immediately skeptic is that I don't think the choice of intersect is a conscious one in most cases. From what I see, most people just use the default.

But really, I think there is no good reason to second-guess the data providers' accrefs in general. If there is no accref-like column (I think that's a rare occurrence; what I can think of is something like "SELECT obs_publisher_did from ivoa.obscore"), falling back to datalink is probably a nice service, but other than that I'd say the less logic the better.