astropy / pyvo

An Astropy affiliated package providing access to remote data and services of the Virtual Observatory (VO) using Python.
https://pyvo.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
77 stars 52 forks source link

Sane behaviour for servicetype="image" #429

Open msdemlei opened 1 year ago

msdemlei commented 1 year ago

In our Registry interface, we have had an "image" service type for a long time. It has so far been a synonym for sia1. Now that our Registry discovery supports sia2, too, the question is what to do with the image service type.

Here's a few options I have been pondering:

(1) Leave things as they are. Not pretty, because why should sia2 not be image?

(2) Deprecate servicetype="image" and then one day drop it. Well, we can always do that when we find out we can't come up with useful semantics.

(3) Make image the union of sia and sia2. I think that's about the worst we can do, because several services offer both sia and sia2, and these will then be queried twice (though we can't avoid duplicate results in general; in DaCHS services, for instance, there's just one global SIA2 service but many per-data collection SIA services, and it's hard to work out which SIA results you might also get through SIA2).

(4) Make image return sia services where available and sia2 otherwise. Since I personally don't like sia2 in its current state, I'd like that a lot. But we'd have quite a bit of explaining to do if we went for that.

(5) Make image return sia2 services where available and sia otherwise. I think that is about the least surprising thing we can do.

For (3) through (5), however, we have the additional problem of what get_service() would return. For reference, SIAService's search has the signature

(pos[, size, format, intersect, verbosity]) -> SIAResults

whereas SIA2Service's has

(pos=None, band=None, time=None, pol=None, field_of_view=None, spatial_resolution=None, spectral_resolving_power=None, exptime=None, timeres=None, publisher_did=None, facility=None, collection=None, instrument=None, data_type=None, calib_level=None, target_name=None, res_format=None, maxrec=None, session=None, **kwargs) -> SIA2Results

– where pos is, I think, a rather different beast in SIA2Service than in SIAService (it is in the protocol, I've not looked at the pyVO implementation).

This means that if we want to enable something like:

for svc in registry.search(servicetype="image"):
  matches = svc.search(???)
  for row in matches:
    # do something interesting with row

I think we won't get around a compatibility shim. What could this be? We could implement SIA2's extra conditions as good as we can locally on top of SIAResults (with suffiently rich service responses, that would be reasonable for band, time, and serveral of the others; btw: you don't have any guarantees that the constraints will be honoured in SIA2 either). But, really, someone else would have to contribute code for that. Volunteers?

Me, I think I'd do a GenericImageService with a search that just has pos and size, which would be translated into a CIRCLE query against SIA2 services. The return rows, on the other hand, would be pretty much SIA2Results, which are basically obscore rows.

So, what does everyone think? Do you perhaps have better ideas? I predict our future selves will be grateful if we think about this a bit. We'll always have similar problems when we have two major versions of a protocol coexisting, and this will probably happen quite a few times in the VO's future.

Should we have a side meeting on this in Bologna?

bsipocz commented 1 year ago

I'm very much leaning towards option 3) and leave it to the user to decide which one to use. Then the second choice is 5).

But I fully agree that it's something to be raised for input from the archives, especially from those who decide to serve both versions of the protocol. I would expect while some may prefer SIA v1, others very much prefer SIA v2.

bsipocz commented 1 year ago

Btw, this discussion is probably not limited to python/pyvo, but also I wonder whether it should be the same for e.g. the Gavo webform of the registry when one selects "Image Access"? Which options does it do currently?

msdemlei commented 1 year ago

On Thu, Mar 02, 2023 at 11:23:39AM -0800, Brigitta Sipőcz wrote:

Btw, this discussion is probably not limited to python/pyvo, but also I wonder whether it should be the same for e.g. the Gavo webform of the registry when one selects "Image Access"? Which options does it do currently?

WIRR (the gavo webform) uses standard_id LIKE '%', which does not make a difference between sia1 and sia2. For WIRR, that's perhaps not a disaster, since people will manually shove around access URLs. When sending services to TOPCAT via SAMP, TOPCAT does its own Registry query, and it will switch versions based on what it sees. I've not tried which one it prefers when there's both versions on a record, but since TOPCAT has basically the "minimal common SIA interface" proposed above, that doesn't matter much.

I don't know what Aladin does -- I suppose we should ask Pierre?

If the pyVO release planning allows this: should we take this to the apps session in Bologna and hope for coffee break discussions to figure out the right way to deal with this?

bsipocz commented 1 year ago

If the pyVO release planning allows this: should we take this to the apps session in Bologna and hope for coffee break discussions to figure out the right way to deal with this?

Yes, I think that's reasonable to bring up as a discussion item.

Also, I don't think this should necessarily hold up v1.5, currently, I consider #428 and #386 blockers for the release, but nothing else.

msdemlei commented 1 year ago

On Tue, Mar 07, 2023 at 10:36:35AM -0800, Brigitta Sipőcz wrote:

Yes, I think that's reasonable to bring up as a discussion item.

Who's going to go for it? Frankly, I'd much rather have it in Apps than in Registry (where it probably tends to be ignored). @andamian -- can you see yourself volunteering? I'd be happy to help out.

Also, I don't think this should necessarily hold up v1.5, currently, I consider #428 and #386 blockers for the release, but nothing else.

Ah yes, getting #386 merged would be great. I'm happy to rebase and polish as soon as @tomdonaldson gives me a green light.

For #428, sorry about bypassing setup.cfg. Should be better now, and of course feel free to further re-arrange the plumbing as you see fit.

andamian commented 1 year ago

Yes, we can probably find a spot in an Apps session for this discussion.

Without being very familiar with SIA1, I've always considered SIA2 as a natural evolution of the standard that will eventually supersede version 1. In that case, (5) made the most sense. However, recent SIA2 refactoring and the comments around it make them look like competing versions of the same standard. Is that correct? If so, (3) is the best option.

msdemlei commented 1 year ago

On Wed, Mar 08, 2023 at 10:28:35AM -0800, Adrian wrote:

Without being very familiar with SIA1, I've always considered SIA2 as a natural evolution of the standard that will eventually supersede version 1. In that case, (5) made the most sense. However, recent SIA2 refactoring and the comments around it make them look like competing versions of the same standard. Is that correct? If so, (3) is the best option.

Unfortunately, there is not terribly much consensus about this. SIAv2 was basically a political compromise that went through based on a "we need it for cubes and because we want some parameter-based protocol" narrative. As so often, there was some truth to that narrative, but probably not enough to make it really true, which is part of the reason for the lackluster takeup.

Be that as it may, since SIAv2 is now there and at least in combination with datalink it's certainly not less powerful than SIAv1, I suspect we should make a drive to sunset SIAv1 with its ugly ad-hoc UCDs and missing support for time constraints. Perhaps that's an agenda Apps could officially petition DAL to pursue?

I guess what I'm trying to say is: I'd suggest we at least pretend that SIAv1 services won't be around any more 10 years from now.

msdemlei commented 1 year ago

I've given a talk on that at the Bologna Interop (https://wiki.ivoa.net/internal/IVOA/InterOpMay2023Apps/sia-notes.pdf), and my conclusion from having prepared the talk and the reactions at the conference is: servicetype="image" was a mistake because it suggests an abstraction far too leaky to be useful (cf. https://en.wikipedia.org/wiki/Leaky_abstraction).

I suppose the intention was (or would now be) to enable an all-VO image search in that way. But such a thing is a lot more complicated that just looking for a certain kind of service. For one, you'd still be missing ObsTAP, and I've not even started with de-duplication, unifying interfaces, and so on.

Hence, here is what I propose:

(1) both servicetype="image" and servicetype="spectrum" won't change their current behaviour (i.e., they'll return sia1 or ssap services, respectively), but they will result in a deprecation warning saying, for now, that people should use sia or ssap instead. That's for the next release.

(2) We create a new module for doing global dataset discovery that hides the various details of protocols and all that from the user. That's definitely material for the next release, and I can't promise I'll invest terribly much time into that, but I'm not convinced that's what we need if we want user-comprehensible global dataset discovery. I'd say more about it in a bug I'd open when I close this. If and when the new functions are available, the deprecation warnings would point to them.

What does everyone think?

bsipocz commented 1 year ago

I'm not super happy about this, there was also a reasonably widespread notion that end-users shouldn't know much about vo, e.g. know sia vs sia2. With this proposal, we actually force them to know the difference and whether a dataset is offered via one or the other (e.g. at Irsa we have a mixture of the two and we really should not expect the users to know which datasets served by either).

msdemlei commented 1 year ago

On Fri, May 19, 2023 at 09:33:15AM -0700, Brigitta Sipőcz wrote:

I'm not super happy about this, there was also a reasonably widespread button that end users shouldn't know much about vo, e.g. knowing sia vs sia2. With this proposal we actually force them to know the difference and whether a dataset is offered via one or the other (e.g at Irsa we have a mixture of the two and we really should not expect the users to know which datasets served by either).

No, on the contrary. The reason I propose to drop image is exactly because I don't believe the "shouldn't now much about" can fly based on any "slim" layer on top of the raw registry (i.e., whatever servicetype="image" could do).

To be a bit more concrete, I think what we should come up with to replace it (perhaps expanding its capabilities over time) is function of the type

global_find_images(spatial=None, spectral=None, temporal=None, maxrec=1000) -> [obscore record]

(where spatial would probably start out as a circle, and the other two as min,max pairs; we should probably quickly expand to support multiple positions). The trickery of picking apart all the complications to achieve halfway sane and predictable behaviour can IMHO only be encapsulated in a function abstracting away all of SIA, SIA2, Obscore and who knows what else (HiPS?). And if that were easy enough to hide within servicetype="image", I'd promise it for 1.5 :-).

andamian commented 1 year ago

I agree that servicetype="image" -> SIA1 is incorrect but I don't know what the appropriate solution is. global_find_images might be a bit too ambitious. There could be tens of services, do we search all of them even if they will return duplicate data?

For a data centric functionality, the workflow should probably be:

  1. Search for all image services and return a list of services
  2. (optional) User selects a subset of services. This can be based on the data provider or service type.
  3. Search for data in the service subset and return a list
  4. (optional) User filters the data list in a subset 5.. Download data in the subset (with potential SODA operations: cutouts, packaging etc)

servicetype can achieve this with the union option. It will then be up to the user to pick services from the result and use them accordingly and this is the part that a higher level class could be useful. So as a fix, I suggest to make it a union not only of SIAs but all the services that return that data type. That can be deprecated in the future when/if a viable alternative becomes available.

Related to this, I've noticed that the CADC SIA service has 2 access URLs (one for each version) but the registry returns only one of them. TOPCAT shows both. Is that a bug or me not knowing how to use the registry search?

msdemlei commented 1 year ago

On Wed, May 31, 2023 at 11:03:49PM -0700, Adrian wrote:

I agree that servicetype="image" -> SIA1 is incorrect but I don't know what the appropriate solution is. global_find_images might be a bit too ambitious. There could be tens of services, do we search all of them even if they will return duplicate data?

Ah, we'll be querying hundreds of services sometimes, and dealing with duplicate data as best we can is part of the challenge why I think "give me all image services" is nowhere near enough to let people do global image discovery.

On the other hand, global data discovery has been the VO's promise for a long time, and if we can't pull it off in pyvo, who else could?

For a data centric functionality, the workflow should probably be:

  1. Search for all image services and return a list of services

Actually, we should already use constraints at this level -- more and more services give their STC coverage, and we ought to use this information to cut down on the number of services to query. Users will quickly notice that search times go down dramatically if they're precise in their constraints, and this kind of thing may actually be the killer application that makes data providers give their STC coverage.

  1. (optional) User selects a subset of services. This can be based on the data provider or service type.

Well... if people know about the data provider, they can probably use the registry directly -- I don't think there's a strong case to be made to use some global_find_x with that kind of constraint. But who knows? Perhaps people will like the interface so much that we should support arbitrary rtcons constraints?

  1. Search for data in the service subset and return a list

Yup -- and there's the usual problems here: parallel queries, filtering dupes, when to give up on services not returning data for X seconds, etc. My plan would be to start slowly and do more advanced stuff when we find people actually like it.

5.. Download data in the subset (with potential SODA operations: cutouts, packaging etc)

In particular automatic cutouts would be nice, yes.

servicetype can achieve this with the union option. It will then be up to the user to pick services from the result and use them accordingly and this is the part that a higher level class could be useful. So as a fix, I suggest to make it a union not only of SIAs but all the services that return that data type. That can be deprecated in the future when/if a viable alternative becomes available.

As I said, I don't think we can make a service interface that reasonably encapsulates the intracies of the various service types, and I don't think we can reasonably explain how to do the service selection smartly using registry constraints to a normal astronomers (starting with: "Well, why don't all data providers declare their STC coverage?" and "Why do you have SIA2 when there's obscore?", which I couldn't honestly answer myself).

All-VO operations (which is what this is about) require non-trivial development, and I'm sure we can't ask astronomers to do that themselves.

Data discovery (i.e., "stuff without a servicetype constraint") are a different matter, and they have a completely different workflow, so that's fine; folks can look at what services are available on the data they've discovered, and there's no need to consider semantics of "image" or "sia".

Related to this, I've noticed that the CADC SIA service has 2 access URLs (one for each version) but the registry returns only one of them. TOPCAT shows both. Is that a bug or me not knowing how to use the registry search?

Where do you look? If you used record.access_url, you should have got a DeprecationWarning if there's multiple access URLs (i.e., interfaces) for a capability; and then pyvo just picks one of the multiple URIs. That's again legacy API we should get rid of (there's no such thing as the access URL of a resource). If you did something else: What was it?

trjaffe commented 1 year ago

FWIT, I agree with much of what Brigitta and Adrian said. I don't disagree with Markus' aims, but I don't like the current PR by itself. If we had an alternative workflow for users to find all-VO images, that would be different, but we don't yet. I would go for option 5 or 3 (duplication will always be with us) along with documenting what that means. I think astronomers can deal with the resulting duplication. If at some point, we have a better way, we can revisit this, but we do not as yet. This PR as implemented it seems to me will only add confusion.

msdemlei commented 1 year ago

On Tue, Jun 13, 2023 at 08:43:05AM -0700, trjaffe wrote:

FWIT, I agree with much of what Brigitta and Adrian said. I don't disagree with Markus' aims, but I don't like the current PR by itself. If we had an alternative workflow for users to find all-VO images, that would be different, but we don't yet. I would go for

Well, it's not like there is a working scheme for having all-VO cross-protocol discovery right now, so this PR won't damage anything working so far. This PR just stops claiming something that's been... factually incorrect (to avoid less polite words) ever since we defined Obscore, and which has become more factually incorrect with SIAv2.

option 5 or 3 (duplication will always be with us) along with documenting what that means. I think astronomers can deal with the resulting duplication. If at some point, we have a better way, we

The problem with this plan is that SIAService and SIA2Service have rather different signatures -- their query methods accept different parameters and have different return types. Just blindly returning them intermixed is going to be a very painful experience if someone really wants to do all-VO queries. That's why I was talking about the compatibility shim in the original bug, but building that again is painful and relatively complicated. I'd much rather invest the time to build such a thing into a proper all-VO dataset discovery facility that people can actually use.

Let's just write the global discovery functions -- help (in particular with writing the tests, which is going to be really painful as we'll have to mock half of the VO) is highly welcome.

The image and spectrum service types, however, appear less and less salvagable the longer I think about them...

can revisit this, but we do not as yet. This PR as implemented it seems to me will only add confusion.

Where would you expect confusion? You mean when people had servicetype='image' and now get warnings telling them to better use something that doesn't exist yet?

Well, yes, I'd be happier if that other thing existed, too, but I think it's actually a good thing to tell them that "image" isn't what they may think it is. I'm more than happy to consider better alternatives for the warning text, though. Perhaps: "Don't use image any more, because it's not the all-VO image search you may suspect." Better alternatives would be most appreciated.

bsipocz commented 1 year ago

Given we have a long discussion here, I move my comment from #449 to here:

I would not like to go ahead with this deprecation, until we have a good/convenient alternative. I totally see the problems, but OTOH, the users should not really need to know about the versions, so doing an 'image regsearch should really just return both SIAv1 and v2. Then they can be used individually (assuming the search() functionality is fixed for sia2).

E.g. I strongly feel instead of this, one should be able to do one image search, that returns the ~90 services.

import pyvo
image_services_v1 = pyvo.regsearch(servicetype='sia')
image_services_v2 = pyvo.regsearch(servicetype='sia2')
v2_irsa = [s for s in image_services_v2 if 'irsa' in s.ivoid]
v1_irsa = [s for s in image_services_v1 if 'irsa' in s.ivoid]

The user should do a lot of digging after that anyway to see e.g. which "Spitzer" they want, but otherwise all the entries look so similar, that they should be in one result.

In [27]: v1_irsa[0]
Out[27]: ('ivo://irsa.ipac/2mass/images/asky-at', 'vs:catalogservice', '2MASS ASKY AT', '2MASS All-Sky Atlas Image Service', 'research', 'This service provides access to and information about the 2MASS All-Sky Atlas Images. Atlas Images delivered by this service are in FITS format and contain full WCS information in their headers. Additionally, the image headers contain photometric zero point information. 2MASS Atlas Images are suitable for quantitative photometric measurements.', 'https://irsa.ipac.caltech.edu/applications/2MASS/IM', '', 'archive', '', '', np.float32(nan), 'infrared', ['https://irsa.ipac.caltech.edu/cgi-bin/2MASS/IM/nph-im_sia?type=at&ds=asky&'], ['ivo://ivoa.net/std/sia'], ['vs:paramhttp'], ['std'])

In [28]: v1_irsa[1]
Out[28]: ('ivo://irsa.ipac/2mass/images/asky-ql', 'vs:catalogservice', '2MASS QL', '2MASS All-Sky Quicklook Image Service', 'research', 'This service provides access to and information about the 2MASS All-Sky Quicklook Images. The Quicklook Images delivered by this service are restored from lossy-compressed files in FITS format with full WCS information contained in the image headers. These images are suitable for position measurements, finding charts and visual inspection of the near-infrared sky.', 'https://irsa.ipac.caltech.edu/applications/2MASS/IM', '', 'archive', '', '', np.float32(nan), 'infrared', ['https://irsa.ipac.caltech.edu/cgi-bin/2MASS/IM/nph-im_sia?type=ql&ds=asky&'], ['ivo://ivoa.net/std/sia'], ['vs:paramhttp'], ['std'])