Closed bsipocz closed 1 year ago
So... while I give you that I probably should not unconditionally catch ValueError-s n RegistryResource.search (but please note that I'd like to get rid of that method anyway; we should teach people to use get_service explicitly, as even now most records already have multiple capabilities), this is problem is unrelated to registry operations (phewy!).
On Thu, Jun 15, 2023 at 11:11:17PM +0000, Brigitta Sipőcz wrote:
import pyvo
from astropy.coordinates import SkyCoord coords = SkyCoord('150.01d 2.2d', frame='icrs')
image_services = pyvo.regsearch(servicetype='sia2') irsa_seip = [s for s in image_services if 'irsa' in s.ivoid and 'seip' in s.ivoid][0]
seip_results = irsa_seip.search(coords) [...] E10: None:3:15: E10: File does not appear to be a VOSICapabilities file
A shorter reproducer is:
import pyvo
svc = pyvo.dal.SIA2Service( "https://irsa.ipac.caltech.edu/SIA?COLLECTION=spitzer_seip&")
At that point we go down a rabbit hole (does someone reading this feel responsible for the SIA2 code? This would be the moment for the people who discarded by objections to SIA2 to come forward...).
SIA2Service, because of the way we had for a while planned to do authentication, iterates through the service's VOSI capabilities on construction (sia2.py#182 ff). It does not catch exceptions when doing that, although all that only become relevant with auth. When there's something wrong with the capabilities, the whole thing blows up.
For robustness with the overwhelming majority of SIA2 services (that don't require auth) I'd say there should be some exception catching (and probably downgrading to warnings) going on here. SIA2 itself works just perfectly well without ever looking at VOSI capabilities.
Regrettably, the rabbit hole goes deeper. Consider EndpointMixin in pyvo.dal.vosi, which has a _get_endpoint method trying the following URIs when trying to find VOSI endpoints:
'{baseurl}/{endpoint}'.format(baseurl=self.baseurl,
endpoint=endpoint),
url_sibling(self.baseurl, endpoint)
I'm impressed it's that diligent, I have to say (kudos to Adrian).
Anyway, in this case, where we have access_url-embedded URL parameters, that works out to
'https://irsa.ipac.caltech.edu/SIA?COLLECTION=spitzer_seip&/capabilities', 'https://irsa.ipac.caltech.edu/capabilities?COLLECTION=spitzer_seip&'
The first clearly is absurd. The second URI could, return the capabilities requested, but at least right now it's a 404, and I frankly would object to site-global capabilities somehow influenced by URL parameters (though I don't think we're forbidding that anyhwere).
The capabilities actually are at
https://irsa.ipac.caltech.edu/SIA/capabilities?COLLECTION=spitzer_seip&
which we might add as another guess, but... I have a bad feeling about that. There aren't any clear rules right now for how to find capabilities from an access URL outside of TAP, and
$ python -c "import this" | grep temptation In the face of ambiguity, refuse the temptation to guess.
So... here's what I think we should do:
(a1) downgrade the failure on bad capabilities to a warning, or, IMHO better,
(a2) remove the capabilities thing entriely from the constructor of SIA2Service if CADC (who I think are behind that experiment) agrees that auth negotiation won't happen like that any more.
(b) It would be wonderful if IRSA could stop relying on on built-in parameters in access URLs. They are trouble, although admittedly not outright illegal. And it's really nice if we can guess capability URIs by saying "try the capabilities sibling of the access URL" and it works.
(c) The "right" way to figure out capability URIs for us would be to run a RegTAP query along the lines of
select access_url from rr.capability natural join rr.interface
where standard_id='...vosi#availability'
and ivoid=
(I mention in passing that because for that particular service, IRSA does not register their VOSI capabilities, it still would have failed, but that then is really IRSA's fault).
We might put that RegTAP query into the VOSI mixin to use when all else fails, but at least it shouldn't be called by default, perhaps only when all other guesses have failed (and given that nothing but TAP really makes good use of VOSI caps, I wouldn't even bother with that).
For one, I'm not so wild about having one RegTAP query per pyVO service constructed worldwide. But perhaps more importantly, it would make an ivoid (and hence a registry entry) effectively mandatory for our service objects. While I'm a big fan of nudging people to register, I'd say that'd be going to far. Our service objects should keep on being constructable with just an access URL, no?
(I mention in passing that because for that particular service, IRSA does not register their VOSI capabilities, it still would have failed, but that then is really IRSA's fault).
@zoghbi-a sees the same issue for multiple heasarc services, so if multiple archives faulting on this, this maybe a shortcoming of the requirements. I don't know, haven't dived into the exact details but was surprised that things don't work (==> we need to make sure one of each service type is being tested in pyvo, knowing that something doesn't work is better than surprises I suppose)
I can confirm that this is present in many services. This an example code:
ivoid = 'ivo://astron.nl/__system__/siap2/sitewide'
srv = pyvo.regsearch(ivoid=ivoid)[0]
srv.get_service('sia2')
These are examples of services producing the same error
'ivo://astron.nl/__system__/siap2/sitewide'
'ivo://au.csiro/casda/sia2'
'ivo://nasa.heasarc/suzamaster'
...
For the irsa case, if I put the capabilities url by hand (https://irsa.ipac.caltech.edu/SIA/capabilities?COLLECTION=spitzer_seip&
) instead of the one being constructed from the url (https://irsa.ipac.caltech.edu/SIA?COLLECTION=spitzer_seip&/capabilities
), the issue disappears.
Do we know whether this is an issue for any other tools out there?
If the capabilities are in fact not required for a service to be registered (a very widespread nature of this issue suggests this, too), then it look to me that (a2) could be a way ahead or within pyvo to insert it into the URL as Adbu suggests above. Raising a user-facing warning at this point I don't think is useful. I'm sure archives would be willing to do to register the capabilities if that would be strictly required. If it's not a requirement, then OTOH, pyvo should just work ™️
On Fri, Jun 16, 2023 at 12:05:04PM -0700, Brigitta Sipőcz wrote:
Do we know whether this is an issue for any other tools out there?
Probably not. I think pyvo is afflicted mainly because it was a test bed for CADC's auth experiments.
If the capabilities are in fact not required for a service to be registered (a very widespread nature of this issue suggests this,
Ah, it's one of these silly pieces of standardese where people thought something should be required but then actually didn't need it, which is why in the end it doesn't even work. In this case, there is a formal requirement to have capabilities but no recipe just where it should be.
Given that:
too), then it look to me that (a2) could be a way ahead or within pyvo to insert it into the URL as Adbu suggests above.
Adrian, do we have green light for removing the capabilities experiment from SIA2 again? Or if you still need it, can we perhaps hide it from the general public by some @prototype decorator?
On Fri, Jun 16, 2023 at 11:54:47AM -0700, Abdu Zoghbi wrote:
I can confirm that this is present in many services. This an example code: ivoid = 'ivo://astron.nl/system/siap2/sitewide' srv = pyvo.regsearch(ivoid=ivoid)[0] srv.get_service('sia2')
Uh. That's an interesting one. For that, pyvo actually could find the right capability URI if it tried the
url_sibling(self.baseurl, endpoint)
from line 28 of pyvo.dal.vosi first. As things are, it tries
'{baseurl}/{endpoint}'
first, which in this case (and according to the old requirement of ending GET-table URIs with a question mark or an ampersand depending) works out to https://vo.astron.nl/__system__/siap2/sitewide/siap2.xml?/capabilities.
That's a SIA2 query, albeit a somewhat odd one. Since SIA2 I think does not say anything about rejecting unknown parameters, the underlying software (ahem) follows ancient (and bad) VO practice of just ignoring them. So, what comes back is a VOTable with a lot of matches, and parsing that as a capability document fails.
I would seem to me that regardless what we do with SIA2Service we need to re-think the capability URL guessing here (and perhaps talk to GWS and DAL to get a clear statement out of them how they think we should be doing this). Blindly attaching "/capabilities" to an access URL that will very often end in a question mark or even an ampersand certainly is not a good idea. Also, by DALI rules I'd strongly suggest we should try the sibling first.
cc @andamian
Sorry for the late reply.
It looks to me that irsa_seip.search()
is trying to instantiate the SIA2Service
with an access URL instead of a base URL. If that's the case, we could add an additional argument to the constructor to skip parsing the capabilities when that argument is present. Or, if we teach our users to get the access urls directly from the registries then we can ignore the capabilities
endpoint completely and deprecate baseulr
in the constructor in favour of access_url
. This model works fine for SIA2 (at least for now) but could be ambiguous for services with multiple endpoints.
As for the authentication, there was a time when we, at the CADC, had to have a different access points for Basic Authentication and the rest of auth methods. It's not the case anymore since we've stopped supporting that which means that direct access URLs should be more manageable now. As a side note, the entire pyvo.auth
module is right now based on the capabilities
endpoint as it aims at mapping the supported authentication mechanisms to the appropriate access URLs. It's probably due for an update although there no agreed solution right now.
On Mon, Jun 26, 2023 at 11:59:54AM -0700, Adrian wrote:
It looks to me that
irsa_seip.search()
is trying to instantiate theSIA2Service
with an access URL instead of a base URL. If that's the case, we could add an additional argument to the constructor to skip parsing the capabilities when that argument is present. Or, if we teach our users to get the access urls directly from the registries then we can ignore thecapabilities
endpoint
The access URL does come directly from the registry here -- and clearly everyone registers "the" {query} endpoint's URL as access URL (I frankly didn't even pause to think about it, but that may be professional blindness -- it's simply the say all the other S-services work).
completely and deprecate
baseulr
in the constructor in favour ofaccess_url
. This model works fine for SIA2 (at least for now) but could be ambiguous for services with multiple endpoints.
Ah. I frankly hadn't noticed that SIA2Service wants to be constructed with a base URL, either (see above on professional blindness)...
Again, all the other XService-s are constructed with what's in the Registry as access_url (which in TAP happens to be a, well, base URL). It wodld hence be lovely if I didn't have to special-case SIA2, and so I'd really like a VanillaSIA2Service (say) class that does take an access URL in the constructor for registry use.
Since Sect 2. of SIA2 says:
All {query} resources must be siblings of the VOSI-capabilities resource; this limitation enables a client with just the URL for an SIA {query} resource (e.g. from a Datalink service descriptor) to find the VOSI-capabilities...
VanillaSIA2Service could just strip a segment of the access URL and then up-call to the current SIA2Service, but due to the vagaries of endpoint guessing I'd rather not do that.
Instead, I think the ideal implementation would be to have VanillaSIA2Service to accept the access_url and not touch capabilities, wheres SIA2Service keeps its current interface but becomes basically a factory that parses the capabilities to find the/a access URL and then constructs VanillaSIA2Service.
be more manageable now. As a side note, the entire
pyvo.auth
module is right now based on thecapabilities
endpoint as it aims at mapping the supported authentication mechanisms to the appropriate access URLs. It's probably due for an update although there no agreed solution right now.
Yeah -- I think this whole thing is another indication that it's nice if we can do without mandatory parsing of capabilities... Aw, the cursed auth thing...
@andamian - do you have time on the fix for this or shall I? I would really like to have a new bugfix version out by the end of the month that fixes all these smallish SIA2 related bugs (basically all the bugs that we found except the 'image' regsearch, though if we have a solution/workaround in for that, too, that would be the most wonderful)
OK, so this is the only blocker remaining for 1.4.2, so I added the milestone.
On Wed, Aug 02, 2023 at 03:57:50PM -0700, Brigitta Sipőcz wrote:
OK, so this is the only blocker remaining for 1.4.2, so I added the milestone.
Slightly related to this SIA2 business is also https://github.com/astropy/pyvo/pull/449 -- and I think we should really tell people that the shouldn't use image and spectrum as servicetypes before there's more code using that.
Since it won't break anything: Can't it go into 1.4.2, too?
This bug is when one does a 'sia2'
servicetype registry search. So it's more than annoyance, basically renders the registry unusable for SIA2.
As for the for the 'image'
and 'spectrum'
deprecation, that should never go in a bugfix release.
I see where the suggestion is coming from, but as I see most people think they are needed, users are not expected to use ssa
, sia
, etc when they search, and until there is a clearly good alternative IMO we shouldn't deprecate. Besides, we have a lot of edu material that use the aliases so first those needs to be updated, too.
On Thu, Aug 03, 2023 at 04:54:46PM -0700, Brigitta Sipőcz wrote:
As for the for the
'image'
and'spectrum'
deprecation, that should never go in a bugfix release. I see where the suggestion is coming from, but as I see most people think they are needed, users are not expected to usessa
,sia
, etc when they search, and until there is a clearly good alternative IMO we shouldn't deprecate. Besides, we have a lot of edu material that use the aliases so first those needs to be updated, too.
The problem is: they don't do what people (can reasonably) expect them to do, and I don't see a (reasonable, let alone backwards-compatible) way to make them do what people can expect them to do. They are just misfeatures introduced before we realised what the introduction of obscore actually means -- and before we had experience with a major version step.
Would you be less disinclined to merge the PR if we didn't raise the Warnings but just deprecated in prose? You see, it'd be great if we at least did something to not accumulate more code out there that uses these misfeatures. And sure, it'd be great if we could update educational material that's out there to not show them any more – can you point me at some so I can start bugging authors?
(and apologies for hijacking this issue; I suppose we should continue this discussion over at #449)
I got bogged down in the weeds of trying to fix this, but at this point it's not worth using is as a blocker for the other fixes. So, I'll go ahead and release 1.4.2, and bump this into the next bugfix milestone.
I ran into this again, and I feel we should really fix it; it's an embarrassment if SIA2 discovery doesn't work. And since what's registered is access_url-s rather than TAP-like "base URLs", we'll need a constructor accepting these.
@andamian, do you have a plan for that? Or should I come up with a PR myself?
I second that this should be fixed sooner rather than later as in fact the IRSA SIA is not at all usable from the registry atm.
I tried to come up with a PR myself, but it would have required to nuke a lot of the current code and I didn't want to do that without input from Adrian.
I'm sorry, I've totally dropped the ball on this. Unfortunately, I'm swamped with other work at the moment and don't have time to do this PR right now - maybe by next week.
@msdemlei solution VanillaSIA2Service
should work but my preference is to make the SIA2Service
class more resilient to that URL. If the spec requires query and capabilities end points to be siblings, it could be easy to search for capabilities
either in the current path or the parent to figure out the context.
On Wed, Oct 25, 2023 at 03:34:12PM -0700, Adrian wrote:
that URL. If the spec requires query and capabilities end points to be siblings, it could be easy to search for
capabilities
either in the current path or the parent to figure out the context.
Mhh, I'm still staunchly in favour of not attempting to parse capabilities unless we really need them; it's just one extra thing that can break. On top, since the capabilities endpoint generally has no function for SIA2, it will break often (Markus' Interoperability Rule 20: "If it's not used by clients, it's broken").
The way our current SIA2Service works, that's not possible because the query endpoint's name is unpredictable. So, I really think we should have a separate service object with all the actual query functionality that, in analogy with the other S-services, is constructed just with the access URL (consistency is always nice). The current SIA2Service would then just be a factory function fiddling out the access url from the capabilities and then constructing the "plain" service object.
The NormalSIA2Service then wouldn't have to bother with capabilities at all, which I think is rather gratifying.
Oh, and: does CADC plan on using capabilities for auth negotiation in the future?
I've had a quick look at the code and I'm trying to determine exactly what the issue is here. The SIA2Service
class states that it can take a base url or an access point url. Both SIA2Service('https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/sia/v2query')
and SIA2Service('https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/sia')
work for me. In fact, I can put any sibling (such as the SIA query end point) and it still works. It doesn't work with v2query?
but that could be fixed if we consider that access URLs ending in ?
are valid.
@msdemlei , if I understand correctly, what you are proposing is a class that ignores capabilities. But the capabilities
end point is required in SIA2 so to provide code that circumvents our own standards doesn't seem right to me. If you think that capabilities
have no use in SIA2 (like it's the case I think for DataLink
) then the correct way to address this is to propose a change of the standard. My opinion is that this code should follow the standards.
The capabilities
end point is actually used at the CADC. We have one SIA
service and the capabilities
advertise the access urls for V1 and V2 of the standard. We still advertise the supported auth methods there since there's currently no viable alternative.
This is my rather limited view from the service perspective. I'm aware that I lack any expertise with the registry and don't know how things look from there. I hope to fill in some gaps at the registry sessions at the upcoming Interop.
On Thu, Oct 26, 2023 at 10:29:01AM -0700, Adrian wrote:
@msdemlei , if I understand correctly, what you are proposing is a class that ignores capabilities. But the
capabilities
end point is required in SIA2 so to provide code that circumvents our own standards doesn't seem right to me. If you think that
Well... It's just that in the normal workflow, SIA2 has no need to look at capabilities, and so I'd prefer to leave it alone, if only in the spirit of the golden rule of interoperability ("be strict in what you produce but lenient in what you expect"; we're not a validator, after all, and the SIA2 validators that exist don't look at capabilities either (yet) for all I know, so we have to expect them to be broken in general for SIA2). That we save an HTTP request if we don't use the capabilities is a welcome side effect.
capabilities
have no use in SIA2 (like it's the case I think forDataLink
) then the correct way to address this is to propose a
Well, that's another can of worms that nobody really wants to open; see http://ivoa.net/documents/caproles/ -- and that's another reason not to rely on capabilities without a strong reason.
change of the standard. My opinion is that this code should follow the standards.
Sure; but "following a standard" doesn't mean we should require more of services than we need to get our job done.
The
capabilities
end point is actually used at the CADC. We have oneSIA
service and thecapabilities
advertise the access urls for V1 and V2 of the standard. We still advertise the supported auth methods there since there's currently no viable alternative.
Well, SIA2Service doesn't need the V1/V2 dispatch; with VO integration, you make that choice at the discovery stage, and by the time SIA2Service is created, that choice is over.
So... what's the workflow on auth integration? Can we perhaps only look at the capabilities if people have passed in some credentials? Or when the inquire whether auth is possible and if so, which?
I could be persuaded to make parsing of capabilities
optional in SIA2Service
class, i.e. if the provided URL is a sibling of capabilities
, just assume is the correct query end point but making a separate class for that adds unnecessary confusion if this is part of the public API. Users will need to know which class to instantiate when the only difference is a small implementation detail.
At the moment, the workflow with auth integration could check the user credentials in the sessions
against the ones associated with the query end point in capabilities and potentially provide a better feedback than the simple brute force 403 error. The big problem for me here is that oftentimes the credentials are obtain to works for specific authorities/domains. How can the client find out that and prevent leaks of credentials to other domains? Not such a big problem with certificates but with bearer tokens, which essentially act as short term user/passwords.
On Thu, Oct 26, 2023 at 12:47:00PM -0700, Adrian wrote:
I could be persuaded to make parsing of
capabilities
optional inSIA2Service
class, i.e. if the provided URL is a sibling ofcapabilities
, just assume is the correct query end point but
In the scenario relevant here, it's usually what the data providers gave in their registry record; if that's wrong, there is nothing we can fix (and it would stand to reason that what you get from capabilities -- essentially, an excerpt of the registry record -- is then wrong, too).
making a separate class for that adds unnecessary confusion if this is part of the public API. Users will need to know which class to instantiate when the only difference is a small implementation detail.
Yes, I give you two classes are not pretty, but then some magic heuristically probing whether what is passed in is the parent of both capabilities and {query} or whether it's the normal access URL IMHO is worse. And it's not pretty either that SIA2Service works differently from all the other S*Service classes (that are all constructed with access_url).
So, I think there's no way around some sort of uglyness.
What about a classmethod from_access_url on SIA2Service with an explanation that that's what you ought to use when you want to work SIA2Service to work like the other S*Service classes? This could then switch off capabilities parsing (at least until we have a case that needs it or until there's an interoperable auth standard that needs it).
Breaking the current behaviour and making the argument an access_url by default is probably not an option, is it?
to works for specific authorities/domains. How can the client find out that and prevent leaks of credentials to other domains? Not such a big problem with certificates but with bearer tokens, which essentially act as short term user/passwords.
I realise auth is behind the current design, but if we can separate the API question from auth considerations as much as possible, it certainly would help a lot to solve it. Auth has been an unsolved problem for so many years, and I fear if we delay fixing SIA2Service until we've figured out auth will keep SIA2Service in its broken (for my -- registry-related -- purposes) state for a long, long time.
It was confirmed that this misbehaviour shows up in several heasarch SIA2 services so it's not an irsa related problem
If this is of any help