Broken authentication for users in the REST module upon adding a new type of groups in CRIC

todor-ivanov commented 1 year ago

Impact of the bug All services.

Describe the bug We were contacted to check an HTTP error 500 coming from our services related to a single DN. While checking the logs I managed to confirm that certain DNs do trigger an exception in our authentication code at the REST module. [1] (user related information has been stripped off for security reasons).

Upon further debugging the root of the problem has been identified, and even the time of the OPS activities causing it matched perfectly with our observations.

In WMCore there is this function, which checks all groups and roles associated to a user in order to authenticate him: https://github.com/dmwm/WMCore/blob/747c70a8a6508765f460073ba9d288f9e3d97d77/src/python/WMCore/REST/Auth.py#L127
This function lives in the REST module and does these checks by iterating through all HTTP headers of the request, both:
- those coming from the user's original request and
- those added by the Front End upon fetching the relevant user information from CRIC. https://github.com/dmwm/WMCore/blob/747c70a8a6508765f460073ba9d288f9e3d97d77/src/python/WMCore/REST/Auth.py#L17-L20

In one of those headers added by the FE we now see an unknown type of group: iam_group. Checking with @drkovalskyi and @Panos512 it became clear that the new type/category of groups has been added to CRIC yesterday related to some new development for enabling token based authentication. And only a particular set of users have been associated with it. The error we observe is simply because we know nothing about this new category of role/group. If it was just a new group... that wouldn't have broken anything, but being it a completely new type, then few lines of code need to be added on our side.

Here is the Jira Ticket associated with this: [2]

And here is a really important twiki page explaining what it is all about [3]

How to reproduce it Anybody who has this new type of group/role associated with his account in CRIC would receive Internel server error HTTP 500 when trying to connect to WMCore services, regardless of the client or the method of authentication he uses (so far x509 only).

Expected behavior Not to have the service broken because of a change in CRIC.

Additional context and error message [1]

[23/Feb/2023:00:24:17]  
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/cherrypy/_cprequest.py", line 117, in run_hooks
    hook()
  File "/usr/local/lib/python3.8/site-packages/cherrypy/_cprequest.py", line 65, in __call__
    return self.callback(**self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Auth.py", line 138, in authz_user
    cherrypy.request.user = user_info_from_headers(key, verbose)
  File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Auth.py", line 59, in user_info_from_headers
    user['roles'][hkname][site_or_group].add(name)
KeyError: 'iam_group'
[23/Feb/2023:00:24:17] HTTP 
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/cherrypy/_cprequest.py", line 638, in respond
    self._do_respond(path_info)
  File "/usr/local/lib/python3.8/site-packages/cherrypy/_cprequest.py", line 688, in _do_respond
    self.hooks.run('before_request_body')
  File "/usr/local/lib/python3.8/site-packages/cherrypy/_cprequest.py", line 95, in run
    self.run_hooks(iter(sorted(self[point])))
  File "/usr/local/lib/python3.8/site-packages/cherrypy/_cprequest.py", line 117, in run_hooks
    hook()
  File "/usr/local/lib/python3.8/site-packages/cherrypy/_cprequest.py", line 65, in __call__
    return self.callback(**self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Auth.py", line 138, in authz_user
    cherrypy.request.user = user_info_from_headers(key, verbose)
  File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Auth.py", line 59, in user_info_from_headers
    user['roles'][hkname][site_or_group].add(name)
KeyError: 'iam_group'
[23/Feb/2023:00:24:17] HTTP 
Request Headers:
  Remote-Addr: 137.138.158.176
  HOST: cmsweb-k8s-prodsrv.cern.ch
  X-REQUEST-ID: 3cb59b9a52a286789efecda1345777e6
  X-REAL-IP: 188.185.101.116
  X-FORWARDED-FOR: 188.185.101.116
  X-FORWARDED-HOST: cmsweb-k8s-prodsrv.cern.ch
  X-FORWARDED-PORT: 80
  X-FORWARDED-PROTO: http
  X-FORWARDED-SCHEME: http
  X-SCHEME: http
  X-ORIGINAL-FORWARDED-FOR: 188.184.104.18
  USER-AGENT: PycURL/7.29.0
  ACCEPT: */*
  HTTPS: on
  CMS-REQUEST-URI: /t0_reqmon/data/requestcache
  CMS-AUTH-STATUS: OK
  CMS-AUTH-CERT: 
  ....
  X-FORWARDED-SERVER: localhost

[2] https://its.cern.ch/jira/browse/CMSKUBERNETES-220

[3] https://twiki.cern.ch/twiki/bin/viewauth/CMS/IAMTokens

vkuznet commented 1 year ago

@todor-ivanov , thanks for wrapping this bug into specific issue. I just want to add here that we should be very careful with changes to auth/authz logic as it is used in different places. For instance, dbs2go and auth proxy go code relies on cmsauth repo which implements similar logic for Go based applications (APS/XPS, CMSMonitoring, DBS, etc.). Therefore, I suggest to elevate this issue to all relevant parties, and include CMSWEB/CMSMonitoring/DBS groups to take similar set of actions and provide similar changes and deploy newer version with that fixes.

I opened new issue in cmsauth repo, see https://github.com/dmwm/cmsauth/issues/1

FYI @muhammadimranfarooqi , @mrceyhun , @arooshap, @Paparrigopoulo

todor-ivanov commented 1 year ago

Thanks @vkuznet
For the additional issue. Absolutely agree with you. Just need to add something related to the name: Adjust auth/authz module to support new headers.

It is not just a new header that is added here. It is a new type of object also associated with the user. I checked the code further and it would require not only changes in the information we collect from CRIC through new headers. This change in CRIC will also imply new attributes to the user object && changes of all methods signatures and logic related to users' authentication/authorization. Yeee... what I wanted to say is - at the end it manifests to us in the form of a new HTTP header, but the story does not end just there ...

todor-ivanov commented 1 year ago

Valya I just updated the issue description, adding a twiki page related to the CMS IAM-Based Tokens, which Panos provided for reference.

belforte commented 1 year ago

AFAIU Panos did not expect, nor desired, the new group-type to appear as a header for the CMSWEB BackEnd services. So I encourage you to talk with him and decide if this really need to be supported as an additional authentication piece (as in your last comment) or whether WMCore should simply ignore this and possibly any other header which is not really interested on. I am not sure that we need to be strict in validating all the info that we get from CMSWEB FrontEnd, vs. using the pieces of it which are relevant for a give BackEnd.

That said, I have no strong feeling or requirement and "as long as it works" leave the matter to you.

todor-ivanov commented 1 year ago

Thanks @belforte, this is my intention as a first step as well. I just added a quick fix that is supposed to just obtain the proper information provided by the new header, but I am not going to check it anywhere down the code for the time being. I hope that would at least help me patch the current services in production.

todor-ivanov commented 1 year ago

Hi @panos512 ,

I know we have briefly discussed this new type of group in MM and the relevant Jira ticket, but as Alan mentioned in his comment here, it would be good to know what are the actual plans for using this new tag/group type. Like:

Is this already final decision on how this token based authentication mechanism would work?
Would that be the ultimate tag name to be used - not changing in the future?
And eventually, what are the expected timelines for implementing this and putting in production?

The reason for asking is we'd want to know how feasible is for us to investigate on full feature development right now, or should we postpone things before everything is clear. A complete list of requirements would also be crucial for us for the process of fixing this.

FYI @amaltaro

Panos512 commented 1 year ago

Hello @todor-ivanov and @amaltaro

First of all, I have patched the CRIC API that caused the issue [1] to only display groups with the group and site tags. This means that the output will be in the expected format whatever new groups we add in CRIC.

Of course, it's very easy to enable any other tags in the future. So if we let's say decide that we wantiam-group tagged groups to propagate through this API we can of course do it.

Now for the migration to tokens. The changes we are doing are so that the we sync some of the CRIC groups with IAM (cms-auth.web.cern.ch). Essentialy, we want to maintain user-groups in CRIC, tagged with the iam_group tag, that will queried by a script and be put in IAM (more details can be found in the twiki Todor posted above [2]). I can happily share more details, we also have a talk scheduled at the O&C week on Friday morning about this topic.

There are also some slides [3] that I presented to my group but the audience was different, maybe just check slides 24-34 if you are interested.

So, these groups are most probably of no interest to WMCore. They will be put in IAM and special scope policies will be assigned to the groups. e.g. the /cms/storage/pp_prod group will be assigned a scope policy that will give all the tokens of its members the storage.write scope for the /store/data/* area The /cms/compute/scope group will give the compute.create policy to be able to create jobs etc

This means that in any client level when a token arrives one can check the scopes it has assigned and decide what to do with it.

We want to have this sync mechanism running in the next couple of weeks so that we start testing other services. I initially joined this effort to take care of the data management part so I know that we need to start testing that storage services correctly interpret tokens right after. I'm not sure what the plans of other areas are.

Hope this helps :)

[1] https://cms-cric.cern.ch/api/accounts/user/query/?json&preset=roles [2] https://twiki.cern.ch/twiki/bin/viewauth/CMS/IAMTokens [3] https://docs.google.com/presentation/d/1g8pnzLJ5gNeFLbwcWUVOaL3aMu8cGEI-Do1cfHsIaF0/edit?usp=sharing

todor-ivanov commented 1 year ago

Thanks @Panos512 ,

This indeed outlines the big picture, and plus, provides the sources and all details needed for people who want to dive in deeper. It was also good to see your plans and where things are heading at the moment. And also thanks for hiding the new group tag for WMCore and other services using the legacy APIs.

So to resume (please correct me if I am wrong) - In the current situation, those activities and tests would be accomplished mostly by data and grid resource management groups and eventually some other teams who are willing to jump early. But as it comes to WMCore, we are in a safe position to sit and decide how to approach it. Given that your patch would not block others for the time being, there is no urgency on us for making any changes and new code (re)designs in terms of authentication in WMCore.

All that said, I am now in favor of decreasing the priority level of this issue to a more bearable one rather than the Highest possible and let it be open. That issue could be used to start and track a discussion inside WMCore to double check all the changes needed (and also to asses how much of the work has already been done in past activities related to token based authentication), such that we can eventually start utilizing this new iam-group tag. And I am pretty sure @amaltaro would pretty much agree with me here.

FYI: @klannon, @vkuznet, @khurtado

amaltaro commented 1 year ago

Thank for these details, @todor-ivanov @Panos512.

Yes Todor. Let's demote it for now and tackle this issue whenever tokens come to our planned quarterly activities. I just changed a few tags. You might want to unassigned yourself and move it back to the Todo list as well (and away from Q1).

vkuznet commented 1 year ago

@Panos512 , in order to properly resolve this issue I need to know from you how CRIC data-format will evolve, what is its schema, etc. In particular, I need to know where in user record of CRIC the iam_group appeared. For instance from current JSON I deduce the following:

  {
    "DN": "dn_string",
    "EMAIL": "user@mail.com",
    "ID": <INT>,
    "LOGIN": "login_name",
    "NAME": "First Last names",
    "ROLES": {
      "admin": [
        "group:xxx",
        "group:yyy"
      ],
      "another_role": [
        "group:xxx",
        "group:yyy"
      ]
    }
  },

So, the two questions I need to know:

where the iam_group appear in such user record, and how it is represented in schema. It would be useful to provide specific example of such record
how in a future the CRIC schema will evolve, i.e. in particular with respect to IAM attributes. Please provide a schema for that with examples.

Once you'll provide this info I can adjust code accordingly to handle new attributes in CRIC records.

Panos512 commented 1 year ago

Hi @vkuznet absolutely, let me start by explaining how our schema looks like and what was the iam_group that appeared. I can then explain what we expose through the legacy roles API (the one that you mention in your comment)

In general, inside CRIC a user group can have a GroupTag and a Role attached to it.

Here is the schema of the two:

class GroupTag:
    name = models.CharField(max_length=64, help_text='Unique name of tag. A Tag is used to aggregate various groups together', verbose_name='Tag name')
    rel = models.CharField(max_length=64, verbose_name='relation', help_text='Type of tag (relation value), e.g. "group" or "facility"', blank=True, default='group')
    title = models.CharField(max_length=128, help_text='Optional title value for a tag', blank=True, default='')

class Role:
    name = models.CharField(max_length=64, unique=True, help_text='Unique identifier', verbose_name='Role name')
    title = models.CharField(max_length=128, help_text='Custom title for a role', blank=True, default='', verbose_name='Role title')

The new iam_group that appeared was a new GroupTag for groups that will be synced with IAM. You can see one of them exposed through our main API here https://cms-cric.cern.ch/api/accounts/group/query/?json&name=CMS_IAM_pp_prod

The problem came because I have forgotten how we expose information thought the legacy Roles API preset that was developed to offer backwards compatibility with SiteDB when we migrated few years ago.

In the roles API we expose the list of cms-cric users, adding the ROLES attribute. This attribute contains for each group of the user the following information:

"ROLES": {
   "Group.role:name": [
      "Group.tag.rel": "Group.tag.name"
   ]
}

On top of that, for all groups with the facility tag we artificially explode them to the corresponding sites. e.g. if a facility has SiteA SiteB and SiteC for each facility tagged group we will expose three entries with site:SiteA, site:SiteB and site:SiteC

When the groups with the new iam_group tag appeared we had new entries in the roles API, such as iam_group:groupA, iam_group:groupB etc. Since the clients expected only site and group groups, they broke with iam_group.

In order to avoid future problems I have modified all those legacy APIs to only expose groups with the group tag and the exploded site groups coming from groups with the facility tag. New tags and roles shouldn't go propagate through those legcay APIs.

As for schema evolution there are no plans for it to evolve any time soon. The problem was that in the roles API we expose values of the schema as schema attributes (by using the role name and the tag rel as keys). So when I thought I was adding a new value I was technically introducing new attributes in the roles API.

I hope this helps, I can share more details/clarifications if needed

vkuznet commented 1 year ago

Panos, thanks for details. I would like to make few comments:

You should never use special symbols, like dot ., or especially colon : since the last one is certainly used as separator in different code bases. Therefore, I would expect the structure to be:
```
"ROLES": {
"GroupRole:name": [
  "GroupTagRel": "GroupTagName"
]
}
```
please provide example of CRIC schema entries which had iam_group, e.g.
```
"ROLES": {
"user": [
  "group": "dbs",
  "iam_group": "test"
]
}
```
I just need concrete example, especially what you added which caused the issue (I think the structure like this).

amaltaro commented 1 year ago

@vkuznet Valentin, you probably missed some of the progress already made here.

Todor started working on it and provided this candidate fix: https://github.com/dmwm/WMCore/pull/11496

but given that Panos "fixed" this new group on the CRIC side, we deciced to wait for a final decision on how the iam_group will be deployed in CRIC, before proceeding with any changes in WMCore.

Since then, this issue has been demoted and will be tackled in the future.

vkuznet commented 1 year ago

Alan, I did not miss anything and saw Todor's patch. I would like to provide a fix which will avoid patching the code every time something will be added in CRIC. Once I'll provide new PR it will incorporate Todor's fix.

amaltaro commented 1 year ago

This is no longer a priority for this quarter. If you have a free slot, please reach out to me on Slack and we can work on something that is originally planned for this quarter.

Apologies for missing the "Urgent Ops" label, that should have been removed when Todor demoted it.

dmwm / WMCore

Broken authentication for users in the REST module upon adding a new type of groups in CRIC #11495