Open todor-ivanov opened 1 year ago
@todor-ivanov , thanks for wrapping this bug into specific issue. I just want to add here that we should be very careful with changes to auth/authz logic as it is used in different places. For instance, dbs2go and auth proxy go code relies on cmsauth repo which implements similar logic for Go based applications (APS/XPS, CMSMonitoring, DBS, etc.). Therefore, I suggest to elevate this issue to all relevant parties, and include CMSWEB/CMSMonitoring/DBS groups to take similar set of actions and provide similar changes and deploy newer version with that fixes.
I opened new issue in cmsauth repo, see https://github.com/dmwm/cmsauth/issues/1
FYI @muhammadimranfarooqi , @mrceyhun , @arooshap, @Paparrigopoulo
Thanks @vkuznet
For the additional issue. Absolutely agree with you. Just need to add something related to the name:
Adjust auth/authz module to support new headers
.
It is not just a new header that is added here. It is a new type of object also associated with the user. I checked the code further and it would require not only changes in the information we collect from CRIC through new headers. This change in CRIC will also imply new attributes to the user object && changes of all methods signatures and logic related to users' authentication/authorization. Yeee... what I wanted to say is - at the end it manifests to us in the form of a new HTTP header, but the story does not end just there ...
Valya I just updated the issue description, adding a twiki page related to the CMS IAM-Based Tokens, which Panos provided for reference.
AFAIU Panos did not expect, nor desired, the new group-type to appear as a header for the CMSWEB BackEnd services. So I encourage you to talk with him and decide if this really need to be supported as an additional authentication piece (as in your last comment) or whether WMCore should simply ignore this and possibly any other header which is not really interested on. I am not sure that we need to be strict in validating all the info that we get from CMSWEB FrontEnd, vs. using the pieces of it which are relevant for a give BackEnd.
That said, I have no strong feeling or requirement and "as long as it works" leave the matter to you.
Thanks @belforte, this is my intention as a first step as well. I just added a quick fix that is supposed to just obtain the proper information provided by the new header, but I am not going to check it anywhere down the code for the time being. I hope that would at least help me patch the current services in production.
Hi @panos512 ,
I know we have briefly discussed this new type of group in MM and the relevant Jira ticket, but as Alan mentioned in his comment here, it would be good to know what are the actual plans for using this new tag/group type. Like:
The reason for asking is we'd want to know how feasible is for us to investigate on full feature development right now, or should we postpone things before everything is clear. A complete list of requirements would also be crucial for us for the process of fixing this.
FYI @amaltaro
Hello @todor-ivanov and @amaltaro
First of all, I have patched the CRIC API that caused the issue [1] to only display groups with the group
and site
tags.
This means that the output will be in the expected format whatever new groups we add in CRIC.
Of course, it's very easy to enable any other tags in the future. So if we let's say decide that we wantiam-group
tagged groups to propagate through this API we can of course do it.
Now for the migration to tokens. The changes we are doing are so that the we sync some of the CRIC groups with IAM (cms-auth.web.cern.ch). Essentialy, we want to maintain user-groups in CRIC, tagged with the iam_group tag, that will queried by a script and be put in IAM (more details can be found in the twiki Todor posted above [2]). I can happily share more details, we also have a talk scheduled at the O&C week on Friday morning about this topic.
There are also some slides [3] that I presented to my group but the audience was different, maybe just check slides 24-34 if you are interested.
So, these groups are most probably of no interest to WMCore.
They will be put in IAM and special scope policies will be assigned to the groups.
e.g. the /cms/storage/pp_prod group will be assigned a scope policy that will give all the tokens of its members the storage.write scope for the /store/data/*
area
The /cms/compute/scope
group will give the compute.create
policy to be able to create jobs etc
This means that in any client level when a token arrives one can check the scopes it has assigned and decide what to do with it.
We want to have this sync mechanism running in the next couple of weeks so that we start testing other services. I initially joined this effort to take care of the data management part so I know that we need to start testing that storage services correctly interpret tokens right after. I'm not sure what the plans of other areas are.
Hope this helps :)
[1] https://cms-cric.cern.ch/api/accounts/user/query/?json&preset=roles [2] https://twiki.cern.ch/twiki/bin/viewauth/CMS/IAMTokens [3] https://docs.google.com/presentation/d/1g8pnzLJ5gNeFLbwcWUVOaL3aMu8cGEI-Do1cfHsIaF0/edit?usp=sharing
Thanks @Panos512 ,
This indeed outlines the big picture, and plus, provides the sources and all details needed for people who want to dive in deeper. It was also good to see your plans and where things are heading at the moment. And also thanks for hiding the new group tag for WMCore and other services using the legacy APIs.
So to resume (please correct me if I am wrong) - In the current situation, those activities and tests would be accomplished mostly by data and grid resource management groups and eventually some other teams who are willing to jump early. But as it comes to WMCore, we are in a safe position to sit and decide how to approach it. Given that your patch would not block others for the time being, there is no urgency on us for making any changes and new code (re)designs in terms of authentication in WMCore.
All that said, I am now in favor of decreasing the priority level of this issue to a more bearable one rather than the Highest
possible and let it be open. That issue could be used to start and track a discussion inside WMCore to double check all the changes needed (and also to asses how much of the work has already been done in past activities related to token based authentication), such that we can eventually start utilizing this new iam-group
tag. And I am pretty sure @amaltaro would pretty much agree with me here.
FYI: @klannon, @vkuznet, @khurtado
Thank for these details, @todor-ivanov @Panos512.
Yes Todor. Let's demote it for now and tackle this issue whenever tokens come to our planned quarterly activities. I just changed a few tags. You might want to unassigned yourself and move it back to the Todo list as well (and away from Q1).
@Panos512 , in order to properly resolve this issue I need to know from you how CRIC data-format will evolve, what is its schema, etc. In particular, I need to know where in user record of CRIC the iam_group
appeared. For instance from current JSON I deduce the following:
{
"DN": "dn_string",
"EMAIL": "user@mail.com",
"ID": <INT>,
"LOGIN": "login_name",
"NAME": "First Last names",
"ROLES": {
"admin": [
"group:xxx",
"group:yyy"
],
"another_role": [
"group:xxx",
"group:yyy"
]
}
},
So, the two questions I need to know:
iam_group
appear in such user record, and how it is represented in schema. It would be useful to provide specific example of such recordOnce you'll provide this info I can adjust code accordingly to handle new attributes in CRIC records.
Hi @vkuznet absolutely, let me start by explaining how our schema looks like and what was the iam_group
that appeared. I can then explain what we expose through the legacy roles API (the one that you mention in your comment)
In general, inside CRIC a user group can have a GroupTag
and a Role
attached to it.
Here is the schema of the two:
class GroupTag:
name = models.CharField(max_length=64, help_text='Unique name of tag. A Tag is used to aggregate various groups together', verbose_name='Tag name')
rel = models.CharField(max_length=64, verbose_name='relation', help_text='Type of tag (relation value), e.g. "group" or "facility"', blank=True, default='group')
title = models.CharField(max_length=128, help_text='Optional title value for a tag', blank=True, default='')
class Role:
name = models.CharField(max_length=64, unique=True, help_text='Unique identifier', verbose_name='Role name')
title = models.CharField(max_length=128, help_text='Custom title for a role', blank=True, default='', verbose_name='Role title')
The new iam_group
that appeared was a new GroupTag for groups that will be synced with IAM.
You can see one of them exposed through our main API here https://cms-cric.cern.ch/api/accounts/group/query/?json&name=CMS_IAM_pp_prod
The problem came because I have forgotten how we expose information thought the legacy Roles API preset that was developed to offer backwards compatibility with SiteDB when we migrated few years ago.
In the roles API we expose the list of cms-cric users, adding the ROLES attribute. This attribute contains for each group of the user the following information:
"ROLES": {
"Group.role:name": [
"Group.tag.rel": "Group.tag.name"
]
}
On top of that, for all groups with the facility
tag we artificially explode them to the corresponding sites.
e.g. if a facility has SiteA SiteB and SiteC for each facility
tagged group we will expose three entries with site:SiteA, site:SiteB and site:SiteC
When the groups with the new iam_group
tag appeared we had new entries in the roles API, such as iam_group:groupA, iam_group:groupB
etc.
Since the clients expected only site
and group
groups, they broke with iam_group
.
In order to avoid future problems I have modified all those legacy APIs to only expose groups with the group
tag and the exploded site
groups coming from groups with the facility
tag. New tags and roles shouldn't go propagate through those legcay APIs.
As for schema evolution there are no plans for it to evolve any time soon. The problem was that in the roles API we expose values of the schema as schema attributes (by using the role name and the tag rel as keys). So when I thought I was adding a new value I was technically introducing new attributes in the roles API.
I hope this helps, I can share more details/clarifications if needed
Panos, thanks for details. I would like to make few comments:
.
, or especially colon :
since the last one is certainly used as separator in different code bases. Therefore, I would expect the structure to be:
"ROLES": {
"GroupRole:name": [
"GroupTagRel": "GroupTagName"
]
}
iam_group
, e.g.
"ROLES": {
"user": [
"group": "dbs",
"iam_group": "test"
]
}
I just need concrete example, especially what you added which caused the issue (I think the structure like this).
@vkuznet Valentin, you probably missed some of the progress already made here.
Todor started working on it and provided this candidate fix: https://github.com/dmwm/WMCore/pull/11496
but given that Panos "fixed" this new group on the CRIC side, we deciced to wait for a final decision on how the iam_group will be deployed in CRIC, before proceeding with any changes in WMCore.
Since then, this issue has been demoted and will be tackled in the future.
Alan, I did not miss anything and saw Todor's patch. I would like to provide a fix which will avoid patching the code every time something will be added in CRIC. Once I'll provide new PR it will incorporate Todor's fix.
This is no longer a priority for this quarter. If you have a free slot, please reach out to me on Slack and we can work on something that is originally planned for this quarter.
Apologies for missing the "Urgent Ops" label, that should have been removed when Todor demoted it.
Impact of the bug All services.
Describe the bug We were contacted to check an
HTTP error 500
coming from our services related to a single DN. While checking the logs I managed to confirm that certain DNs do trigger an exception in our authentication code at the REST module. [1] (user related information has been stripped off for security reasons).Upon further debugging the root of the problem has been identified, and even the time of the OPS activities causing it matched perfectly with our observations.
In WMCore there is this function, which checks all groups and roles associated to a user in order to authenticate him: https://github.com/dmwm/WMCore/blob/747c70a8a6508765f460073ba9d288f9e3d97d77/src/python/WMCore/REST/Auth.py#L127
This function lives in the REST module and does these checks by iterating through all HTTP headers of the request, both:
In one of those headers added by the FE we now see an unknown type of group:
iam_group
. Checking with @drkovalskyi and @Panos512 it became clear that the new type/category of groups has been added to CRIC yesterday related to some new development for enabling token based authentication. And only a particular set of users have been associated with it. The error we observe is simply because we know nothing about this new category of role/group. If it was just a new group... that wouldn't have broken anything, but being it a completely new type, then few lines of code need to be added on our side.Here is the Jira Ticket associated with this: [2]
And here is a really important twiki page explaining what it is all about [3]
How to reproduce it Anybody who has this new type of group/role associated with his account in CRIC would receive
Internel server error HTTP 500
when trying to connect to WMCore services, regardless of the client or the method of authentication he uses (so far x509 only).Expected behavior Not to have the service broken because of a change in CRIC.
Additional context and error message [1]
[2] https://its.cern.ch/jira/browse/CMSKUBERNETES-220
[3] https://twiki.cern.ch/twiki/bin/viewauth/CMS/IAMTokens