dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Add TokenManagement solution to WMAgents #11199

Status: Open. vkuznet opened this issue 2 years ago.

vkuznet commented 2 years ago

Impact of the new feature
In order to start switching to token-based authentication, we need to decide on and set up a token management solution.

Is your feature request related to a problem? Please describe.
Currently there are multiple existing solutions:

Describe the solution you'd like
Decide which tool to use and adopt it in a cronjob for WMAgent. For that we need:

Describe alternatives you've considered

Additional context

#10118, #10939

amaltaro commented 2 years ago

Valentin, my preference would be to actually implement the WMAgent token management in the AgentStatusWatcher component. That means the component would be responsible for:

There is one drawback, though: if tokens have a very short lifetime, ensuring that this component is always up and running might become a problem. This includes a possible node crash/reboot, where condor jobs would be recreated while the agent is down...

vkuznet commented 2 years ago

Alan, I doubt it is a good idea. Token management should be independent of the WMA framework/tools. I don't see any benefit in reinventing the wheel. I pointed out three solutions that are independent of WMA/WMCore tools, and I don't see any benefit in incorporating yet another solution into the WMA/WMCore stack.

belforte commented 1 year ago

In today's meeting, Brian noted that keeping tokens refreshed for HTCondor jobs is a well-understood process at FNAL. No need to reinvent it. IIUC this means "talk to Farrukh to know more".

amaltaro commented 1 year ago

As discussed in today's WMCore team meeting, we decided to promote this issue to High priority this quarter, while https://github.com/dmwm/WMCore/issues/11728 is getting demoted to Medium priority (this one had been originally considered for Q4).

amaltaro commented 11 months ago

Just a brief update on this issue. I have been working on this with Stephan L. and HyunWoo from FNAL, running a few tests in submit1 and making a few changes here and there. We can see the ScitokensFile variable in the grid runtime environment, but we still need to ensure that the token:
a) is continuously updated on the schedd node;
b) gets transferred to the grid runtime;
c) gets continuously updated in the grid runtime.

And here is a link to my personal notes: https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_11199/token.txt
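As a rough illustration of how check (c) could be done from inside a running job, here is a minimal Python sketch. It assumes the token is a SciToken, i.e. a JSON Web Token (JWT), stored in a file the job can locate; the file path and the helper names below are hypothetical and not part of WMCore or HTCondor. It decodes the token payload without verifying the signature (so it is only a diagnostic, not an authentication check) and reports the time left before the standard `exp` claim.

```python
import base64
import json
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the payload (middle part) of a JWT WITHOUT verifying its signature."""
    try:
        _header, payload, _signature = token.strip().split(".")
    except ValueError:
        raise ValueError("not a JWT: expected three dot-separated parts") from None
    # JWTs use unpadded base64url encoding; restore the padding before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> float:
    """Seconds remaining before the token's 'exp' claim (negative if already expired)."""
    claims = jwt_claims(token)
    if now is None:
        now = time.time()
    return claims["exp"] - now


def check_token_file(path: str) -> float:
    """Hypothetical helper: read a token file and return its remaining lifetime."""
    with open(path) as fd:
        return seconds_until_expiry(fd.read())
```

Comparing `jwt_claims(...)["exp"]` from two reads of the same file, taken some time apart, would show whether the token on disk was replaced by a newer one, which is what "continuously updated" means in practice.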

amaltaro commented 10 months ago

Short update: we are still failing to obtain the Kerberos token automatically (based on the keytab). I also took this opportunity to update the text file mentioned in my previous comment.

amaltaro commented 8 months ago

Another update: the keytab has been created (it needs to be updated whenever there is a password change), and that seems to be working properly. However, we are still figuring out where the token is transferred to in the grid job and whether it is properly refreshed. That investigation depends on running workflows (jobs) in the grid and communicating with some experts at FNAL. Given the slow progress on that, I am moving this ticket to Waiting.

In addition, I have transferred the content of the token.txt file above to the WMCore documentation in GitLab: https://gitlab.cern.ch/dmwm/wmcore-docs/-/merge_requests/6

amaltaro commented 6 months ago

Rather than closing this issue, I now see that I actually misunderstood it. This ticket seems to ask for a solution to manage tokens within the agent, allowing it to communicate with external services (central services, MonIT, CRIC, Rucio, etc.), whereas I have been working on setting up a token on the agent side and propagating it to the production grid jobs.

A new ticket has been created, https://github.com/dmwm/WMCore/issues/11968, and it was just added to the project board under Q2/2024. I am now demoting/removing this issue from the current quarter.

belforte commented 6 months ago

Well, we need both. AFAIK @mapellidario has been waiting for you to take the lead on both, which seemed fine in the spirit of "WMA has this almost done, let's see what they have before we dive into it". But if this new effort is only starting now, please feel free to talk with him and find out whether he can help. I'd like to see tokens in use in CRAB before Dario leaves at the end of August :-)

belforte commented 6 months ago

Note: Brian B. was very clear that "this issue was already solved by e.g. FIFE people at FNAL", and IIRC Farrukh should know everything. IIUC the solution does not require running an OIDC agent. Sorry for the noise. You, Stephan, and Valentin likely already know more/better.

amaltaro commented 6 months ago

@belforte yes, Farrukh was helping with the condor setup on the FNAL schedd side but, as mentioned above, our focus was solely on letting HTCondor manage the token for us and making sure that an up-to-date token would be kept in the grid job and loaded by CMSSW. Unfortunately, we are not able to work on any of the other token-related issues at the moment, as other higher-priority projects are taking the team's effort.

I have this documented here: https://cms-wmcore.docs.cern.ch/wmcore/Tokens-in-WMAgent/#next-steps, or here for a better markdown experience.

Lastly, I think another discussion involving the Fermilab team and Brian is going to happen in May, so we might have a clearer roadmap on how to proceed with token integration as well. Nonetheless, we are happy to talk to Dario if he decides to get started on this.

anpicci commented 4 months ago

After talking with @stlammel, we agreed that this issue will be addressed once CMSWEB is tokenized, likely in Q1/2025.