dmwm / CMSRucio


Datasets monitoring service transition into Rucio infrastructure #382

Closed mrceyhun closed 1 month ago

mrceyhun commented 1 year ago

As suggested by @nsmith- in the "Fall22 Offline and Computing Week", I am opening this issue to discuss and carry out the transition of the Rucio dataset monitoring web service (http://cmsweb-test1.cern.ch:31280/) to the Rucio UI. First of all, we need to decide whether to migrate the whole pipeline to the Rucio infrastructure or just the Go web service.

You can find the full pipeline structure in O&C week presentation - Data Management Status of Monitoring and Tools, slide 8 of "3- Tabular datasets monitoring".

Here are the 3 services we are running in cmsweb-test1 cluster to implement this web page:

Each of them can be deployed standalone to any cluster, as long as it is inside the CERN network so that we do not have to deal with authentication. We can parametrize their connection endpoints, which are currently hard-coded. As CMS Monitoring, we are ready to give our full support to make the transition, and we are always ready to delegate responsibility for this monitoring to the Rucio or Data Management team at any time. If needed, we can set up a meeting to hand over the technical details of the Go web service, jQuery DataTables and Spark job to dear colleagues.
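
One way to remove the hard-coded endpoints mentioned above is to read them from environment variables, which Kubernetes can set differently per cluster. The sketch below is only an illustration: `PUSHGATEWAY_URL` is a variable named elsewhere in this thread, but `MONGO_URI` and both default values are hypothetical and not taken from the actual services.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoints:
    """Connection endpoints for the monitoring pipeline (illustrative names)."""
    mongo_uri: str
    pushgateway_url: str


def load_endpoints(env=None) -> Endpoints:
    """Read endpoints from the environment, falling back to local defaults.

    MONGO_URI is a hypothetical variable name; the defaults are placeholders
    meant for local testing only.
    """
    env = os.environ if env is None else env
    return Endpoints(
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
        pushgateway_url=env.get("PUSHGATEWAY_URL", "http://localhost:9091"),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without touching the real environment.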

fyi @leggerf @brij01 and @vkuznet who supported us in each step of the development process.

vkuznet commented 1 year ago

@mrceyhun , the cmsweb-test1 is a dev cluster allocated to the WMCore DMWM team, see here (https://cms-http-group.docs.cern.ch/k8s_cluster/cmsweb_developers_k8s_documentation/). As such, you should not use it to deploy Rucio stuff, as a DMWM team member may decide that (s)he needs a full cluster and wipe out all services. Instead, I suggest creating a dedicated cluster for this and deploying all stuff over there. You may request a new cluster from the HTTP team.

leggerf commented 1 year ago

I think we should really take the chance here to move it to a different cluster created for this purpose, and hand it over to the rucio team

On 3 Nov 2022, at 13:04, Valentin Kuznetsov @.***> wrote:

@mrceyhun https://github.com/mrceyhun , the cmsweb-test1 is a dev cluster allocated to WMCore DMWM team, see here https://cms-http-group.docs.cern.ch/k8s_cluster/cmsweb_developers_k8s_documentation/. As such, you should not use it to deploy rucio stuff as DMWM team member may decide that (s)he need a full cluster and wipe out all services. Instead, I suggest to create dedicated cluster for this and deploy all stuff over there. You may request a new cluster from HTTP team.


mrceyhun commented 1 year ago

@vkuznet you are right, and unfortunately I have been aware of it for a while. It started as a test and has continued as it is.

Actually, I don't know how Rucio UI works. If they can add (connect) a web service which is running in a different cluster, a new K8s cluster makes sense. In this case, everything will be deployed in a dedicated K8s cluster, but the web service will be accessed from Rucio UI. If possible, of course.

dynamic-entropy commented 1 year ago

Hi @mrceyhun, if the Monit team wants this to be managed as part of the https://github.com/dmwm/CMSRucio codebase, I can propose this in the CMSRucio biweekly meeting.

However, if we are talking about integration into the central Rucio codebase, right now I am not sure how that can be made possible. A few issues I see with that would be:

All of these are just personal observations, so there is a slight chance they are flawed. I can give you a definitive answer only after the biweekly meeting.

mrceyhun commented 1 year ago

@dynamic-entropy Rahul, first of all, we don't enforce or specifically want anything. Following the intention of the L1s and some CMS Rucio colleagues, we suggest transferring this service, which is running in production, to your responsibility. We are therefore not suggesting that its codebase be merged into CMSRucio; if you agree with the suggestion above, it is up to you to find a proper place for the codebase. We opened this issue in this repo because CMS Rucio colleagues use GitHub heavily. What I believe and see is that you have more detailed domain expertise and person power, so you can manage it better than we can. We are always ready to support you in your needs.

It would be great if you could discuss this in your group, and it would help us understand our responsibilities. Keep in mind that I will not be around starting with the second half of this year.

dynamic-entropy commented 1 year ago

Yup, completely agreed. We will try to find a proper place. For me, CMSRucio seems valid too, since this can be categorised as tooling around Rucio for CMS, and it is a great tool for users and operators alike.

dciangot commented 1 year ago

@mrceyhun I might have lost a lot of details in this long thread. That said, I'd like to understand whether the current proposal is to get the three components to work on the same cluster where Rucio runs, or just to collect the manifests under the CMSRucio repo + rucio-flux to delegate maintenance/ownership?

Side question: I suppose these components can be put under a Helm chart, is this correct?

dciangot commented 1 year ago

https://mattermost.web.cern.ch/cms-o-and-c/pl/p4rkenbqstf9986sm96iqzx9uh as per discussion on mattermost, we are going to discuss the last details in the Monitoring Wed meeting.

dciangot commented 1 year ago

@ericvaandering could you validate/comment the following plan to transfer the dataset monitoring dashboard into DM/Rucio ownership?

  1. mv https://github.com/dmwm/CMSMonitoring/tree/master/rucio-dataset-monitoring into CMSRucio/src/go/ or into a dedicated repo (personally I'd go with the former)
  2. put the docker images from here https://github.com/dmwm/CMSKubernetes/tree/master/docker/rucio-dataset-monitoring into CMSRucio/docker
  3. create a dedicated k8s cluster, and add its management under rucio-flux
  4. put the needed k8s manifests into flux apps
  5. won't work, debug with @mrceyhun :)

dciangot commented 1 year ago

Still pending. Executive summary: @dciangot will provide a document on the first implementation, to be discussed and agreed at one of the next CMS Rucio dev meetings.

ericvaandering commented 9 months ago

Thanks for the ping on this. I guess I did not do my homework on this. My question looking at this is why would we need/want a dedicated cluster for this. Especially if we are just talking about the production of the data.

leggerf commented 9 months ago

it does not need to be in a dedicated cluster. It is designed to run in k8s. Whether to deploy it in a separate cluster or in one you already have for other services is up to you to decide. Let us know how you want to proceed with this

ericvaandering commented 9 months ago

Is there a helm chart for this already then, or is it done with native k8s yaml files? In either case, it would be good to see what they are. What kind of resources are needed?

leggerf commented 9 months ago

we do not have helm, so it's plain yaml files. All the details about needed resources and location of yamls are given in the first post of this issue: https://github.com/dmwm/CMSRucio/issues/382#issue-1434525717

ericvaandering commented 9 months ago

The first link for 2 of the 3 bullet points is a 404 error. I suspect you moved things around :-)

leggerf commented 9 months ago

Right...here with fixed links:

As suggested by @nsmith- in the "Fall22 Offline and Computing Week", I am opening this issue to discuss and carry out the transition of the Rucio dataset monitoring web service (https://cms-dm-monitoring.cern.ch/) to the Rucio UI. First of all, we need to decide whether to migrate the whole pipeline to the Rucio infrastructure or just the Go web service.

You can find the full pipeline structure in O&C week presentation - Data Management Status of Monitoring and Tools, slide 8 of "3- Tabular datasets monitoring".

Here are the 3 services we are running in cmsweb-test1 cluster to implement this web page:

Each of them can be deployed standalone to any cluster, as long as it is inside the CERN network so that we do not have to deal with authentication. We can parametrize their connection endpoints, which are currently hard-coded. As CMS Monitoring, we are ready to give our full support to make the transition, and we are always ready to delegate responsibility for this monitoring to the Rucio or Data Management team at any time. If needed, we can set up a meeting to hand over the technical details of the Go web service, jQuery DataTables and Spark job to dear colleagues.

ericvaandering commented 9 months ago

OK, some basic questions.

The MongoDB. Is this a customized MongoDB for CMS use or is just the docker image customized? Is there a reason we couldn't use an existing helm chart to run a stock MongoDB with a docker image supplied by Mongo?

The web app. We don't direct traffic by endpoint, we direct it by hostname. So we'd need to add a hostname to the cluster for this application. But more importantly, these are accessible from offsite with firewall exemptions. And we rely on the application to do its own authentication/authorization. Is that an issue here? Without any AA, this will be world readable.

ericvaandering commented 9 months ago

For the cronjob, I don't really understand the service associated with a cron job. And with ports. That's not something I've seen before. What is the point of that?

leggerf commented 9 months ago

ericvaandering commented 9 months ago

Usually when you declare a service and ports, those are incoming ports. But you’re describing outgoing connections. So I’m still confused as to whether those are needed.

If the web server requires SSO then I think it best resides behind CMS web. While Rucio is in the CMS web OpenStack project, we don’t use any of its security.

And because the web server uses Mongo as a data source, I think it makes sense to keep that in CMS web as well, right?

So I think we should concentrate on the cron job as the best fit for moving to CMSRucio. Agreed?

leggerf commented 9 months ago

Yes, I agree, let's start with the cronjob. This is also the part that requires rucio domain knowledge in writing the spark aggregations.

For what concerns security, I thought the initial plan was somehow to move the web app to the Rucio web UI, which I assumed was behind SSO or some other form of security. Also, I don't know how the Rucio UI works or which storage solution it uses. Not MongoDB then?

For what concerns the ports, these are the ones that need to be defined when connecting to the spark analytix cluster as described here.

ericvaandering commented 9 months ago

First on the ports: https://hadoop-user-guide.web.cern.ch/spark/spark_ports/ does seem to imply that those ports are needed for incoming connections. This will be interesting....
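
For context on why a cron job would declare incoming ports at all: when a Spark driver runs in client mode inside a pod, the executors on the cluster connect back to the driver, so the driver's ports have to be fixed and reachable from outside the pod. A minimal sketch of how such a job might pin those ports follows. The property names are standard Spark settings, while the host name and port numbers used with it are placeholders, not the values of the actual cron job.

```python
def driver_port_conf(host: str, driver_port: int, block_manager_port: int) -> list[str]:
    """Build spark-submit arguments that pin the driver's listening ports."""
    settings = {
        "spark.driver.host": host,              # address executors use to reach the driver
        "spark.driver.bindAddress": "0.0.0.0",  # listen on all interfaces inside the pod
        "spark.driver.port": driver_port,       # driver RPC endpoint
        "spark.driver.blockManager.port": block_manager_port,  # block transfers
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

The returned list can be appended to a `spark-submit` command line; the same port numbers would then have to be exposed by whatever Kubernetes Service fronts the job.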

As far as I know, it's not possible to easily integrate something "arbitrary" into the Rucio WebUI. The WebUI is changing massively, probably next year, so we can evaluate I guess.

The WebUI has no storage; it's pages that collect data from the Rucio server and display them. So if the next version is extensible, it may be possible to add our own Javascript which would sit behind the authorization and connect with a go application running in the same cluster. Very hard to say at this point. :-)

ericvaandering commented 9 months ago

Another question: PUSHGATEWAY_URL -- Is this a prometheus pushgateway? If so, we run one in the Rucio cluster so we should be able to push metrics there and not have to connect outside the cluster. (Ours feeds into MONIT of course, via the CMS HA monitoring servers).

nikodemas commented 9 months ago

Yes, PUSHGATEWAY_URL refers to the Prometheus pushgateway.
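
As an illustration of what pushing to a Prometheus pushgateway involves, here is a small sketch using only the Python standard library. It relies on the pushgateway's documented HTTP interface (a `PUT` to `/metrics/job/<job>` with a body in the text exposition format). The job and metric names in the usage comment are made up, and this is not the code the cron job actually uses.

```python
import urllib.request
from urllib.parse import quote


def push_url(base_url: str, job: str) -> str:
    """Build the push endpoint; assumes a simple job name without slashes."""
    return f"{base_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"


def exposition(metrics: dict[str, float]) -> bytes:
    """Render metrics in the text exposition format, one 'name value' per line."""
    lines = [f"{name} {value}" for name, value in metrics.items()]
    return ("\n".join(lines) + "\n").encode("utf-8")


def push(base_url: str, job: str, metrics: dict[str, float]) -> int:
    """PUT the metrics for this job and return the HTTP status code."""
    request = urllib.request.Request(
        push_url(base_url, job), data=exposition(metrics), method="PUT"
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical usage, with PUSHGATEWAY_URL taken from the environment:
# push(os.environ["PUSHGATEWAY_URL"], "rucio_datasets_cron", {"job_success": 1})
```

With a pushgateway inside the same cluster, only the base URL changes; the request itself stays the same.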

Coming back to the question about MongoDB - it doesn't seem that the image we are using has some complex configuration (see Dockerfile and yaml file for Mongo deployment), so with some modifications it should work with a standard Mongo image too. In the CMS Monitoring infrastructure we don't use the Helm charts yet, so that could be a reason why it is defined like this.

vkuznet commented 9 months ago

@ericvaandering , there are three components:

Therefore, if you want to port it under Rucio umbrella the right things to do (in my view) would be:

ericvaandering commented 9 months ago

OK, let me back up a little to try to understand the impetus behind this.

Is the existing MongoDB used for this NOT the CMSWeb MongoDB? If not, then I would suggest the first thing for the monitoring team to do would be to move to using a centrally supported MongoDB. As already discussed, I think neither the MongoDB nor the SSO-enabled Go application are likely to be good candidates, at least initially, for running in the CMSRucio cluster.

Then, who is going to do this work? Even if “the work” is porting the spark cronjobs over to CMSRucio, that breaks down into about 4 tasks that I can think of:

  1. Move the code for the docker image to CMSRucio and perhaps make things more configurable

  2. Set up automatic builds and pushes of the docker image

  3. Convert the Kubernetes manifests currently used into a helm chart in CMSRucio

  4. Integrate, with flux, that helm chart

  1. Should be someone from the monitoring team

  2. Rahul can probably help with this

  3. Here this should be someone from the monitoring team with my advice

  4. Would be me.

I should point out 1 & 2 and 3 & 4 are independent tasks. E.g. we can set up a helm chart and flux with the existing docker image

Cheers,

Eric

On Oct 18, 2023, at 8:58 PM, Valentin Kuznetsov @.***> wrote:

@ericvaandering https://github.com/ericvaandering , there are three components:

  1. mongodb itself, and it can be run elsewhere, either on k8s or not. It can use standard MongoDB docker image. Moreover, CMSWEB group provides MongoDB as a Service which is used by WM services. Therefore, if you want you can either use CMSWEB MongoDB, set it up on your own k8s cluster or put it on dedicated machine
  2. spark jobs, run via cron, which collects and feed information into MongoDB. For spark, it is required to have open ports for spark to communicate. Therefore, spark jobs can write directly to specific MongoDB URI which can be visible within CERN network only.
  3. finally, web service, written in Go or you may have a different implementation, e.g. add it to Rucio web. Said that, it requires connection to MongoDB to fetch the data. Therefore, you may port existing Go service into your cluster, add new layer into Rucio web service (by using pymongo driver to access MongoDB), and/or develop JavaScript frontend to represent data in web UI. I don't know if JS has ability to access MongoDB directly but there is Node.js support https://www.w3schools.com/nodejs/nodejs_mongodb_query.asp

Therefore, if you want to port it under Rucio umbrella the right things to do (in my view) would be:

  1. ask CMSWEB to give you access to MongoDB (they will create you new credentials and you may create necessary database). Ask them to provide k8s port, they can make MongoDB port accessible within CERN network
  2. port cronjobs to Rucio cluster (if this is your goal) and run it to produce data. But crons will require access to spark cluster and for that you'll need open ports (again within CERN network)
  3. adjust crons to feed your MongoDB
  4. either take as is Go service, or later develop your own service to represent the data. But for that your service will need to read from MongoDB and therefore should have ability to access it from your programming language of choice. For python it is pymongo https://pymongo.readthedocs.io/en/stable/index.html module. I used a long time ago for DAS, now WM services use it to access MongoDB.
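
Reading the data back from MongoDB with pymongo, as suggested in the last step, could look roughly like the following. This is a sketch only: the field names `rse_type` and `size_bytes` are hypothetical, since the real document schema is not shown in this thread. The function accepts any pymongo-style collection and does not open a connection itself.

```python
from typing import Any


def largest_datasets(collection: Any, rse_type: str, limit: int = 10) -> list[dict]:
    """Return the `limit` largest datasets of one RSE type, biggest first.

    `collection` is expected to behave like a pymongo Collection; the field
    names used in the filter and sort are illustrative assumptions.
    """
    cursor = (
        collection.find({"rse_type": rse_type}, {"_id": 0})  # drop Mongo's internal id
        .sort("size_bytes", -1)                              # -1 means descending
        .limit(limit)
    )
    return list(cursor)


# Hypothetical usage with a real connection (URI, database and collection
# names are placeholders):
# from pymongo import MongoClient
# coll = MongoClient("mongodb://host:27017")["rucio"]["datasets"]
# rows = largest_datasets(coll, "DISK", limit=5)
```

Keeping the connection outside the function means the same query logic works whichever MongoDB instance ends up hosting the data.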

ericvaandering commented 8 months ago

@vkuznet please see above...

nikodemas commented 8 months ago

@ericvaandering I will start working on step 1 in your list and will let you/Rahul know once we are finished with that so we can move to the next stage.

Panos512 commented 6 months ago

Hi @nikodemas, happy new year! Where are we with this one? Once we port the code to CMSRucio I can try helping with the manifests and k8s deployment :)

nikodemas commented 6 months ago

Hi @Panos512, Happy New Year! In the next few days we are planning to update the tool for it to use the general cmsweb MongoDB instead of our custom one (we have tested it out on preprod before Christmas) and then we can start with the Kubernetes stuff. You can see a bit more details on our internal Jira ticket CMSMONIT-516.

Panos512 commented 6 months ago

awesome! Thanks @nikodemas

ericvaandering commented 1 month ago

Won't do?