dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

rewrite global monitor (reqmon) #2416

Closed ticoann closed 12 years ago

ticoann commented 12 years ago

Due to the problem with globalmonitor that Lassi pointed out in https://svnweb.cern.ch/trac/CMSDMWM/ticket/2399, and following Simon's suggestion (by chat), globalmonitor needs to be rewritten as reqmon.

A brief initial design suggestion is as follows (thanks to Simon).

  1. WMAgent/LocalQueue pushes the necessary information to the reqmon backend (probably CouchDB), i.e. either the local CouchDB is replicated to the reqmon CouchDB, or a new component periodically uploads the information. This should be minimal information. (Not sure how to handle the local CouchDB link, assuming there will be a firewall.) A minimal sketch of such a push follows after this list.
  2. Information from the request manager, global queue and workload summary will be accessed via the REST APIs they provide (either real-time AJAX calls or periodic cron updates).
  3. Provide a reqmon couchapp to view the request status.
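
As an illustration of point 1 (not an actual WMCore component; the URL, agent name and document fields are assumptions), an agent-side process could periodically overwrite a small per-request summary document in the central reqmon CouchDB:

```python
# Hypothetical sketch only: push a minimal per-request summary from the agent
# to a central reqmon CouchDB over plain HTTP. All names/URLs are placeholders.
import json
import time
import urllib.error
import urllib.request

REQMON_COUCH_URL = "http://reqmon.example.cern.ch:5984/reqmon"  # assumed central DB
AGENT_NAME = "vocms_agent_t1"                                   # assumed agent id

def push_summary(request_name, summary):
    """Create or overwrite the per-agent, per-request summary document."""
    doc_id = "%s-%s" % (AGENT_NAME, request_name)
    url = "%s/%s" % (REQMON_COUCH_URL, doc_id)
    try:
        # Fetch the current revision (if any) so the PUT overwrites the old report.
        with urllib.request.urlopen(url) as resp:
            summary["_rev"] = json.load(resp)["_rev"]
    except urllib.error.HTTPError:
        pass  # first upload for this request, no revision yet
    summary.update({"agent": AGENT_NAME,
                    "request": request_name,
                    "timestamp": int(time.time())})
    req = urllib.request.Request(url,
                                 data=json.dumps(summary).encode("utf-8"),
                                 headers={"Content-Type": "application/json"},
                                 method="PUT")
    urllib.request.urlopen(req)
```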

One potential problem: since data is gathered from different sources at different times, it might not be synchronized in the monitor (job numbers, status, etc.).

Please give me any advice and comments.

ticoann commented 12 years ago

sryu: Also, I would like to get some guidance on the requirements. The initial requirements and restrictions were:

  1. The request monitor will provide real-time monitoring of currently running (or just completed) requests, not of archived results.
  2. It needs the ability to drill down into the jobs of a request for more detailed information.
  3. The number of agents the request monitor needs to contact is small (one for T1 and a few for T2).

I wasn't assuming the agent/localqueue service would be behind a firewall (probably a wrong assumption). If there is a firewall, I am not sure how point 2 can be done unless the necessary information is propagated upstream.

spigad commented 12 years ago

spiga: Simon, isn't this topic somehow related to #2141? If I understand correctly, what you describe above should exactly cover the use case discussed there. If so, we could merge the two things...

drsm79 commented 12 years ago

metson: Replying to [comment:2 spiga]:

isn't this topic somehow related to #2141? If I understand correctly, what you describe above should exactly cover the use case discussed there. If so, we could merge the two things...

No, not really. #2141 is about job monitoring; this is about agent health monitoring (i.e. identifying large backlogs, stuck requests, higher than usual memory use, etc.). They'll use the same architecture/protocol (CouchDB replication) but will live in different databases. I guess there should be some shared experience, and operationally they'll be similar, but I think that's where it stops.

drsm79 commented 12 years ago

metson: I think we need to take a "data driven" approach here. What data do Ops need to see, how is it currently provided, and what is its volume? Some of the drill-down stuff might rely on SSH tunnels into the agents (which IMHO is valid) if the amount of data is too high to replicate back (which I suspect is the case). Data appearing out of time is the nature of the beast (we have an eventually consistent system), so we're going to have to think about the implications of that.

I think once the sources of data are identified the rest should be relatively simple, because the interface already exists. Seangchan, can you summarise the data that's currently shown, give an example "row" and an estimate of volume and say how you access that data and we can go from there?

ticoann commented 12 years ago

sryu: Replying to [comment:4 metson]:

I think we need to take a "data driven" approach here. What data do Ops need to see, how is it currently provided, and what is its volume? Some of the drill-down stuff might rely on SSH tunnels into the agents (which IMHO is valid) if the amount of data is too high to replicate back (which I suspect is the case). Data appearing out of time is the nature of the beast (we have an eventually consistent system), so we're going to have to think about the implications of that.

The current data they (Oli) see is about 4000 requests, although ~80% of them don't need to be monitored here (we are planning to filter those out). Each request contains less than ~400 bytes (a rough estimate, and it can be reduced since there are a lot of duplicated strings), so the data size on the page can be small, if we don't include the drill-down information which resides in the local couch. But I don't know whether that number means much. (We are expecting many more requests, correct?)

The information gathered is:

  1. from the request manager: request name, status, and a link to the request itself.
  2. from the global workqueue: request name, local queue address per request.
  3. from the local queue: request name, wmbsservice address, queue (job) injection status (% of jobs injected) by request.
  4. from the wmbsservice: request name, number of jobs running on the batch system, local couch link for job status by request.
  5. from the local couch db: request name, job status by request (number and state).

plus a drill-down link for each; an illustrative combined "row" is sketched below.
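
For scale, ~4000 requests at ~400 bytes each is only on the order of 1.6 MB before filtering. A purely illustrative combined "row" could look like the dict below; the field names and values are assumptions, not the actual schema:

```python
# Illustrative only: one monitor "row" combining the five sources listed above.
example_row = {
    "request_name": "operator_MonteCarlo_110610_123456",   # request mgr
    "request_status": "running",                           # request mgr
    "request_link": "https://reqmgr.example.ch/view/...",  # request mgr
    "local_queue": "http://agent.example.ch:9996",         # global workqueue
    "wmbs_service": "http://agent.example.ch:9997",        # local queue
    "queue_injected_fraction": 0.85,                       # local queue
    "batch_jobs_running": 1240,                            # wmbsservice
    "local_couch": "http://agent.example.ch:5984/jobdump", # drill-down link
    "job_states": {"success": 900, "running": 1240,        # local couch db
                   "cooloff": 12, "failure": 3},
}
```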

I think once the sources of data are identified the rest should be relatively simple, because the interface already exists. Seangchan, can you summarise the data that's currently shown, give an example "row" and an estimate of volume and say how you access that data and we can go from there?

I will send you the link to the instance Oli is running (the old RequestMonitor, but the interface is almost identical).

ticoann commented 12 years ago

sryu: Replying to [comment:6 sryu]: With Oli's permission I am posting the link here: http://vocms144.cern.ch:8687/reqmgr/ Due to a bug it is slower than the current version (only about a 40% improvement). The other drawback is that we are retrieving data which doesn't need to be monitored (old announced/rejected requests, etc.), which is more than 80% of the total; this needs to be fixed.

ticoann commented 12 years ago

sryu: Note: Oli said the latency is acceptable, but problem states (e.g. cooloff) need a quick update (not quantified yet).

evansde77 commented 12 years ago

evansde: Don't forget there is an Alert system now. If >N jobs in cooloff is something Oli wants to watch for, then it is by far easier and faster to have the agents send messages via the alert system when that happens than to have an operator clicking a refresh button 24/7.

I would suggest you ask Oli to define the Alert conditions he wants to see in a ticket for the Alert system; it should divide this overview problem up nicely and work better in the long term.
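
As an illustration only (this is not the actual WMCore Alert system API, and the threshold is made up), the kind of condition described above amounts to a simple agent-side check:

```python
# Hypothetical sketch: the agent checks its own job counts and emits an alert
# when the number of cooloff jobs crosses a threshold. The transport
# (send_alert) is whatever the real Alert system provides; it is not modelled here.
COOLOFF_THRESHOLD = 100  # assumed value; Oli would define the real condition

def check_cooloff(job_counts, send_alert):
    """job_counts: dict mapping job state -> number of jobs."""
    n_cooloff = job_counts.get("cooloff", 0)
    if n_cooloff > COOLOFF_THRESHOLD:
        send_alert({"source": "WMAgent",
                    "condition": "cooloff_backlog",
                    "value": n_cooloff,
                    "threshold": COOLOFF_THRESHOLD})
```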

ticoann commented 12 years ago

sryu: Replying to [comment:9 evansde]:

I would suggest you ask Oli to define the Alert conditions he wants to see in a ticket for the Alert system; it should divide this overview problem up nicely and work better in the long term.

Added a ticket for this (#2470).

ghost commented 12 years ago

lat: Just a reminder from earlier exchanges not on this ticket - it needs to be such that making an HTTP request to reqmgr doesn't generate a large / unpredictable number of requests elsewhere. I.e. the aspect where an off-line process collects info from other sources -- for example those other sources push their info into the central couchdb somehow, and it's collated into a summary there.
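
One way to read the "collated into a summary there" part: install a view on the central database so page requests hit a precomputed summary instead of fanning out to other services. The sketch below (a standard CouchDB design document written as a Python dict; view and field names are assumptions) is one possible shape of that:

```python
# Hypothetical design document for the central reqmon database. The map emits
# one row per agent report; the built-in _sum reduce aggregates running-job
# counts per request when queried with group=true.
summary_design_doc = {
    "_id": "_design/ReqMon",
    "views": {
        "jobsByRequest": {
            "map": """function(doc) {
                        if (doc.request && doc.job_states) {
                          emit(doc.request, doc.job_states.running || 0);
                        }
                      }""",
            "reduce": "_sum",
        }
    },
}
```

Querying _design/ReqMon/_view/jobsByRequest?group=true would then return one aggregated row per request, with no calls out to the agents.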

ghost commented 12 years ago

lat: On an unrelated note, clicking around http://vocms144.cern.ch:8687/reqmgr I chanced to produce output which includes URLs with username/password embedded in them.

Please take this server offline immediately until the problem has been addressed, i.e. no passwords are exposed to clients. You will also have to change all affected passwords as they have now been exposed.

Please contact me offline for a description on how to get at the sensitive information.

cinquo commented 12 years ago

mcinquil: Replying to [comment:12 lat]:

On an unrelated note, clicking around http://vocms144.cern.ch:8687/reqmgr I chanced to produce output which includes URLs with username/password embedded in them.

I think this is the same issue that was reported on #2397

sfoulkes commented 12 years ago

sfoulkes: I fixed the ReqMgr not to store the couch passwords in the specs and changed the couch password.

ghost commented 12 years ago

lat: Replying to [comment:13 mcinquil]:

I think this is the same issue that was reported on #2397

Sort of, but I see a lot more URLs containing passwords. I reopened #2397 and pasted a few example URLs.

ghost commented 12 years ago

lat: Replying to [comment:14 sfoulkes]:

I fixed the ReqMgr not to store the couch passwords in the specs and changed the couch password.

Thanks, but I still see plenty of passwords around. I am not really sure if there even is "the" couchdb to fix, as reqmgr seems quite happy to pull data from various sources.

At least I find it quite easy to get sent to some random server somewhere at FNAL or CERN. It always takes me a while to realise I got sent to some other server, or that some error message from ReqMgr actually means that it was trying to contact some (dev-only?) server somewhere else, and it appears to be down.

ghost commented 12 years ago

lat: To generate plenty of passwords, go to the vocms144 reqmgr server, pick say 'announce', pick just about anything from the list, then click on the "workflow" link.

For one, that generates a link with a full HTTP URL in it. I tried various URLs (including www.google.com and xyzzy.cern.ch) and some of them go through with "404 not found" rather than "400 bad request".

For another, the resulting documents have plenty of passwords in URLs.

ghost commented 12 years ago

lat: For further clarification, I'd like to have the server shut down until it's confirmed for good that it has stopped spewing out passwords. Specifically, we can't leave it running like this over the weekend, so if it's not sorted by tomorrow afternoon, I need to ask CERN security to take it off the network. Please let's not go down that road.

evansde77 commented 12 years ago

evansde: We will tweak the settings on those agents to bind the monitoring information to localhost only. This will get done this afternoon, which will upset Ops but buy us time to put a proper fix in. Also, the passwords in those specs have already been changed, so they should be ~useless.

Will follow up later this afternoon, but you don't need to go nuclear yet.

ghost commented 12 years ago

lat: Thanks!

sfoulkes commented 12 years ago

sfoulkes: Let's move the discussion to #2397. I posted a patch there to fix the showWorkload problems; I think that was the only outstanding issue. The ReqMgr on vocms144 has been restarted and bound to the loopback interface.

ticoann commented 12 years ago

sryu: Replying to [comment:11 lat]:

Just a reminder from earlier exchanges not on this ticket - it needs to be such that making an HTTP request to reqmgr doesn't generate a large / unpredictable number of requests elsewhere. I.e. the aspect where an off-line process collects info from other sources -- for example those other sources push their info into the central couchdb somehow, and it's collated into a summary there.

Based on Simon and Lassi's suggestions I have drafted an initial design document: https://svnweb.cern.ch/trac/CMSDMWM/wiki/ReqMonDesignDraft I would appreciate comments and suggestions.

cinquo commented 12 years ago

mcinquil: Replying to [comment:22 sryu]:

Based on Simon and Lassi's suggestions I have drafted an initial design document: https://svnweb.cern.ch/trac/CMSDMWM/wiki/ReqMonDesignDraft I would appreciate comments and suggestions.

Thanks for this... a few questions below. Apart from the delays introduced by polling times and the propagation of information, are completed requests going to stay on the monitor for a while after they finish, or will only requests that are actually running be shown? How is the 'log rotate' mechanism going to work in order to remove old information from the ReqMon couch db? Then I guess that in the case of a distributed deployment it will be possible to access the local agent couch information only if the ports are open. But, more importantly: I guess that not everyone will be able to write to the central ReqMon couch (which will have to be open to the outside world), and only some specific service certificates (on the agents) will be allowed to write to it... right? Finally, is this not going to use couch db replication but rather a dedicated component on the agent that pushes the information to the ReqMon couch db?

ticoann commented 12 years ago

sryu: Replying to [comment:23 mcinquil]: Thank you for the comments.

Apart from the delays introduced by polling times and the propagation of information, are completed requests going to stay on the monitor for a while after they finish, or will only requests that are actually running be shown?

Yes, they should be; Oli wants to keep completed/announced requests for a while (< 2 weeks).

How is the 'log rotate' mechanism going to work in order to remove old information from the ReqMon couch db?

Sorry, I am not sure what you mean by the 'log rotate' mechanism. Old information will be deleted as soon as new information is available, so basically each agent overwrites its previous reports, since the information is temporal and doesn't have much meaning once new information arrives. I will need some mechanism to delete reports that stop being updated (due to an agent crash or some other reason) in a timely manner; I am not sure yet what the policy on that will be.
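
One possible cleanup policy (the actual policy is undecided here; the URL, the two-hour cutoff and the "timestamp" field are assumptions) is a periodic sweep that deletes reports that have stopped being refreshed:

```python
# Hypothetical sketch: remove summary documents that have not been refreshed
# within MAX_AGE seconds, e.g. because the reporting agent crashed.
import json
import time
import urllib.request

REQMON_COUCH_URL = "http://reqmon.example.cern.ch:5984/reqmon"  # assumed
MAX_AGE = 2 * 3600  # assumed cutoff: two hours without an update

def purge_stale_reports():
    url = REQMON_COUCH_URL + "/_all_docs?include_docs=true"
    with urllib.request.urlopen(url) as resp:
        rows = json.load(resp)["rows"]
    cutoff = time.time() - MAX_AGE
    for row in rows:
        doc = row.get("doc", {})
        # Documents without a timestamp (e.g. design docs) are left alone.
        if doc.get("timestamp", cutoff) < cutoff:
            delete_url = "%s/%s?rev=%s" % (REQMON_COUCH_URL, doc["_id"], doc["_rev"])
            req = urllib.request.Request(delete_url, method="DELETE")
            urllib.request.urlopen(req)
```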

Then I guess that in the case of a distributed deployment it will be possible to access the local agent couch information only if the ports are open. But, more importantly: I guess that not everyone will be able to write to the central ReqMon couch (which will have to be open to the outside world), and only some specific service certificates (on the agents) will be allowed to write to it... right?

Yes, that is the plan. I think the authorization list comes from SiteDB (not sure about that either).

Finally, is this not going to use couch db replication but rather a dedicated component on the agent that pushes the information to the ReqMon couch db?

Yes. I think it doesn't make much sense to use couch db replication, since the data itself is transient (only valid within a certain time period) and doesn't need to be kept in the local db (all the data is already there, just not in the same format).

sfoulkes commented 12 years ago

sfoulkes: Replying to [comment:22 sryu]:

Based on Simon and Lassi's suggestion I drafted an initial design document. https://svnweb.cern.ch/trac/CMSDMWM/wiki/ReqMonDesignDraft I will appreciate comments and suggestions.

Looks fine to me.

ghost commented 12 years ago

lat: Replying to [comment:22 sryu]:

https://svnweb.cern.ch/trac/CMSDMWM/wiki/ReqMonDesignDraft

Thanks. Could you add two pictures? The first should show boxes or circles for the sites (CERN, FNAL, ...), the hosts in them, which databases + services are hosted where, where client accesses come in, and where firewalls are set up and which way around. The other picture should show a couple of typical communication patterns, with numbered arrows showing the progress of communication, i.e. which service on which host and site talks to what in order to satisfy a particular type of user request, or an update from an upstream data source.

It would also be helpful to get an idea about deployment, especially which privileges are required and where, who is authorized to read/write/update which information, etc., but also how many instances will run at which sites, how autonomous / maintained they are, and so on. Apropos, some basic taxonomy of the information stored would also be useful.

I have some trouble reading the document, mainly because there are many sentences which do not appear to be connected to anything. To give just one example, towards the end there is the sentence "option is selected above"; I have a hard time working out what it means and what consequences it might have. I would request that you work with someone with strong editorial skills to streamline the text and make it more understandable.

In part because of the above items, I am not sure how to understand the document - I am not sure where push or pull happens, or where host, site or firewall boundaries are being crossed. I also don't understand from the document whether the service will maintain only the current state with no history, or retain some historical information, and if so, whether that historical data is a summary of some sort, aggregated in some way, etc. I am afraid I can't really say at this point if I agree with the document or not.

I would also find it useful if you simply stated some timeline guarantees from the application, and asked the other services to agree to them. For example, on the eventual consistency matter, simply state within what time limit you expect data to be consistent, and the others need to OK that (or not). We then need some summary of how the inconsistencies are dealt with -- and how the system is expected to exploit them for robustness, resilience, performance, etc.

My general feeling about the document is that it leaves many options open. I appreciate that this may be because you are not sure about various aspects, but the purpose of the document is to remove the uncertainties as much as possible before writing any code. I would find it helpful if the document were much more definite in its statements, which in turn may require you to do research, e.g. understand deployments and authorization to the extent that you can be definite about them.

drsm79 commented 12 years ago

metson: I started to write a response to this a few weeks back but never completed it. Hopefully useful comments follow; apologies if this goes over things that have already been discussed/resolved.

Part of the refactor is going to be trying to get the genie back in the bottle. There are certain things that people have maybe got used to against a test system that are not practical in a genuinely distributed environment. We could host all the agents at CERN if the requirement on monitoring is more important than the requirement to have the agents run off site. In short, I think we need to re-evaluate some of the requirements.

ticoann commented 12 years ago

sryu: Replying to [comment:27 metson]: Hi Simon, I have a few questions regarding the requirements.

  • Globally accessible secure remote access to agent REST interfaces/CouchDB instances is difficult and scary
  • This is providing monitoring information, not (more detailed) debugging information
  • Drilling down to debug problems is only as useful as your access to the machine hosting the agent (e.g. if you can't log into the box to restart a component you don't need to know the temperature of the CPUs)
  • Ops run the Agents on machines they de facto have access to, and know how to use SSH tunnels

Compared with the current monitoring, what needs to be viewed (or drilled down into) versus what information can stay behind the firewall and be accessed only by Ops who can access the machine? Maybe this is a question for data ops.

  • We shouldn't reinvent CouchDB replication - the default propagation method from an agent to the central record should be continuous replication (which should mitigate some of the latency issues), pushing from the agent record to the global one (see the sketch after these points).
  • "Log rotate" style databases can be done via the RotatingDatabase class in CMSCouch; however, I don't know how that would work with the continuous replication, so there's a conflict of interest there. Maybe the RotatingDatabase should be used for the records local to the agent, and we'll have to think further about the central record.
  • The central record should be historical information (cf. StageManager) for people to see long range trends. Debugging information is not stored centrally.
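
A sketch of the continuous push replication suggested in the first point above, using CouchDB's standard /_replicate endpoint; database names, URL and credentials are placeholders:

```python
# Hypothetical sketch: ask the agent's local CouchDB to continuously push its
# summary database to the central reqmon database.
import json
import urllib.request

LOCAL_COUCH = "http://localhost:5984"
payload = {
    "source": "wmagent_summary",                                  # assumed local DB name
    "target": "https://user:pass@reqmon.example.ch:5984/reqmon",  # placeholder central DB
    "continuous": True,  # keep pushing changes as they happen
}
req = urllib.request.Request(LOCAL_COUCH + "/_replicate",
                             data=json.dumps(payload).encode("utf-8"),
                             headers={"Content-Type": "application/json"})
urllib.request.urlopen(req)
```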

The initial requirement for the monitoring was not historical information; it was a snapshot of the current status of the requests, so I thought CouchDB replication was not necessary:

  1. The local couch db doesn't need to keep the summary snapshot.
  2. The summary data is only valid until the next one is uploaded (or within a certain time period - I don't know the exact policy on that).

If the monitoring is to deal with historical information, what information needs to be kept?