dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Fix dashboard reporting #89

Closed sfoulkes closed 11 years ago

sfoulkes commented 13 years ago

https://hypernews.cern.ch/HyperNews/CMS/get/dataops/1383.html

DMWMBot commented 13 years ago

mnorman: I've tightened up the code so that it better fits the Dashboard formatting. I've also decided more or less unilaterally to use the job name (which is UUID based) as our universal job identifier.

However, I have not implemented (and do not plan to implement for the current milestone) the code to report this from the JobAccountant, as it would a) slow down the JobAccountant, and b) probably DDoS the Dashboard, and then I would get crap from two different directions.

sfoulkes commented 13 years ago

sfoulkes: New plan:

Matt is going to come up with a list of changes he'd like made to the dashboard API.

sfoulkes commented 13 years ago

sfoulkes: Patch for first half of this attached. Dave, review?

evansde77 commented 13 years ago

evansde: Patch committed for the views & change state.

Leaving this ticket open since it seems there is some more to come.

cinquo commented 13 years ago

mcinquil: This is an example of the entry I get when looking at the new view:

{"id":"3","key":3,"value":{"index":"1","id":3,"name":"d5659174-fdfd-11df-8e31-0026b95c499b-0","requestName":"CmsRunAnalysis-provaDashboard","retryCount":0,"newState":"executing","oldState":"created"}},

If I understand correctly, the 'index' counts how many times the state has changed.

So, if I understand correctly, there will be a MonitoringReporter component running on the Agent and calling the view to find out which jobs need to be reported. Then, for those jobs, the new component queries the databases (e.g. wmbs, bossair, ...) to collect the needed information and reports it to the dashboard through the 'Services.Dashboard.API' module. In parallel, the job will report information from the WN, using the same unique job identifier (job/task/retrycount/etc). Am I understanding it right?
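To make the row quoted above easier to read, here is a minimal sketch that pulls its fields out with the standard `json` module. The function name `summarize_transition` and the output keys are made up for illustration; only the row itself comes from this ticket.

```python
import json

# One row as shown in the comment above (trailing comma from the listing removed).
ROW = (
    '{"id":"3","key":3,"value":{"index":"1","id":3,'
    '"name":"d5659174-fdfd-11df-8e31-0026b95c499b-0",'
    '"requestName":"CmsRunAnalysis-provaDashboard","retryCount":0,'
    '"newState":"executing","oldState":"created"}}'
)


def summarize_transition(row_text):
    """Return the fields of interest from one view row as a flat dict."""
    value = json.loads(row_text)["value"]
    return {
        "job": value["name"],
        "request": value["requestName"],
        "retry": value["retryCount"],
        "transition": "%s -> %s" % (value["oldState"], value["newState"]),
        # "index" arrives as a string in this row, so convert it explicitly.
        "index": int(value["index"]),
    }


summary = summarize_transition(ROW)
print(summary["transition"])
```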

sfoulkes commented 13 years ago

sfoulkes: Index is an index into the state array in the job document in couch. The MonitoringReporter shouldn't call the view, it should call listTransitionsForDashboard() in WMCore/JobStateMachine/ChangeState.py which takes care of updating couch as state transitions are reported.

Everything else is right.
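The flow described here (the reporter asks ChangeState for pending transitions instead of querying the view itself) can be sketched as below. This is not the WMCore code: `FakeChangeState` is a hypothetical stand-in, and both the return shape (a list of dicts) and the "a second call returns nothing new" behaviour are assumptions based on the description above.

```python
class FakeChangeState(object):
    """Hypothetical stand-in for ChangeState; only the method named above is mimicked."""

    def __init__(self, pending):
        self._pending = list(pending)

    def listTransitionsForDashboard(self):
        # Assumption: each call hands back the unreported transitions and
        # marks them as reported, so a second call returns nothing new.
        pending, self._pending = self._pending, []
        return pending


def report_once(change_state, send):
    """Fetch unreported transitions and pass each one to send(); return the count."""
    transitions = change_state.listTransitionsForDashboard()
    for transition in transitions:
        send(transition)
    return len(transitions)


sent = []
state = FakeChangeState([
    {"name": "d5659174-fdfd-11df-8e31-0026b95c499b-0", "retryCount": 0,
     "oldState": "created", "newState": "executing"},
])
first = report_once(state, sent.append)
second = report_once(state, sent.append)
print(first, second)
```

Passing `send` in as a callable keeps the polling logic independent of whichever client ends up talking to the dashboard.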

cinquo commented 13 years ago

mcinquil: Replying to [comment:15 sfoulkes]:

Index is an index into the state array in the job document in couch. The MonitoringReporter shouldn't call the view, it should call listTransitionsForDashboard() in WMCore/JobStateMachine/ChangeState.py which takes care of updating couch as state transitions are reported.

Everything else is right.

OK for the index. And yes, that is what I meant about the listTransitionsForDashboard method. Thanks for the explanation. So, we are on the same page, and this is the actual list of parameters being reported to the dashboard by the current CRAB2-PA systems: https://twiki.cern.ch/twiki/bin/viewauth/CMS/CompleteListOfAttributesCurrentlyReportedToDashboard Probably, if we are sure to have a unique job name from CMS, we can put less work on the component that sends the messages.

sfoulkes commented 13 years ago

sfoulkes: Currently all the WMAgent job names are UUIDs, so they'll be unique.

DMWMBot commented 13 years ago

mnorman: I think what we need here is a list of the absolute bare minimum information that the Dashboard can operate on. A lot of the information sent by the current PA either doesn't make any sense in WMAgent or is hard to find and has little benefit. We need to look at what the minimum that can be supplied is.

DMWMBot commented 13 years ago

mnorman: Preliminary DashboardReporter component prototype attached.

cinquo commented 13 years ago

mcinquil: Replying to [comment:19 mnorman]:

Preliminary DashboardReporter component prototype attached.

Before any commit of this component, I need to know whether it has been tried with the dedicated MonaLisa server available for CMS testing here at CERN. If not, can you try it and let me know, so we can check whether it works?

As for the reported information not being sufficient or coherent with what the dashboard needs... it is just a matter of defining the required flow of messages. I suggest having a unique task name (even across different agents), a unique job id (even across job resubmissions) and, at least for the first iterations, keeping the current attribute names in the messages sent (otherwise it will take some time to get it working, since it implies many changes in the logic to make it work with differently named messages). The messages that need to be sent are: 1) workflow definition (task meta) 2) job submitted (job meta) 3) job status change 4) job in WN
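As a rough illustration of the four message types listed above, the sketch below models them as an enum plus a small builder. The names `MessageKind` and `make_message` and the dict layout are hypothetical; the actual attribute names are the ones on the twiki, not these.

```python
from enum import IntEnum


class MessageKind(IntEnum):
    """The four message types listed above, in the order they are listed."""
    TASK_META = 1      # workflow definition
    JOB_META = 2       # job submitted
    STATUS_CHANGE = 3  # job status change
    WN_REPORT = 4      # job reporting from the worker node


def make_message(kind, task_name, job_id=None, **attributes):
    """Build a plain dict for one message; only task-level messages may omit the job id."""
    if kind is not MessageKind.TASK_META and job_id is None:
        raise ValueError("%s needs a job id" % kind.name)
    message = {"kind": kind.name, "task": task_name}
    if job_id is not None:
        message["job"] = job_id
    message.update(attributes)
    return message


msg = make_message(MessageKind.STATUS_CHANGE, "task-X", job_id="job-1", newState="executing")
print(msg)
```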

I made a detailed example of what needs to be reported to get it working, with the main information reported and visible on the dashboard services. This is the set of attributes: https://twiki.cern.ch/twiki/bin/view/CMS/CompleteListOfAttributesReportedToDashboardByWMAgent#Set_of_attributes_that_should_be

Then, once we have it working, we can fill in the table in that twiki (so I have left the WMAgent columns empty for the moment).

evansde77 commented 13 years ago

evansde: Couple of initial comments related to the stuff on the TWiki.

  1. Request name = task name. Also contains user defined fields, so having things parsed on underscores etc will break when someone comes up with _My-Request_That-Contains@SpecialCharacters.
  2. Why should the task be unique based on agent instance? The request as a whole is what people will care about, the agents actually running it are just meta information.
  3. Job IDs are globally unique UUID based things in WMAgent anyway. Including the retry count is fine, but IMO you would be better off just using /request/jobID as the job name and packing everything else like retry count and wmagent name etc into the key value stuff.
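Point 3 can be sketched as follows: the job name is just `/request/jobID`, and everything else travels as key/value attributes, so nothing ever has to be parsed back out of the name. The function name, the attribute keys and the agent name `agent-1` are illustrative assumptions, not an agreed schema.

```python
def build_job_record(request_name, job_id, retry_count, agent_name):
    """Return (job_name, attributes) without encoding metadata in the name."""
    job_name = "/%s/%s" % (request_name, job_id)
    attributes = {"retryCount": retry_count, "wmagentName": agent_name}
    return job_name, attributes


request = "_My-Request_That-Contains@SpecialCharacters"
name, attrs = build_job_record(
    request, "d5659174-fdfd-11df-8e31-0026b95c499b", 0, "agent-1")

# Splitting on underscores cannot recover the request name...
print(name.split("_"))
# ...while the metadata is available directly, with no parsing at all.
print(attrs["retryCount"])
```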

I realise that this means changes on the dashboard side, and we should "respect the existing infrastructure" but this is a good time to sanitise stuff and make changes as analysis, production etc all converge on the same set of conventions. Some of the features of the dashboard currently in existence (like underscores in the right places) have caused a bunch of problems in the past and should be addressed to make the system more stable/robust going forward.

There are several assumptions the dashboard has made about job structures etc that persist in the initial list of parameters you have. What if a job runs more than one CMSSW version? What if it has more than one input dataset? What happens if we submit with a whitelist of sites? How do the different steps report in as they run?

I think starting from a blank slate with some of this stuff has some pretty major advantages. In the past the dashboard has tried to dictate what a job looks like to us and it has caused a bunch of issues. Could we perhaps start with defining what the job process is, and working from that to generate the list of appropriate metadata to send and when? Otherwise we end up with the same kind of stalemate we have had in the production system for the last couple of years, and you are going to be slap bang in the middle of it ;-)

cinquo commented 13 years ago

mcinquil: Replying to [comment:21 evansde]:

Couple of initial comments related to the stuff on the TWiki.

  1. Request name = task name. Also contains user defined fields, so having things parsed on underscores etc will break when someone comes up with _My-Request_That-Contains@SpecialCharacters.
  2. Why should the task be unique based on agent instance? The request as a whole is what people will care about, the agents actually running it are just meta information.

Because there is the need to map a CMS workflow/task to a dashboard workflow/task. You need a unique identifier, and a unique name is a good idea (as far as I am concerned it can also be an id, but I do not expect the workflow id in the database to be a globally unique identifier). I agree and understand (parsing on underscores is indeed not a very stable solution); anyway, for me it is not a problem if we also want to make the workflow/task name a UUID.

  1. Job IDs are globally unique UUID based things in WMAgent anyway. Including the retry count is fine, but IMO you would be better off just using /request/jobID as the job name and packing everything else like retry count and wmagent name etc into the key value stuff.

As above, I agree that parsing is not a good approach. In fact DBoard just needs a unique name to identify the job (I think having the retry count at the end of the job name has some logic), and DBoard also needs a dedicated attribute for the retry count. Moreover, I just realized that I forgot to add the job id and the retry count to the job meta information. They are now included in the twiki.

I realise that this means changes on the dashboard side, and we should "respect the existing infrastructure" but this is a good time to sanitise stuff and make changes as analysis, production etc all converge on the same set of conventions. Some of the features of the dashboard currently in existence (like underscores in the right places) have caused a bunch of problems in the past and should be addressed to make the system more stable/robust going forward.

I see and understand the point. I agree with you that we cannot completely force the applications (well, probably some standards should be respected) to report to the monitoring. Keep in mind that an approach closer to WMAgent is possible. This is because the CMS interface to the dashboard is being re-engineered, so now is probably a good time to give input on what is needed, and there are probably handles to implement those requirements.

There are several assumptions the dashboard has made about job structures etc that persist in the initial list of parameters you have. What if a job runs more than one CMSSW version?

Do you mean the same job on the worker node running two different CMSSW versions, or resubmissions of the same job?

What if it has more than one input dataset?

If this has to be handled, it is good input for what the next version needs (if it is needed now, I can ask for a way to get it working).

What happens if we submit with a whitelist of sites?

This should be already available (CRAB is using this). JobMeta information includes TargetCE attribute that can be a comma separated list.

How do the different steps report in as they run?

This is being handled. It is probably not a 1-to-1 mapping with the concept of steps, but I see that cmsRun execution time is being handled, and the same for stageout (for WMAgent we probably need something for the logarchive).

I think starting from a blank slate with some of this stuff has some pretty major advantages. In the past the dashboard has tried to dictate what a job looks like to us and it has caused a bunch of issues. Could we perhaps start with defining what the job process is, and working from that to generate the list of appropriate metadata to send and when? Otherwise we end up with the same kind of stalemate we have had in the production system for the last couple of years, and you are going to be slap bang in the middle of it ;-)

As I said before and on the twiki, I think that if we want to have it working in less than 1 month, with the Christmas vacations in the middle, we cannot start rewriting everything from scratch. My proposal is to have something that can start working soon in order to show WMAgent jobs in a consistent way (this is the message I understood from the dmwm-integration-dataops meeting). Then, while we do this, we can understand what is needed and what is missing, giving good input for the next generation of the interface between the WM tools and the DBoard database, where we can really start from a ~blank situation (keeping some constraints from the architecture and the system). If you agree on this, we might start from the schema proposed on the twiki, adding to the notes (and at some point to a dedicated twiki) the input for the next re-engineering (which is not something that needs 1 week of work...).

evansde77 commented 13 years ago

evansde: Replying to [comment:22 mcinquil]:

Replying to [comment:21 evansde]:

Couple of initial comments related to the stuff on the TWiki.

  1. Request name = task name. Also contains user defined fields, so having things parsed on underscores etc will break when someone comes up with _My-Request_That-Contains@SpecialCharacters.
  2. Why should the task be unique based on agent instance? The request as a whole is what people will care about, the agents actually running it are just meta information.

Because there is the need to map a CMS workflow/task to a dashboard workflow/task. You need a unique identifier and an unique name is a good idea (up to me it can also be an id, but I do not expect the workflow id in the database to be a general unique identifier). I agree and understand (also that parsing the underscore is a not very stable solution), anyway up to me it is not a problem if we want to make also tht workflow/task-name an UUID.

OK, I just think that if the CMS workload ID is the same as the dashboard workload ID you end up with a more coherent picture across the entire system. Eg, user submits request X, monitors X, looks for output from X where X is the same identifier across the entire system means it will be more usable.

I think requiring that the workload ID be globally unique is a good thing (ReqMgr does it, and it's easy to add a GUID for systems that don't go through the ReqMgr) & something we should ensure in the WM System. If we can then use that for the dashboard cluster/task ID we are making progress.

  1. Job IDs are globally unique UUID based things in WMAgent anyway. Including the retry count is fine, but IMO you would be better off just using /request/jobID as the job name and packing everything else like retry count and wmagent name etc into the key value stuff.

As above, I agree that parsing is not the good way. In fact DBoard just need a unique name for the job to identify it in a unique way (I think the retrycount a the end of the job name has some logic) and DBoard also needs a dedicate attribute for the retry count.

OK (although I still think the idea of a resubmitted job having the same ID but going through a state cycle with retry count incrementing is a more accurate reflection of what is happening IRL)

Moreover, I just realized that I forgot to add the id of the jobs and the retry count in the job meta information. Now it is included in the twiki.

I realise that this means changes on the dashboard side, and we should "respect the existing infrastructure" but this is a good time to sanitise stuff and make changes as analysis, production etc all converge on the same set of conventions. Some of the features of the dashboard currently in existence (like underscores in the right places) have caused a bunch of problems in the past and should be addressed to make the system more stable/robust going forward.

I see and understand the point. I agree with you that we cannot completely force the applications (well, probably some standards should be respected) to report to the monitoring. Keep in mind that having an approach closer to WMAgent is possible. This can be done because the CMS interface to dashboard is being re-engineered and probably now it is a good time to receive inputs on what is needed and probably there are handles to implement the needed requirements.

Yup, we should document the structure and lifecycle of the WMAgent job for this. Eg, Workloads/Tasks/Jobs/Steps, the JSM etc. and pinpoint each dashboard transaction.

There are several assumptions the dashboard has made about job structures etc that persist in the initial list of parameters you have. What if a job runs more than one CMSSW version?

Is the same job on the Working Node running on two different CMSSW versions or resubmissions of the same job?

Chained jobs can do this. Or even independent cmsRun jobs can be packed into the same WN job. (Things with the multicore stuff will complicate this as well)

What if it has more than one input dataset?

If this has to be handled, this can be a good input of what is needed to have in the next version (if it is something needed now I can ask for a way to have it working).

There is already the two-file-problem, then stuff like pileup etc. (One of the big reasons for the dataset report is so that people can track data access at sites)

What happens if we submit with a whitelist of sites?

This should be already available (CRAB is using this). JobMeta information includes TargetCE attribute that can be a comma separated list.

Cool!

How do the different steps report in as they run?

This is being handled. Probably it is not 1 to 1 mapping with the concept of steps, but information about cmsRun execution time I see that are being handled and the same for stageout (probably for wmagent we need something for the logarchive).

I think having the concept that the job is a series of distinct steps without assuming that each job has one cmsRun, one stage out step etc is the way to go. In the past when we have introduced complex structure into the job, we have caused all kinds of issues for the dashboard, so it makes sense to bake it in at the beginning IMO.

I think starting from a blank slate with some of this stuff has some pretty major advantages. In the past the dashboard has tried to dictate what a job looks like to us and it has caused a bunch of issues. Could we perhaps start with defining what the job process is, and working from that to generate the list of appropriate metadata to send and when? Otherwise we end up with the same kind of stalemate we have had in the production system for the last couple of years, and you are going to be slap bang in the middle of it ;-)

As said before and on the twiki I think that if we want to have it working in less then 1 month with Christmas vacations on the middle we cannot start rewriting everything from the start. My proposal is to have something that can start to work soon in order to show WMAgent jobs in a consistent way (this is the message I understood form the dmwm-integration-dataops meeting).

OK, can we converge on the cluster/node ID format in this time? I think that's probably the biggest piece to begin with. The rest is just dictionary fields and attributes, which we can pad with junk in the meantime.

Then, while we do this we can understand what's needed and what is missing, giving a good input for the next generation of the interface between WM tools and DBoard database, where we can really start from ~blank situation (keeping some constraint with the architecture and with the system). If you agree on this we might start from the schema proposed on the twiki adding on the notes (and at some point a dedicated twiki) the input for the next re-engineering (that is not something that needs 1 week of work...).

OK.

DMWMBot commented 13 years ago

mnorman: Regardless, I need to know which fields, out of all those data fields, are actually necessary and required. I also need to know what data type is expected.

DMWMBot commented 13 years ago

mnorman: Added new patch to match paradigm described in Mattia's twiki page. A lot of fields are right now set to "NotAvailable".

Using this as something so we can go forward, not as a precursor to the permanent solution, but January 15th is close enough. Did not resolve the unique JobID issue.

To run this you need the following:

New component section:

    config.section_("DashboardReporter")
    config.DashboardReporter.dashboardHost = "TESTADDRESS"
    config.DashboardReporter.dashboardPort = 8884

Updated workload info: monitoring.DashboardMonitor.destinationHost = "TESTADDRESS"

Mattia, I need you to send us the address of the test instance we should be using.

sfoulkes commented 13 years ago

sfoulkes: config.section_("DashboardReporter")

Shouldn't that be config.component_("DashboardReporter")?

cinquo commented 13 years ago

mcinquil: Replying to [comment:23 evansde]:

Replying to [comment:22 mcinquil]:

Replying to [comment:21 evansde]:

Couple of initial comments related to the stuff on the TWiki.

  1. Request name = task name. Also contains user defined fields, so having things parsed on underscores etc will break when someone comes up with _My-Request_That-Contains@SpecialCharacters.
  2. Why should the task be unique based on agent instance? The request as a whole is what people will care about, the agents actually running it are just meta information.

Because there is the need to map a CMS workflow/task to a dashboard workflow/task. You need a unique identifier and an unique name is a good idea (up to me it can also be an id, but I do not expect the workflow id in the database to be a general unique identifier). I agree and understand (also that parsing the underscore is a not very stable solution), anyway up to me it is not a problem if we want to make also tht workflow/task-name an UUID.

OK, I just think that if the CMS workload ID is the same as the dashboard workload ID you end up with a more coherent picture across the entire system. Eg, user submits request X, monitors X, looks for output from X where X is the same identifier across the entire system means it will be more usable.

I think requiring that the workload ID be globally unique is a good thing (ReqMgr does it, and its easy to add a GUID for systems that dont go through the ReqMgr) & something we should ensure in the WM System. If we can then use that for the dashboard cluster/task ID we are making progress.

Yes, the X should always be the same and then the integrity will be respected even in dashboard. Can you give me an example of the ReqMgr workload ID ?

  1. Job IDs are globally unique UUID based things in WMAgent anyway. Including the retry count is fine, but IMO you would be better off just using /request/jobID as the job name and packing everything else like retry count and wmagent name etc into the key value stuff.

As above, I agree that parsing is not the good way. In fact DBoard just need a unique name for the job to identify it in a unique way (I think the retrycount a the end of the job name has some logic) and DBoard also needs a dedicate attribute for the retry count.

OK (although I still think the idea of a resubmitted job having the same ID but going through a state cycle with retry count incrementing is a more accurate reflection of what is happening IRL)

Having a new row in the database has the advantage of keeping the history of the resubmissions. This is something that can potentially be improved in the next version.

Moreover, I just realized that I forgot to add the id of the jobs and the retry count in the job meta information. Now it is included in the twiki.

I realise that this means changes on the dashboard side, and we should "respect the existing infrastructure" but this is a good time to sanitise stuff and make changes as analysis, production etc all converge on the same set of conventions. Some of the features of the dashboard currently in existence (like underscores in the right places) have caused a bunch of problems in the past and should be addressed to make the system more stable/robust going forward.

I see and understand the point. I agree with you that we cannot completely force the applications (well, probably some standards should be respected) to report to the monitoring. Keep in mind that having an approach closer to WMAgent is possible. This can be done because the CMS interface to dashboard is being re-engineered and probably now it is a good time to receive inputs on what is needed and probably there are handles to implement the needed requirements.

Yup, we should document the structure and lifecycle of the WMAgent job for this. Eg, Workloads/Tasks/Jobs/Steps, the JSM etc. and pinpoint each dashboard transaction.

OK, this is something that is not in the current monitoring. But it can be part of the requirements that CMS WM asks of DBoard.

There are several assumptions the dashboard has made about job structures etc that persist in the initial list of parameters you have. What if a job runs more than one CMSSW version?

Is the same job on the Working Node running on two different CMSSW versions or resubmissions of the same job?

Chained jobs can do this. Or even independent cmsRun jobs can be packed into the same WN job. (Things with the multicore stuff will complicate this as well)

OK. I see a multicore job as a set of jobs with the same scheduler id, probably with some other information in common and some not.

What if it has more than one input dataset?

If this has to be handled, this can be a good input of what is needed to have in the next version (if it is something needed now I can ask for a way to have it working).

There is already the two-file-problem, then stuff like pileup etc. (One of the big reasons for the dataset report is so that people can track data access at sites)

Ok...

What happens if we submit with a whitelist of sites?

This should be already available (CRAB is using this). JobMeta information includes TargetCE attribute that can be a comma separated list.

Cool!

How do the different steps report in as they run?

This is being handled. Probably it is not 1 to 1 mapping with the concept of steps, but information about cmsRun execution time I see that are being handled and the same for stageout (probably for wmagent we need something for the logarchive).

I think having the concept that the job is a series of distinct steps without assuming that each job has one cmsRun, one stage out step etc is the way to go. In the past when we have introduced complex structure into the job, we have caused all kinds of issues for the dashboard, so it makes sense to bake it in at the beginning IMO.

I think starting from a blank slate with some of this stuff has some pretty major advantages. In the past the dashboard has tried to dictate what a job looks like to us and it has caused a bunch of issues. Could we perhaps start with defining what the job process is, and working from that to generate the list of appropriate metadata to send and when? Otherwise we end up with the same kind of stalemate we have had in the production system for the last couple of years, and you are going to be slap bang in the middle of it ;-)

As said before and on the twiki I think that if we want to have it working in less then 1 month with Christmas vacations on the middle we cannot start rewriting everything from the start. My proposal is to have something that can start to work soon in order to show WMAgent jobs in a consistent way (this is the message I understood form the dmwm-integration-dataops meeting).

OK, can we converge on the cluster/node ID format in this time? I think thats probably the biggest piece to begin with. The rest is just dictionary fields and attributes, which we can pad with junk in the meantime.

Yes.

Then, while we do this we can understand what's needed and what is missing, giving a good input for the next generation of the interface between WM tools and DBoard database, where we can really start from ~blank situation (keeping some constraint with the architecture and with the system). If you agree on this we might start from the schema proposed on the twiki adding on the notes (and at some point a dedicated twiki) the input for the next re-engineering (that is not something that needs 1 week of work...).

OK.

sfoulkes commented 13 years ago

sfoulkes: "Moreover, as probably Julia has already told you, there is a dedicated instance where to send messages (it should be dashboard08), so it will be easier to debug."

cinquo commented 13 years ago

mcinquil: Replying to [comment:25 mnorman]:

Added new patch to match paradigm described in Mattia's twiki page. A lot of fields are right now set to "NotAvailable".

OK, this is enough to test the WMAgent Dashboard interface... From what I see, the UDP client has been rewritten instead of using the MonaLisa client (apmon, or the one used by ProdAgent). Even if this works, I think it will certainly require more time for testing and getting it to work properly.

Using this as something so we can go forward, not as a precursor to the permanent solution, but January 15th is close enough. Did not resolve the unique JobID issue.

Simply add a 'wmagent' in front of the job name and a 'retrycount' at the end of the job name. I suggest you do not start until this is fixed. Be careful that what

To run this you need the following:

New component section:

    config.section_("DashboardReporter")
    config.DashboardReporter.dashboardHost = "TESTADDRESS"
    config.DashboardReporter.dashboardPort = 8884

Updated workload info: monitoring.DashboardMonitor.destinationHost = "TESTADDRESS"

Mattia, I need you to send us the address of the test instance we should be using.

dashboard08.cern.ch

DMWMBot commented 13 years ago

mnorman: Replying to [comment:26 sfoulkes]:

config.section_("DashboardReporter")

Shouldn't that be config.component_("DashboardReporter")?

Yeah. This is what I get for testing without using the Harness.

cinquo commented 13 years ago

mcinquil: Replying to [comment:29 mcinquil]:

Simply add a 'wmagent' in front of the job name and a 'retrycount' at the end of the job name. I suggest you do not start until this is fixed. Be careful that what

I was saying 'be careful that what' you send from the agent is the same that you send from the WN.
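One way to guarantee that the agent and the WN send the same name is to build it in a single shared function, as in this minimal sketch. The function name is hypothetical, and so are two details the ticket does not settle: the underscore separator, and whether 'wmagent' is a literal prefix (as assumed here) or the name of the agent.

```python
def dashboard_job_name(job_name, retry_count, prefix="wmagent", sep="_"):
    """Prefix the job name and append the retry count.

    The separator is an assumption; the thread only fixes the order of the parts.
    """
    return sep.join([prefix, job_name, str(retry_count)])


# Both reporting paths call the same function with the same inputs.
from_agent = dashboard_job_name("d5659174-fdfd-11df-8e31-0026b95c499b-0", 1)
from_worker_node = dashboard_job_name("d5659174-fdfd-11df-8e31-0026b95c499b-0", 1)
print(from_agent)
```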

DMWMBot commented 13 years ago

mnorman: Added patch to add retryCount to job name.

DMWMBot commented 13 years ago

mnorman: Add patch for Step reporting

sfoulkes commented 13 years ago

sfoulkes: I just ran a small Reco workflow with the dashboard reporting stuff turned on and pointed at Mattia's test dashboard instance. Request name was sfoulkes_101210_153648, it was a total of 23 jobs.

cinquo commented 13 years ago

mcinquil: Replying to [comment:34 sfoulkes]:

I just ran a small Reco workflow with the dashboard reporting stuff turned on and pointed at Mattia's test dashboard instance. Request name was sfoulkes_101210_153648, it was a total of 23 jobs.

No UDP packet has been received since Steve's last message. I see that the DashboardInterface class aims to be a MonaLisa client. As I pointed out earlier on this ticket, this might not be trivial to rewrite from scratch. The Dashboard team also strongly suggests using only the Apmon client to send messages to the MonaLisa server. So, just to be sure that we can have something working, I changed the Poller class to use Apmon and ran some successful tests: messages are being received correctly by the MonaLisa server. I am going to submit a patch in a few minutes which should at least work for sending messages to the MonaLisa server. The code in the patch is just an example, and it has some hardcoded values that would need to be made easy to change if it gets integrated into svn. I suggest not changing the apmon.py, Logger.py and ProcInfo.py files (while you might want to change the other Python 'wrapper' files).
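When no packets show up at the far end, it can help to first confirm that the sending code emits datagrams at all. The sketch below does that on the loopback interface with the standard `socket` module. It sends plain bytes, not the Apmon/MonaLisa wire format, and it says nothing about firewalls or routing between the agent and the server; the function name and payload are made up.

```python
import socket


def udp_round_trip(payload, timeout=2.0):
    """Send payload to a listener on 127.0.0.1 and return what the listener received."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.settimeout(timeout)
        sender.sendto(payload, listener.getsockname())
        data, _addr = listener.recvfrom(4096)
        return data
    finally:
        sender.close()
        listener.close()


received = udp_round_trip(b"jobState=executing")
print(received)
```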

cinquo commented 13 years ago

mcinquil: Please Review

sfoulkes commented 13 years ago

sfoulkes: ApmonIf pulls in a "DashboardAPI" module that we don't have in WMCore, is this some sort of external dependency?

cinquo commented 13 years ago

mcinquil: Replying to [comment:21 evansde]:

I realise that this means changes on the dashboard side, and we should "respect the existing infrastructure" but this is a good time to sanitise stuff and make changes as analysis, production etc all converge on the same set of conventions. Some of the features of the dashboard currently in existence (like underscores in the right places) have caused a bunch of problems in the past and should be addressed to make the system more stable/robust going forward.

There are several assumptions the dashboard has made about job structures etc that persist in the initial list of parameters you have. What if a job runs more than one CMSSW version? What if it has more than one input dataset? What happens if we submit with a whitelist of sites? How do the different steps report in as they run?

I think starting from a blank slate with some of this stuff has some pretty major advantages. In the past the dashboard has tried to dictate what a job looks like to us and it has caused a bunch of issues. Could we perhaps start with defining what the job process is, and working from that to generate the list of appropriate metadata to send and when? Otherwise we end up with the same kind of stalemate we have had in the production system for the last couple of years, and you are going to be slap bang in the middle of it ;-)

I would suggest that DMWM provide a detailed twiki describing what needs to be monitored for WMAgent by the dashboard, along the lines of the examples you have given in this ticket. From this an iteration can start and it will be possible to see what makes sense to add. The dashboard team will then think about the best way of integrating those requirements from the collector, database and, finally, interface points of view. With this approach we might end up with what was addressed in the Summary of the monitoring review.

cinquo commented 13 years ago

mcinquil: Please Review

cinquo commented 13 years ago

mcinquil: Replying to [comment:37 sfoulkes]:

ApmonIf pulls in a "DashboardAPI" module that we don't have in WMCore, is this some sort of external dependency?

Sorry, I forgot a git add of that file. Now it should have been correctly added in the last patch.

DMWMBot commented 13 years ago

mnorman: Some comments:

1) The code definitely needs to be changed so that we can pass the dashboard address in via config. I don't think we want to risk hard-coding that.

2) Can we remove some of the pieces we don't use from the DashboardAPI? In particular I want to get rid of the logger: we've got enough problems with the log files we've got without it writing a separate log to some random directory (I can't guarantee working dir).

3) What's the point of ApmonIf()? It looks like all it does is pass things directly to DashboardAPI. Shouldn't we just call DashboardAPI directly?

4) ApMon looks fairly heavy. How much of it can we get rid of, since we hardly use any of that functionality?

evansde77 commented 13 years ago

evansde: A couple of comments/bit of history.

apmon.py was basically a command line tool that didn't work well as a library back in the day when I first used it in the PA, which is why we ended up with our own UDP packet maker and broadcaster.

If you want to switch from our own stuff to apmon.py, then apmon.py needs to be packaged as an external and we need to ensure that we have a set of unittests that make sure it behaves in a threaded environment.

sfoulkes commented 13 years ago

sfoulkes: As for 1, we need to remove the dashboard address from the workflow generation code and move it somewhere else. I have no idea how we're going to get it into the runtime stuff without hardcoding it...

evansde77 commented 13 years ago

evansde: Replying to [comment:43 sfoulkes]:

As for 1, we need to remove the dashboard address from the workflow generation code and move it somewhere else. I have no idea how we're going to get it into the runtime stuff without hardcoding it...

The WMWorkload exists on the WN and gets loaded by the TaskSpace stuff IIRC... should be there.

sfoulkes commented 13 years ago

sfoulkes: The question is how do we get it into the workload. Do we have a configuration option in the ReqMgr config that adds this value to the workload when it's created? This probably isn't a big deal, once the bugs are worked out we'll always be reporting to the same dashboard server.

cinquo commented 13 years ago

mcinquil: Replying to [comment:41 mnorman]:

Some comments:

1) The code definitely needs to be changed so that we can pass the dashboard address in via config. I don't think we want to risk hard-coding that.

Obviously the parameters have to be configurable; mine was just an example to check whether things work with it.

2) Can we remove some of the pieces we don't use from the DashboardAPI? In particular I want to get rid of the logger: we've got enough problems with the log files we've got without it writing a separate log to some random directory (I can't guarantee working dir).

yes

3) What's the point of ApmonIf()? It looks like all it does is pass things directly to DashboardAPI. Shouldn't we just call DashboardAPI directly?

Probably it can be removed (it was doing some other kind of check in case some information was missing, but this can be already assured by the component).

4) ApMon looks fairly heavy. How much of it can we get rid of, since we hardly use any of that functionality?

I do not understand what you mean by heavy: too long? Too many parameters? If you are worried about performance, then I suggest you look through it and remove what is not useful (without breaking the current functionality).

evansde77 commented 13 years ago

evansde: Replying to [comment:45 sfoulkes]:

The question is how do we get it into the workload. Do we have a configuration option in the ReqMgr config that adds this value to the workload when it's created? This probably isn't a big deal, once the bugs are worked out we'll always be reporting to the same dashboard server.

Should be a sensible default in stdspec, possibly with ability to override if needed (not sure of the need?) very similar to what we do with DBS URLs. (Eg: default to DBSGlobal URL)
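As a rough illustration of the "sensible default with optional override" idea, something like the sketch below could work. Everything in it is hypothetical: `DEFAULT_DASHBOARD`, `dashboard_settings`, the `DashboardHost`/`DashboardPort` argument names and the host and port values are placeholders, not the actual StdSpecs or ReqMgr API.

```python
# Placeholder defaults; the real dashboard address is not given in this ticket.
DEFAULT_DASHBOARD = {"host": "dashboard.example.org", "port": 8884}


def dashboard_settings(request_args):
    """Return the dashboard host/port, letting request arguments override the defaults."""
    settings = dict(DEFAULT_DASHBOARD)  # copy so the defaults are never mutated
    if request_args.get("DashboardHost"):
        settings["host"] = request_args["DashboardHost"]
    if request_args.get("DashboardPort"):
        settings["port"] = int(request_args["DashboardPort"])
    return settings
```

The resulting values would then be written into the workload at creation time, so the runtime code on the WN reads them from there rather than from a hard-coded constant.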

DMWMBot commented 13 years ago

mnorman: Replying to [comment:46 mcinquil]:

Replying to [comment:41 mnorman]:

4) ApMon looks fairly heavy. How much of it can we get rid of, since we hardly use any of that functionality?

I do not understand what you mean by heavy: too long? Too many parameters? If you are worried about performance, then I suggest you look through it and remove what is not useful (without breaking the current functionality).

I think my question is: What does ApMon give us that ApMonLite doesn't, or for that matter that my stripped down interface link doesn't give us?

I think we're going to have to rewrite it anyway to get rid of all calls to threading.Lock() and threading.Event() (which I don't dare touch). Is it worth it to rewrite ApMon at this point? What does it give us that a simple socket.connect doesn't?

drsm79 commented 13 years ago

metson: Replying to [comment:48 mnorman]:

I think my question is: What does ApMon give us that ApMonLite doesn't, or for that matter that my stripped down interface link doesn't give us?

Well it sounds like one thing is that it works now...

Replying to [comment:48 mnorman]:

I think we're going to have to rewrite it anyway to get rid of all calls to threading.Lock() and threading.Event() (which I don't dare touch). Is it worth it to rewrite ApMon at this point? What does it give us that a simple socket.connect doesn't?

I think that's a sensible long term goal, but there's a shorter term issue which is getting something (anything!) reported to Dashboard by ~mid January. I think for that the pragmatic approach of using ApMon that Mattia has suggested is sensible. We then see how that breaks, the issues it raises and then have information on what the rewrite of ApMon needs to do.

spigad commented 13 years ago

spiga: Replying to [comment:48 mnorman]:

I think we're going to have to rewrite it anyway to get rid of all calls to threading.Lock() and threading.Event() (which I don't dare touch). Is it worth it to rewrite ApMon at this point? What does it give us that a simple socket.connect doesn't?

I think that's a sensible long term goal, but there's a shorter term issue which is getting something (anything!) reported to Dashboard by ~mid January. I think for that the pragmatic approach of using ApMon that Mattia has suggested is sensible. We then see how that breaks, the issues it raises and then have information on what the rewrite of ApMon needs to do.

That sounds like a pragmatic approach which may help now.

DMWMBot commented 13 years ago

mnorman: Replying to [comment:49 metson]:

Replying to [comment:48 mnorman]:

I think my question is: What does ApMon give us that ApMonLite doesn't, or for that matter that my stripped down interface link doesn't give us?

Well it sounds like one thing is that it works now...

Replying to [comment:48 mnorman]:

I think we're going to have to rewrite it anyway to get rid of all calls to threading.Lock() and threading.Event() (which I don't dare touch). Is it worth it to rewrite ApMon at this point? What does it give us that a simple socket.connect doesn't?

I think that's a sensible long term goal, but there's a shorter term issue which is getting something (anything!) reported to Dashboard by ~mid January. I think for that the pragmatic approach of using ApMon that Mattia has suggested is sensible. We then see how that breaks, the issues it raises and then have information on what the rewrite of ApMon needs to do.

1) ApMonLite works right now, because it's what we're using on the worker node side (ApMon is too heavy to run on the worker nodes). It already has a WMCore implementation, and Julia has already complained about it (which means that it works). It also runs the ProdAgent as near as I can tell. We're not debating between one working and one prototype, we're debating between two working systems.

2) The time we save by working on ApMon will probably be lost in rewriting the entire WMCore component infrastructure. Spawning threads within threads is not guaranteed to be safe. Spawning threads from within a thread that subsequently call lock() is fundamentally unsound; it's my understanding that dual-layer threading requires substantial rewrites of the python global interpreter. Of course, it might work as long as the Harness/WorkerThread system doesn't attempt to call lock() itself; I'm unclear about what happens when you violate the system paradigm so vigorously.

But I am not at all convinced that we should be introducing fundamentally unsound software into the stack when we have working, viable alternatives. This is why I want to know what ApMon gives us besides a lot of rewriting work that has to be done later.

evansde77 commented 13 years ago

evansde: I went and looked into the patch and looked at the apmon.py stuff, and my opinion is unchanged.

If we are going to use it the UDP packet build & send needs to come out into its own library, there is just too much baggage in there.

If apmon.py can be refactored to move the UDP building & sending into a standalone library with no threading, logging (threaded logging????) or ProcInfo dependencies, then we can use it. Now that the ProcInfo stuff has been factored out there shouldn't be any objections along the lines of keeping it as a single script.
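To make the shape of such a standalone library concrete, here is a minimal sketch with packet building and sending split into two plain functions and no threading, logging or ProcInfo imports. The function names are made up for illustration, and the length-prefixed encoding is an invented stand-in: it is not the wire format that apmon.py or the MonaLisa server actually use.

```python
import struct


def _pack_str(text):
    """Encode a string as a 4-byte big-endian length followed by UTF-8 bytes."""
    raw = text.encode("utf-8")
    return struct.pack("!I", len(raw)) + raw


def build_packet(cluster, node, params):
    """Pure function: serialise one report into bytes (illustrative format only)."""
    body = _pack_str(cluster) + _pack_str(node) + struct.pack("!I", len(params))
    for name in sorted(params):  # sorted so the output is deterministic
        body += _pack_str(name) + _pack_str(str(params[name]))
    return body


def send_packet(sock, packet, address):
    """Send one packet on a UDP socket owned by the caller; returns bytes sent."""
    return sock.sendto(packet, address)
```

Because `build_packet` has no side effects it can be unit tested without a network, and because the caller owns the socket there is no module-level state that would need a lock.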

Otherwise we are at the same impasse we were a couple of years ago.

cinquo commented 13 years ago

mcinquil: DashboardInterface is the solution provided by Matt's patch in this ticket and committed into svn. It is another UDP client (no ApMon, no ApMonLite) that runs fine as a WM component (trying to send NULL attributes), but it simply does not send valid messages to the MonaLisa server (no messages are received by the server), which is the job it is supposed to do. It has been tested by Steve and then by me, without any success. I then proposed a working example (working at least on my WMAgent instance) using the ApMon client, the official client for the MonaLisa server. Has anyone else tried it? There are reasons for using an official version:

Then, if you have another solution that works well (ApMonLite?), why are we writing this again from scratch and wasting people's time on testing the DashboardInterface UDP client? (We lost more than a week on this just before the Christmas vacation, which has now arrived, and the code in svn is still not working.) And why isn't the DashboardReporter component in svn using ApMon(Lite) instead of DashboardInterface?

Dave, I think you can put the problems that you see with having the official ApMon client in WMAgent (logging? threading?) in the dashboard requirements twiki that was requested from you last week. Error messages or even tracebacks that you got while trying ApMon are also good feedback. Please send them directly to Julia, cc'ing me (she already knows about WMAgent). I am sure the Dashboard team can then look at it together with the MonaLisa experts and, if needed, provide a fixed version of the ApMon client (just commenting out the thread calls? Note: there shouldn't be any multithreaded execution in the example I proposed).

In the meantime, as I also said to Dave last week, it may be worth being pragmatic and starting to push information to the dashboard server with a working solution, and also starting to send non-NULL information. From that point the work on the dashboard side can start.

cinquo commented 13 years ago

mcinquil: Replying to [comment:49 metson]:

I think that's a sensible long term goal, but there's a shorter term issue which is getting something (anything!) reported to Dashboard by ~mid January. I think for that the pragmatic approach of using ApMon that Mattia has suggested is sensible. We then see how that breaks, the issues it raises and then have information on what the rewrite of ApMon needs to do.

I fully agree with this.

evansde77 commented 13 years ago

evansde: Requirements Document: https://twiki.cern.ch/twiki/bin/view/CMS/WMCoreDiscussDashboardReqs

DMWMBot commented 13 years ago

mnorman: A third version of the DashboardReporter, which should apply to TRUNK. This one uses ApMonLite. I don't think it should be committed, but it provides the third option.

Our three options look like this:

1) Use the DashboardInterface. It's the most stripped down, and the fastest, but apparently it has some problems. Since it seems to be opening the socket connection properly, our ability to fix it depends on MonaLisa debugging. If that's no good, we're basically stuck there.

2) Use ApMonLite. That works in ProdAgent, and when we turned it on by accident in WMAgent we got complaints from Dashboard, so it probably sends the right information. We already have that code in WMAgent, but it'll have to be tested to see if it's broadcasting what we want. We already have to support ApMonLite, so that solves the support problem, but I don't have speed tests for it.

3) Use ApMon. Looking over it, the modifications won't be that bad. References to the global locks will have to be removed from the send functions, and all uses of the logger will have to be removed. The problem is that then creates a parallel ApMon; I'm not sure that the Dashboard group wants to maintain something like that.

cinquo commented 13 years ago

mcinquil: Replying to [comment:57 mnorman]:

A third version of the DashboardReporter, which should apply to TRUNK. This one uses ApMonLite. I don't think it should be committed, but it provides the third option.

Our three options look like this:

1) Use the DashboardInterface. It's the most stripped down, and the fastest, but apparently it has some problems. Since it seems to be opening the socket connection properly, our ability to fix it depends on MonaLisa debugging. If that's no good, we're basically stuck there.

This doesn't work.

2) Use ApMonLite. That works in ProdAgent, and when we turned it on by accident in WMAgent we got complaints from Dashboard, so it probably sends the right information. We already have that code in WMAgent, but it'll have to be tested to see if it's broadcasting what we want. We already have to support ApMonLite, so that solves the support problem, but I don't have speed tests for it.

I do not know if it works fine, but if it does... why not?

3) Use ApMon. Looking over it, the modifications won't be that bad. References to the global locks will have to be removed from the send functions, and all uses of the logger will have to be removed. The problem is that then creates a parallel ApMon; I'm not sure that the Dashboard group wants to maintain something like that.

I am going to attach a working version of apmon stripped of threading, customized logging and the like. It still has a DashboardAPI file that can easily be removed. This version could be stripped further, but almost everything has already been removed.

DMWMBot commented 13 years ago

mnorman: I think I put together your patch correctly. It seems to work, but I have a question.

1) Why is there a time.sleep(1) in apmonFree?

2) If that has to be there, do we have to free after every connection? Can't we just free it after running a whole sequence of job messages?

cinquo commented 13 years ago

mcinquil: The sleep can be removed.

In principle the DashboardAPI files can be changed (probably by including the needed parts in DashboardReporterPoller or wherever you prefer, and removing that file if it is not needed; in principle it is just a wrapper around apmon).
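For the "free once per sequence of messages" question, the pattern might look roughly like the sketch below. It uses a plain UDP socket from the standard library instead of apmon, and `report_batch` is a hypothetical helper, so it only illustrates the open-once, send-many, release-once lifecycle rather than the real apmonFree behaviour.

```python
import socket
from contextlib import closing


def report_batch(messages, address):
    """Open one UDP socket, send every message, then release it exactly once."""
    sent = 0
    with closing(socket.socket(socket.AF_INET, socket.SOCK_DGRAM)) as sock:
        for message in messages:
            sock.sendto(message, address)  # no per-message free, no sleep
            sent += 1
    return sent
```

A poller cycle would collect the messages for all jobs that changed state and make a single `report_batch` call, so setup and teardown costs are paid once per cycle instead of once per job.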