aps-8id-dys / ipython-8idiuser

8-ID-I ipython configuration for bluesky (and other)

DM_support workflow stops processing with no error shown #86

Closed prjemian closed 4 years ago

prjemian commented 4 years ago

At some point this past week (after roughly half an hour) while the data management workflow was running (from SPEC), the workflow stopped queueing jobs but provided no printed output as to why. Peculiarly, it initially ran for days without trouble.

prjemian commented 4 years ago

Another aspect: one time, when the helper seemed to stall, the user quit the helper with ^C, then tried to restart it. The helper reported a problem since the trigger value was 1 as it started. The user could not set the PV value back to zero because SPEC was holding that value at 1. Only by quitting SPEC was the user able to reset the trigger PV to zero, then start the helper.

prjemian commented 4 years ago

The helper logs information in this directory: /home/beams/8IDIUSER/.ipython-bluesky/profile_bluesky/startup/spec_support/.logs

It would be nice to combine these with the normal session logs. I had trouble getting the different threads to log into the main logger; this is a workaround.

prjemian commented 4 years ago

The .workflow_helper.log file shows these lines from around the time this happened:

|2019-10-11 22:08:41.090|INFO|spec_DM_support|spec_DM_support|288|MainThread| - before starting data management workflow
|2019-10-11 22:08:41.114|INFO|spec_DM_support|spec_DM_support|293|MainThread| - after starting data management workflow (0.023s)
|2019-10-11 22:08:41.114|INFO|spec_DM_support|spec_DM_support|294|MainThread| - workflow file: /home/8-id-i/2019-3/qzhang201910/O026_Silica_D100_att5_Rq0_00005/O026_Silica_D100_att5_Rq0_00005_0001-100000.hdf
|2019-10-11 22:08:44.886|INFO|spec_DM_support|spec_DM_support|288|MainThread| - before starting data management workflow
|2019-10-11 22:08:44.909|INFO|spec_DM_support|spec_DM_support|293|MainThread| - after starting data management workflow (0.023s)
|2019-10-11 22:08:44.909|INFO|spec_DM_support|spec_DM_support|294|MainThread| - workflow file: /home/8-id-i/2019-3/qzhang201910/O026_Silica_D100_att5_Rq0_00006/O026_Silica_D100_att5_Rq0_00006_0001-100000.hdf
|2019-10-11 22:22:45.913|INFO|spec_DM_support|spec_DM_support|262|MainThread| - workflow helper starting
|2019-10-11 22:23:02.011|INFO|spec_DM_support|spec_DM_support|262|MainThread| - workflow helper starting
|2019-10-11 22:25:39.397|INFO|spec_DM_support|spec_DM_support|262|MainThread| - workflow helper starting
|2019-10-11 22:28:45.714|INFO|spec_DM_support|spec_DM_support|262|MainThread| - workflow helper starting
|2019-10-11 22:29:29.424|INFO|spec_DM_support|spec_DM_support|288|MainThread| - before starting data management workflow
|2019-10-11 22:29:29.448|INFO|spec_DM_support|spec_DM_support|293|MainThread| - after starting data management workflow (0.024s)
|2019-10-11 22:29:29.448|INFO|spec_DM_support|spec_DM_support|294|MainThread| - workflow file: /home/8-id-i/2019-3/qzhang201910/O029_Silica_D100_att5_Rq0_00001/O029_Silica_D100_att5_Rq0_00001_0001-100000.hdf
|2019-10-11 22:29:33.274|INFO|spec_DM_support|spec_DM_support|288|MainThread| - before starting data management workflow
|2019-10-11 22:29:33.297|INFO|spec_DM_support|spec_DM_support|293|MainThread| - after starting data management workflow (0.023s)
|2019-10-11 22:29:33.298|INFO|spec_DM_support|spec_DM_support|294|MainThread| - workflow file: /home/8-id-i/2019-3/qzhang201910/O029_Silica_D100_att5_Rq0_00002/O029_Silica_D100_att5_Rq0_00002_0001-100000.hdf
|2019-10-11 22:29:36.979|INFO|spec_DM_support|spec_DM_support|288|MainThread| - before starting data management workflow
|2019-10-11 22:29:37.003|INFO|spec_DM_support|spec_DM_support|293|MainThread| - after starting data management workflow (0.023s)
|2019-10-11 22:29:37.004|INFO|spec_DM_support|spec_DM_support|294|MainThread| - workflow file: /home/8-id-i/2019-3/qzhang201910/O029_Silica_D100_att5_Rq0_00003/O029_Silica_D100_att5_Rq0_00003_0001-100000.hdf

Note: processes O027 and O028 are missing from the logs.

prjemian commented 4 years ago

Python logging package documentation: https://docs.python.org/3/library/logging.html

prjemian commented 4 years ago

It may be time to use the RotatingFileHandler feature.

prjemian commented 4 years ago

The RotatingFileHandler switches to a new file when the current file's size would exceed a specified threshold. It can also limit the number of files kept in the rotation.

Based on what we've seen so far, I suggest a log file size of 1 MB and a rotation of no more than 9 files.

To implement this, I plan to add optional parameters to the support library that select this file handler as described.

prjemian commented 4 years ago

Something like this: standard_logging_setup(logger_name, maxBytes=1024*1024, backupCount=9)
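
For illustration, here is a minimal stdlib-only sketch of the rotation such a call would configure. The RotatingFileHandler class and its maxBytes/backupCount parameters are from Python's logging.handlers; the function body and the log_file parameter are assumptions, not the actual stdlogpj implementation:

import logging
from logging.handlers import RotatingFileHandler

def rotating_logging_setup(logger_name, log_file, maxBytes=1024*1024, backupCount=9):
    """Hypothetical stand-in for stdlogpj.standard_logging_setup() with rotation."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    # Roll over once the file would exceed maxBytes; keep at most backupCount old files.
    handler = RotatingFileHandler(log_file, maxBytes=maxBytes, backupCount=backupCount)
    handler.setFormatter(logging.Formatter(
        "|%(asctime)s|%(levelname)s|%(process)d|%(name)s|%(module)s"
        "|%(lineno)d|%(threadName)s| - %(message)s"
    ))
    logger.addHandler(handler)
    return logger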

prjemian commented 4 years ago

Testing the revised logging code (35,000 calls, maxBytes=1024*1024, backupCount=5) shows about 0.6 ms per logged line of 176 characters, including file size check and occasional file rotations.

prjemian commented 4 years ago

Run conda update -c aps-anl-tag stdlogpj (should be v1.0.2).

prjemian commented 4 years ago

Updated the files for account 8idiuser.

@qzhang234 : Can you restart the helper code? (That will start using the new log file size management options.) @sureshnaps and I agreed today that the workflow helper may have appeared stalled because the log files had grown to enormous proportions. This is the fix.

prjemian commented 4 years ago

There are a couple more instances of logging in the SPEC workflow helper. Re-opening for those.

qzhang234 commented 4 years ago

There was an error when I tried to start the helper code:

(screenshot: Screen Shot 2019-10-14 at 10 44 52 PM)

prjemian commented 4 years ago

@qzhang234 : You are too quick; I did not expect you at work now. I'm still editing and there is more to be done. It seems I implemented direct file updates because the logging was not working as I expected. The documentation for logging says it is supposed to work the same in threads as in the main program. I'll switch to that now.
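
For reference, the logging docs' claim holds because each handler serializes records with an internal lock; a tiny illustration of a worker thread writing through the same module-level logger (the thread name here is illustrative):

import logging
import threading

logger = logging.getLogger("spec_DM_support")

def worker():
    # thread-safe: records from any thread pass through the same handlers
    logger.info("message from %s", threading.current_thread().name)

t = threading.Thread(target=worker, name="helper-thread")
t.start()
t.join()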

prjemian commented 4 years ago

Here's the fix for the error you reported:

BYTE = 1
kB = 1024 * BYTE
MB = 1024 * kB

qzhang234 commented 4 years ago

@prjemian It's OK, I can do it tomorrow. The alignment will take about an hour, so take your time. @sureshnaps To switch to Pete's workflow, do I go to ccd_settings_oldstorage.do (the one in Rigaku_8m_Suresh.do) and change 'Use_SPEC_DM_Workflow' to 0?

qzhang234 commented 4 years ago

@prjemian It's now complaining that it can't find conda env 'start', but the helper code did not crash. Shall I leave it as it is?

(screenshot: Screen Shot 2019-10-14 at 10 54 42 PM)

prjemian commented 4 years ago

@qzhang234 : That's the wrong program; type: ./daemon.sh start

I'm ready now for you to try it.

prjemian commented 4 years ago

@qzhang234 @sureshnaps What time will you start in the morning?

qzhang234 commented 4 years ago

Yes, now it executed without error and the helper box is updating.

We plan to start around 9 or 9:30

prjemian commented 4 years ago

Not closing the issue until we are convinced it is resolved.

prjemian commented 4 years ago

I'll join you as soon as I get there. Laws of physics (and the state of Illinois) predict that is ~9:30-ish.

prjemian commented 4 years ago

Oh goody!

prjemian commented 4 years ago

Workflow helper logs here:

/home/beams/8IDIUSER/.ipython-bluesky/profile_bluesky/startup/spec_support/.logs/workflow_helper.log

prjemian commented 4 years ago

Looks like this:

|2019-10-14 22:54:39.812|INFO|30475|spec_DM_support|spec_DM_support|266|MainThread| - workflow helper starting
|2019-10-14 22:54:39.812|INFO|30475|spec_DM_support|spec_DM_support|266|MainThread| - workflow helper starting
|2019-10-14 23:02:50.686|INFO|32043|spec_DM_support|spec_DM_support|266|MainThread| - workflow helper starting
|2019-10-14 23:02:50.686|INFO|32043|spec_DM_support|spec_DM_support|266|MainThread| - workflow helper starting
|2019-10-14 23:03:37.786|INFO|32191|spec_DM_support|spec_DM_support|266|MainThread| - workflow helper starting
|2019-10-14 23:03:37.786|INFO|32191|spec_DM_support|spec_DM_support|266|MainThread| - workflow helper starting

Why the identical, duplicated entries?

prjemian commented 4 years ago

Ok, I understand: there are two loggers writing to this file. I'll need to make one of them look different from the other to identify which to comment out. Probably this one: https://github.com/aps-8id-trr/ipython-8idiuser/blob/4c1ff36ad4039a6c0ac2ccfea17a557aecfd9d1f/profile_bluesky/startup/spec_support/APS_DM_8IDI.py#L36-L39
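
For context, duplicated lines like these typically appear when the same records reach two handlers pointed at the same file, either directly or via propagation to an ancestor logger; a minimal sketch of the usual guards (names are illustrative, not from the repo):

import logging

def get_file_logger(name, log_file):
    """Attach the file handler only once, and stop propagation (illustrative)."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # guard: repeated setup calls won't add a second handler
        handler = logging.FileHandler(log_file)
        handler.setFormatter(logging.Formatter("|%(asctime)s|%(levelname)s| - %(message)s"))
        logger.addHandler(handler)
    logger.propagate = False  # avoid a second copy via ancestor loggers
    return logger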

prjemian commented 4 years ago

Now: https://github.com/aps-8id-trr/ipython-8idiuser/blob/6296599e7c7fc412debd6b33ad451319f076a331/profile_bluesky/startup/spec_support/APS_DM_8IDI.py#L35-L39

The helper needs a restart: ./daemon.sh restart

sureshnaps commented 4 years ago

@qzhang234 Yes, set Use_SPEC_DM_Workflow = 0. Note that there is one initialization at the top of the ccd_settings file and one under copy_local_data; change both, definitely the one in copy_local_data.

There is a copy_local_data_old, which you can leave as is.

qzhang234 commented 4 years ago

@prjemian Helper is currently running. Should I kill it (kill pid) and restart?

qzhang234 commented 4 years ago

I switched Use_SPEC_DM_Workflow = 0 at both locations (the one at the front was at 0 all the time, and somehow it ran), but all jobs are failing. Not sure if I'm starting the workflow wrong.

The job index is Z056.

Switching back to the old SPEC workflow for now.

prjemian commented 4 years ago

The workflow log file is updating now. Are you using the helper? It looks like jobs are processing: the HTML process-view window is updating with completed jobs.

All OK now?

qzhang234 commented 4 years ago

@prjemian @sureshnaps The DM workflow stalled at the same stage while the helper was alive and running. I'm switching to the old SPEC workflow for now.

(screenshot: Screen Shot 2019-10-15 at 12 52 29 PM)

sureshnaps commented 4 years ago

@prjemian How hard would it be to call the python code that creates the new metadata, without the helper? Is this a step we can take in the interim?

prjemian commented 4 years ago

It's not hard to call that code, but there is no advantage either. The point of the helper is to connect to all the EPICS PVs (the step that takes the longest) before you need to read them. Calling the connect code after each acquisition is what slows down the process.

I just updated the code around <trigger_PV>.put(0) to ensure it does not wait AND uses a timeout of no more than 1 s. Can you try again?
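
For reference, a non-blocking put with a bounded timeout looks roughly like this in pyepics; the PV name below is a placeholder, not the beamline's actual trigger PV:

from epics import PV

trigger_pv = PV("8idi:example:trigger")  # placeholder name

# wait=False returns immediately; timeout bounds any internal waiting to 1 s
trigger_pv.put(0, wait=False, timeout=1.0)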

prjemian commented 4 years ago

Here are the last two acquisitions as logged by the helper:

|2019-10-15 12:37:35.274|DEBUG|14167|spec_DM_support|spec_DM_support|284|MainThread| - workflow handling triggered
|2019-10-15 12:37:35.274|INFO|14167|spec_DM_support|spec_DM_support|293|MainThread| - calling start_workflow(analysis=1.0)
|2019-10-15 12:37:35.397|INFO|14167|spec_DM_support|spec_DM_support|298|MainThread| - after starting data management workflow (0.123s)
|2019-10-15 12:37:35.397|INFO|14167|spec_DM_support|spec_DM_support|299|MainThread| - workflow file: /net/s8iddata-old/export/old_8-id-i/2019-3/rinaldi201910/E072_HA_0.04_0_15_180nm_att3_Rq0_00018/E072_HA_0.04_0_15_180nm_att3_Rq0_00018_0001-100000.hdf
|2019-10-15 12:37:35.398|DEBUG|14167|spec_DM_support|spec_DM_support|301|MainThread| - reset trigger: 1.0 (should be '0')
|2019-10-15 12:37:40.200|DEBUG|14167|spec_DM_support|spec_DM_support|284|MainThread| - workflow handling triggered
|2019-10-15 12:37:40.202|INFO|14167|spec_DM_support|spec_DM_support|293|MainThread| - calling start_workflow(analysis=1.0)
|2019-10-15 12:37:40.322|INFO|14167|spec_DM_support|spec_DM_support|298|MainThread| - after starting data management workflow (0.120s)
|2019-10-15 12:37:40.322|INFO|14167|spec_DM_support|spec_DM_support|299|MainThread| - workflow file: /net/s8iddata-old/export/old_8-id-i/2019-3/rinaldi201910/E072_HA_0.04_0_15_180nm_att3_Rq0_00019/E072_HA_0.04_0_15_180nm_att3_Rq0_00019_0001-100000.hdf
|2019-10-15 12:37:40.323|DEBUG|14167|spec_DM_support|spec_DM_support|301|MainThread| - reset trigger: 1.0 (should be '0')

sureshnaps commented 4 years ago

I see, I get it. We will try it at the next sample start.

qzhang234 commented 4 years ago

@prjemian @sureshnaps The current users intend to run as-is until the end. Maybe we can test it next Monday when the beam is down?

prjemian commented 4 years ago

Ok. Agreed, they have been tortured enough.

prjemian commented 4 years ago

The latest version is running now and should be as ready as we can make it.

|2019-10-15 13:46:10.010|INFO|22587|spec_DM_support|spec_DM_support|266|MainThread| - workflow helper starting

sureshnaps commented 4 years ago

@prjemian The original version that produced multiple files worked very well; the only problem was the multiple files. One thought: just call it once and don't worry about multiple times. The original code was meant to run once, since it handles real-time metadata, which occurs only once for a dataset. Subsequent calls should simply return without doing anything if the file exists.

Will this strategy help, or is the problem something else?
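
A minimal sketch of the run-once guard described above; the function and file names are illustrative, not the repo's actual code:

import os

def create_metadata_once(metadata_file, create_fn):
    """Call create_fn only if the metadata file does not already exist (illustrative)."""
    if os.path.exists(metadata_file):
        return  # already created: later triggers become no-ops
    create_fn(metadata_file)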

prjemian commented 4 years ago

I'll come around about 2. This is one approach to try.

sureshnaps commented 4 years ago

Pete, sure, thanks. At 2:30 we have a beamline Ops meeting, for which we will disappear for 30 minutes or so. Do you want to do it after 3? Suresh

prjemian commented 4 years ago

Ok

prjemian commented 4 years ago

@sureshnaps : Following up on your comment: the previous code called the data management process multiple times only because it ran through the same loop multiple times for a single trigger. Now it executes that loop once per trigger, by detecting the trigger's leading edge (sketched below).
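
A minimal sketch of that leading-edge detection; the polling loop and names are illustrative assumptions, not the repo's actual code:

import time

def watch_trigger(read_trigger, handle_workflow, poll_s=0.1):
    """Fire handle_workflow once per 0 -> 1 transition of the trigger (illustrative)."""
    previous = read_trigger()
    while True:
        current = read_trigger()
        if previous == 0 and current == 1:  # act only on the leading edge
            handle_workflow()
        previous = current
        time.sleep(poll_s)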

prjemian commented 4 years ago

The problem discovered in spec_DM_support.py was a syntax error that caused the python process to exit with a traceback. Since it was launched from a shell script, the error was not reported directly to the console. And the console, running in a screen session, closed immediately when the python process ended.

In short: the error message did not print.

The error was found by running helper.py in a console window, after activating the bluesky conda environment.

I suggest continuing to run helper.py in a console window until the software is proven stable.