Closed WeatherGod closed 6 years ago
I want to try out v6.13.7 to see if it solves a problem of mine on reboot, but I can't build it without a configure script.
I found CI-commit.sh and am trying that approach to generate my own configure script. I'll point out that following those instructions exactly yields:
-bash-4.1$ autoreconf -if
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, `build-aux'.
libtoolize: copying file `build-aux/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIR, `m4'.
libtoolize: copying file `m4/libtool.m4'
libtoolize: copying file `m4/ltoptions.m4'
libtoolize: copying file `m4/ltsugar.m4'
libtoolize: copying file `m4/ltversion.m4'
libtoolize: copying file `m4/lt~obsolete.m4'
configure.ac:83: installing `build-aux/compile'
configure.ac:82: installing `build-aux/config.guess'
configure.ac:82: installing `build-aux/config.sub'
configure.ac:11: installing `build-aux/install-sh'
configure.ac:11: installing `build-aux/missing'
fauxPq/Makefile.am: installing `build-aux/depcomp'
mcast_lib/Makefile.am:9: required directory mcast_lib/FMTP-LDM7/UnidataFMTP does not exist
autoreconf: automake failed with exit status: 1
If I make that directory myself, then it completes making a configure script.
Is there a guide that explains this stuff better? The autoreconf is making a makefile that requires doxygen to complete an install.
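For anyone hitting the same automake error, the workaround described above (creating the missing directory before rerunning autoreconf) can be sketched as follows. The function name and the AUTORECONF override are hypothetical conveniences, not part of the LDM repository, and the thread does not say whether the resulting tree then builds cleanly.

```shell
# Hypothetical helper: create the directory automake reported as missing,
# then regenerate the build system in the given source checkout.
prepare_and_autoreconf() {
    src=${1:-.}    # path to the LDM source checkout (assumed layout)
    mkdir -p "$src/mcast_lib/FMTP-LDM7/UnidataFMTP" || return 1
    # AUTORECONF may be overridden for a dry run; defaults to the real tool.
    (cd "$src" && ${AUTORECONF:-autoreconf} -if)
}

# Usage, from the top of the checkout:
#   prepare_and_autoreconf .
```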
A few things:
What's the problem you're seeing on reboot with v6.13.6?
The bug I am having is the pid file is not getting cleaned up on a clean reboot of the machine, and it prevents the LDM from starting up properly. I am guessing that 3aa162 might address it, so I wanted to test it out to confirm it before officially deploying a fix to my production systems.
On Sat, Jan 13, 2018 at 12:32 PM, Steven Emmerson notifications@github.com wrote:
A few things:
- Version 6.13.7 isn't ready; and
- The code repository is intended for software developers only. Development requires installation of Doxygen, tcl, CUNIT, Google Test, etc. If you don't intend to develop the package, then you should definitely use the distribution instead.
- There is no development guide because that's too low a priority.
What's the problem you're seeing on reboot with v6.13.6?
If the file $HOME/ldmd.pid is preventing the LDM system from starting when the computer is rebooted, then you should add the command ldmadmin clean to the boot-time script for the LDM before the LDM is started. Be sure to execute it as the LDM user.
That's all fine and dandy (although, I would rather not drop my existing queue after a weekly reboot after security updates). My question was, would that commit prevent the problem in the first place?
Assuming that by "that commit" you mean the not-yet-released version 6.13.7, then no: that version doesn't do anything different with $HOME/ldmd.pid.
That file should be removed by the command ldmadmin stop. If it's not, then it's likely that command wasn't used to stop the LDM system. You should ensure that it is.
By "that commit", I mean 3aa162adba
We are doing a normal system reboot, which should go through the normal shutdown process for all services. I figure that the service is getting the signal for shutdown, but not properly executing shutdown, which includes the cleanup of the pid file. If that is not the case, then there is some other bug that needs to be fixed.
I was hoping to confirm whether or not the bug was fixed by what was in master before filing a bug report.
It's easy enough to check whether or not an ldmadmin stop removes the PID file: just execute it manually. Let me know if it doesn't remove the file. If it does remove the file, however, then there's a problem with your shutdown procedure.
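The manual check can be wrapped in a small helper so the result is unambiguous. This is only a sketch: the function name is made up, and the PID file path in the usage line is the $HOME/ldmd.pid mentioned earlier in the thread.

```shell
# Run a stop command, then report whether the given PID file is gone.
# Arguments: PID file path, followed by the stop command and its arguments.
check_pid_file_removed() {
    pid_file=$1
    shift
    "$@" || echo "stop command exited with status $?" >&2
    if [ -e "$pid_file" ]; then
        echo "still present: $pid_file"
        return 1
    fi
    echo "removed: $pid_file"
}

# Usage, as the LDM user:
#   check_pid_file_removed "$HOME/ldmd.pid" ldmadmin stop
```

A non-zero return means the file survived the stop command, which would point at the LDM rather than at the shutdown procedure.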
Indeed, that seems to work fine.
My system is a stock CentOS 6, and the system shutdown was triggered by a manual shutdown -r now. Here is the relevant ldmd log entry for that unclean ldm shutdown:
20180116T154504.048974Z pqact[3892] NOTE pqact.c:111:cleanup() Exiting
20180116T154504.049099Z mrms-ldmout.ncep.noaa.gov[3902] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.049176Z mrms-ldmout.ncep.noaa.gov[3902] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.050738Z mrms-ldmout.ncep.noaa.gov[3898] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.050803Z mrms-ldmout.ncep.noaa.gov[3898] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.050874Z mrms-ldmout.ncep.noaa.gov[3901] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.050916Z mrms-ldmout.ncep.noaa.gov[3901] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.050987Z mrms-ldmout.ncep.noaa.gov[3894] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.051043Z mrms-ldmout.ncep.noaa.gov[3894] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.051808Z mrms-ldmout.ncep.noaa.gov[3893] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.051869Z mrms-ldmout.ncep.noaa.gov[3893] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.051994Z mrms-ldmout.ncep.noaa.gov[3895] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.052047Z mrms-ldmout.ncep.noaa.gov[3895] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.052129Z mrms-ldmout.ncep.noaa.gov[3896] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.052174Z mrms-ldmout.ncep.noaa.gov[3896] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.054827Z mrms-ldmout.ncep.noaa.gov[3897] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.054887Z mrms-ldmout.ncep.noaa.gov[3897] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.056733Z mrms-ldmout.ncep.noaa.gov[3899] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.056799Z mrms-ldmout.ncep.noaa.gov[3899] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.058140Z mrms-ldmout.ncep.noaa.gov[3900] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.058201Z mrms-ldmout.ncep.noaa.gov[3900] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.059828Z ldmd[3890] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.059964Z ldmd[3890] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.060207Z ldmd[3890] ERROR ldmd.c:232:cleanup() pmap_unset(LDMPROG 300029, LDMVERS 5) failed
20180116T154504.060267Z ldmd[3890] ERROR ldmd.c:232:cleanup() pmap_unset(LDMPROG 300029, LDMVERS 6) failed
20180116T154504.060291Z ldmd[3890] NOTE ldmd.c:258:cleanup() Terminating process group
20180116T154504.060527Z mrms-ldmout.ncep.noaa.gov[3901] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.062972Z mrms-ldmout.ncep.noaa.gov[3902] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.059964Z ldmd[3890] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.060207Z ldmd[3890] ERROR ldmd.c:232:cleanup() pmap_unset(LDMPROG 300029, LDMVERS 5) failed
20180116T154504.060267Z ldmd[3890] ERROR ldmd.c:232:cleanup() pmap_unset(LDMPROG 300029, LDMVERS 6) failed
20180116T154504.060291Z ldmd[3890] NOTE ldmd.c:258:cleanup() Terminating process group
20180116T154504.060527Z mrms-ldmout.ncep.noaa.gov[3901] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.062972Z mrms-ldmout.ncep.noaa.gov[3902] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.064730Z mrms-ldmout.ncep.noaa.gov[3893] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.095231Z mrms-ldmout.ncep.noaa.gov[3898] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.096755Z pqact[3892] NOTE pqact.c:128:cleanup() Behind by 12.479 s
20180116T154504.097370Z mrms-ldmout.ncep.noaa.gov[3896] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.103982Z mrms-ldmout.ncep.noaa.gov[3894] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.104973Z mrms-ldmout.ncep.noaa.gov[3895] NOTE ldmd.c:306:signal_handler() SIGTERM received
Then, we have a few of the following entries in the log periodically, until I finally realized that the LDM didn't start up (yes, I know how to get notified, but our emails weren't working at that moment).
20180116T160001.787528Z uldbutil[1865] NOTE uldb.c:1071:sm_setShmId() No such file or directory
20180116T160001.789038Z uldbutil[1865] NOTE uldb.c:1071:sm_setShmId() Couldn't get shared-memory segment identifier
20180116T160001.789059Z uldbutil[1865] NOTE uldb.c:1095:sm_attach() Couldn't get shared-memory segment
20180116T160001.789070Z uldbutil[1865] NOTE uldb.c:1176:sm_init() Couldn't attach shared-memory segment
20180116T160001.789079Z uldbutil[1865] NOTE uldb.c:1970:uldb_open() Couldn't initialize shared-memory component
20180116T160001.789087Z uldbutil[1865] NOTE uldbutil.c:95:main() The upstream LDM database doesn't exist
20180116T160001.789096Z uldbutil[1865] NOTE uldbutil.c:96:main() Is the LDM running?
The multiple "Terminating process group" messages in the log file indicate that more than one SIGTERM was sent to the top-level LDM server.
The lack of "Exiting" messages from multiple downstream LDM processes indicates that the shutdown procedure didn't wait for the LDM system to terminate.
It's likely that one or both of these behaviors is the cause of the problem you're encountering.
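One rough way to check this against a log file is to tally, per process, how many "SIGTERM received" notices and how many "Exiting" notices were logged. The sketch below assumes the log layout shown above, with the process name and PID in the second whitespace-separated field; the function name is made up for illustration.

```shell
# List processes that logged more "SIGTERM received" notices than
# "cleanup() Exiting" notices. Output columns: process, SIGTERMs, exits.
sigterm_without_exit() {
    awk '
        index($0, "SIGTERM received")  { term[$2]++ }
        index($0, "cleanup() Exiting") { exited[$2]++ }
        END {
            for (p in term)
                if (term[p] > exited[p] + 0)
                    print p, term[p], exited[p] + 0
        }
    ' "$@" | sort
}

# Usage (the log path is an assumption):
#   sigterm_without_exit ~/var/logs/ldmd.log
```

Pasted-in duplicate lines inflate both counts, so the tally is only as reliable as the log excerpt it is fed.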
What could have sent more than one SIGTERM? Like I said, I am using a bare CentOS 6 VM. It is used just for this one LDM. I don't have anything non-standard (that I am aware of). Is there anything else I can do to help you diagnose/debug this problem?
The timestamps on the duplicated messages from the top-level LDM server (PID 3890) are identical to the microsecond. It would seem, therefore, that you have a logging or cut-and-paste problem.
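Exact duplicates like these are easy to confirm mechanically, since two genuinely separate log events would almost never share a microsecond timestamp. A minimal sketch using standard tools (the function name is hypothetical):

```shell
# Print each line that occurs more than once in the given log file(s).
duplicate_log_lines() {
    sort "$@" | uniq -d
}

# Usage (the file name is an assumption):
#   duplicate_log_lines pasted-excerpt.log
```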
I would be more concerned about the lack of "Exiting" message because it indicates that the shutdown procedure didn't wait for the "ldmadmin stop" process to terminate. Terminating that process prematurely would result in the file "ldmd.pid" not being deleted as well as the other problems you've encountered.
I recommend investigating your shutdown procedure.
"investigating your shutdown procedure" --- ????
What do you mean by shutdown procedure? Something wrong with shutdown -r now? Or something wrong with sysvinit (or whatever the system is called nowadays)? Or something wrong with the shutdown script for ldm? There is nothing custom about my system. It is a stock CentOS 6 VM. And I installed LDM from the source tarball available from the FTP site as-is with no modifications.
If the command "ldmadmin stop" correctly stops the LDM system so that a subsequent "ldmadmin start" correctly restarts it but shutting down the computer results in problems, then the problem is the shutdown procedure and not the LDM. Does your shutdown procedure execute the command "ldmadmin stop"? If not, then it should. If it does, then does it wait for that command to terminate? If not, then it should.
Perhaps you should execute the command "ldmadmin stop" (and wait for it to terminate) before executing "shutdown -r".
I think I might know your problem. Is something like the boot-time script at https://www.unidata.ucar.edu/software/ldm/ldm-current/basics/configuring.html#boot installed in /etc/init.d and has chkconfig(1) been run on it? If not, then shutdown won't know how to stop the LDM system.
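For readers who have not seen such a script, here is a minimal sketch of what a SysV-style init script for the LDM might look like. It is not the script from the Unidata page linked above. The account name ldm, the chkconfig priorities, and the lock file path are all assumptions. To my understanding, CentOS 6 runs a service's stop action at shutdown only when a matching file exists under /var/lock/subsys, which is why the sketch creates one on start.

```shell
#!/bin/sh
# chkconfig: 345 99 01
# description: Starts and stops the LDM (illustrative sketch only)

LDM_USER=${LDM_USER:-ldm}                      # assumed LDM account name
LOCK_FILE=${LOCK_FILE:-/var/lock/subsys/ldm}   # assumed; name matches the script

# Run an ldmadmin subcommand as the LDM user in a login shell.
run_ldmadmin() {
    su - "$LDM_USER" -c "ldmadmin $1"
}

ldm_service() {
    case "$1" in
        start)
            run_ldmadmin clean    # clear a stale ldmd.pid before starting
            run_ldmadmin start && touch "$LOCK_FILE"
            ;;
        stop)
            run_ldmadmin stop     # returns only after ldmadmin stop finishes
            status=$?
            rm -f "$LOCK_FILE"
            return $status
            ;;
        *)
            echo "Usage: $0 {start|stop}" >&2
            return 2
            ;;
    esac
}

# Dispatch only when called with an argument, as init does.
if [ "$#" -gt 0 ]; then
    ldm_service "$1"
fi
```

Installed as /etc/init.d/ldm and made executable, it would be registered with chkconfig --add ldm. Because the stop branch waits for ldmadmin stop to return, the shutdown sequence does not move on while the LDM is still cleaning up.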
That would seem to be the case. Haven't done a reboot yet to confirm that all the problems are solved, but this seems to be what was needed.
I was working off of this page: https://www.unidata.ucar.edu/software/ldm/ldm-current/basics/index.html. I see now that the init script and chkconfig steps are under the Configuring an LDM install page, which I skipped because I already had my ldmd.conf and other configuration files. I also assumed that make install took care of installing the init script and calling chkconfig, because it needed the root password. You can't tell what it is doing in those steps.
Perhaps the "Ensure that the LDM is started at boot-time" page should be pulled out and put into its own header in the table of contents? After reading it, I can see why it would be tricky to automatically do this part in the makefile.
This was fixed with 6.13.11.
v6.13.7 isn't available at ftp://ftp.unidata.ucar.edu/pub/ldm/. I tried to use the tarball from github, but it doesn't have the configure script, and there isn't any documentation on how to build it via autoconf.