Unidata / LDM

The Unidata Local Data Manager (LDM) system includes network client and server programs designed for event-driven data distribution, and is the fundamental component of the Unidata Internet Data Distribution (IDD) system.
http://www.unidata.ucar.edu/software/ldm

v6.13.7 not available? #65

Closed WeatherGod closed 6 years ago

WeatherGod commented 6 years ago

v6.13.7 isn't available at ftp://ftp.unidata.ucar.edu/pub/ldm/. I tried to use the tarball from github, but it doesn't have the configure script, and there isn't any documentation on how to build it via autoconf.

WeatherGod commented 6 years ago

I want to try out v6.13.7 to see if it solves a problem of mine on reboot, but I can't build it without a configure script.

WeatherGod commented 6 years ago

I found CI-commit.sh and am trying that approach to generate my own configure script. I'll point out that following those instructions exactly yields:

-bash-4.1$ autoreconf -if
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, `build-aux'.
libtoolize: copying file `build-aux/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIR, `m4'.
libtoolize: copying file `m4/libtool.m4'
libtoolize: copying file `m4/ltoptions.m4'
libtoolize: copying file `m4/ltsugar.m4'
libtoolize: copying file `m4/ltversion.m4'
libtoolize: copying file `m4/lt~obsolete.m4'
configure.ac:83: installing `build-aux/compile'
configure.ac:82: installing `build-aux/config.guess'
configure.ac:82: installing `build-aux/config.sub'
configure.ac:11: installing `build-aux/install-sh'
configure.ac:11: installing `build-aux/missing'
fauxPq/Makefile.am: installing `build-aux/depcomp'
mcast_lib/Makefile.am:9: required directory mcast_lib/FMTP-LDM7/UnidataFMTP does not exist
autoreconf: automake failed with exit status: 1

If I make that directory myself, then it completes making a configure script.
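The workaround described above can be sketched as a small shell function. The directory name is taken from the automake error; the function name `bootstrap_configure` and the `AUTORECONF` override are hypothetical and exist only for illustration. An empty directory merely satisfies automake, so whatever is supposed to live in it may still fail to build.

```shell
# Sketch only: run from the top of an LDM source checkout.
# bootstrap_configure and the AUTORECONF override are hypothetical names.
bootstrap_configure() {
    # Create the directory that automake reported as missing.
    mkdir -p mcast_lib/FMTP-LDM7/UnidataFMTP || return 1
    # Regenerate configure and the auxiliary build files.
    "${AUTORECONF:-autoreconf}" -if
}
```

Calling `bootstrap_configure` in the source tree should leave a `configure` script behind if autoreconf succeeds.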

WeatherGod commented 6 years ago

Is there a guide that explains this stuff better? The autoreconf is making a makefile that requires doxygen to complete an install.

semmerson commented 6 years ago

A few things:

  1. Version 6.13.7 isn't ready.
  2. The code repository is intended for software developers only. Development requires installation of Doxygen, tcl, CUNIT, Google Test, etc. If you don't intend to develop the package, then you should definitely use the distribution instead.
  3. There is no development guide because that's too low a priority.

What's the problem you're seeing on reboot with v6.13.6?

WeatherGod commented 6 years ago

  1. So why do you have it as a tag? A v6.13.x branch would be better. The tag is entirely misleading, especially since GitHub presents tags as releases.
  2. Agreed, which is why I was surprised that there was a tag and a "release" tarball on GitHub but nothing on the FTP site. Also, that list of developer dependencies would be a good start for documentation.
  3. Take it from me. The project that I am heavily involved in, matplotlib, would have been dead in the water 5 years ago if we didn't have some semblance of a release guide. The project's founder made it a point to recruit contributors to be co-release managers. Each manager would fix up the documentation as they went through a trial-by-fire, and the bus factor slowly grew. Looking over some of the scripts you have, they seem to be tied to a single person's account and its specific environment.

The bug I am having is the pid file is not getting cleaned up on a clean reboot of the machine, and it prevents the LDM from starting up properly. I am guessing that 3aa162 might address it, so I wanted to test it out to confirm it before officially deploying a fix to my production systems.

semmerson commented 6 years ago

If the file $HOME/ldmd.pid is preventing the LDM system from starting when the computer is rebooted, then you should add the command ldmadmin clean to the boot-time script for the LDM before the LDM is started. Be sure to execute it as the LDM user.
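For illustration, a narrower boot-time check could remove the PID file only when the process it names is gone. This is a sketch, not part of the LDM distribution: the function name `remove_stale_pid_file` is hypothetical, and it assumes the file holds the server's PID on its first line. It does not call ldmadmin at all.

```shell
# Hypothetical helper; assumes the first line of the file is a PID.
remove_stale_pid_file() {
    pid_file=$1
    [ -f "$pid_file" ] || return 0            # no file, nothing to clean
    pid=$(head -n 1 "$pid_file" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*)                          # empty or non-numeric content
            rm -f "$pid_file"
            return 0
            ;;
    esac
    # kill -0 sends no signal; it only reports whether the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        return 1                              # process is alive: keep the file
    fi
    rm -f "$pid_file"                         # process is gone: file is stale
}
```

Run as the LDM user, `remove_stale_pid_file "$HOME/ldmd.pid"` returns 1 and leaves the file alone if the recorded process still exists. PIDs can be reused after a reboot, so a live but unrelated process owned by the same user would also keep the file in place.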

WeatherGod commented 6 years ago

That's all fine and dandy (although I would rather not drop my existing queue on every weekly reboot following security updates). My question was, would that commit prevent the problem in the first place?

semmerson commented 6 years ago

Assuming that by "that commit" you mean the not-yet-released version 6.13.7, then no: that version doesn't do anything different with $HOME/ldmd.pid.

That file should be removed by the command ldmadmin stop. If it's not, then it's likely that command wasn't used to stop the LDM system. You should ensure that it is.

WeatherGod commented 6 years ago

By "that commit", I mean 3aa162adba

We are doing a normal system reboot, which should go through the normal shutdown process for all services. I figure that the service is getting the signal for shutdown, but not properly executing shutdown, which includes the cleanup of the pid file. If that is not the case, then there is some other bug that needs to be fixed.

I was hoping to confirm whether or not the bug was fixed by what was in master before filing a bug report.

semmerson commented 6 years ago

It's easy enough to check whether or not an ldmadmin stop removes the PID file: just execute it manually. Let me know if it doesn't remove the file. If it does remove the file, however, then there's a problem with your shutdown procedure.
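The manual check can be wrapped in a few lines of shell. This is only a sketch: `check_stop_removes_pid` and the `LDMADMIN` override are hypothetical names, and the default path assumes the PID file is $HOME/ldmd.pid as mentioned earlier in the thread.

```shell
# Hypothetical wrapper around the manual check; run as the LDM user.
check_stop_removes_pid() {
    pid_file=${1:-$HOME/ldmd.pid}
    # Stop the LDM; give up if the stop command itself fails.
    "${LDMADMIN:-ldmadmin}" stop || return 2
    if [ -e "$pid_file" ]; then
        echo "PID file still present: $pid_file"
        return 1
    fi
    echo "PID file removed: $pid_file"
}
```

A return status of 1 would point at the stop command; a status of 0 would point at whatever stops the LDM during system shutdown.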

WeatherGod commented 6 years ago

Indeed, that seems to work fine.

My system is a stock CentOS 6. The system shutdown was triggered by a manual shutdown -r now. Here are the relevant ldmd log entries for that unclean LDM shutdown:

20180116T154504.048974Z pqact[3892] NOTE pqact.c:111:cleanup() Exiting
20180116T154504.049099Z mrms-ldmout.ncep.noaa.gov[3902] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.049176Z mrms-ldmout.ncep.noaa.gov[3902] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.050738Z mrms-ldmout.ncep.noaa.gov[3898] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.050803Z mrms-ldmout.ncep.noaa.gov[3898] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.050874Z mrms-ldmout.ncep.noaa.gov[3901] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.050916Z mrms-ldmout.ncep.noaa.gov[3901] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.050987Z mrms-ldmout.ncep.noaa.gov[3894] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.051043Z mrms-ldmout.ncep.noaa.gov[3894] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.051808Z mrms-ldmout.ncep.noaa.gov[3893] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.051869Z mrms-ldmout.ncep.noaa.gov[3893] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.051994Z mrms-ldmout.ncep.noaa.gov[3895] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.052047Z mrms-ldmout.ncep.noaa.gov[3895] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.052129Z mrms-ldmout.ncep.noaa.gov[3896] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.052174Z mrms-ldmout.ncep.noaa.gov[3896] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.054827Z mrms-ldmout.ncep.noaa.gov[3897] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.054887Z mrms-ldmout.ncep.noaa.gov[3897] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.056733Z mrms-ldmout.ncep.noaa.gov[3899] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.056799Z mrms-ldmout.ncep.noaa.gov[3899] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.058140Z mrms-ldmout.ncep.noaa.gov[3900] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.058201Z mrms-ldmout.ncep.noaa.gov[3900] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.059828Z ldmd[3890] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.059964Z ldmd[3890] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.060207Z ldmd[3890] ERROR ldmd.c:232:cleanup() pmap_unset(LDMPROG 300029, LDMVERS 5) failed
20180116T154504.060267Z ldmd[3890] ERROR ldmd.c:232:cleanup() pmap_unset(LDMPROG 300029, LDMVERS 6) failed
20180116T154504.060291Z ldmd[3890] NOTE ldmd.c:258:cleanup() Terminating process group
20180116T154504.060527Z mrms-ldmout.ncep.noaa.gov[3901] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.062972Z mrms-ldmout.ncep.noaa.gov[3902] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.059964Z ldmd[3890] NOTE ldmd.c:187:cleanup() Exiting
20180116T154504.060207Z ldmd[3890] ERROR ldmd.c:232:cleanup() pmap_unset(LDMPROG 300029, LDMVERS 5) failed
20180116T154504.060267Z ldmd[3890] ERROR ldmd.c:232:cleanup() pmap_unset(LDMPROG 300029, LDMVERS 6) failed
20180116T154504.060291Z ldmd[3890] NOTE ldmd.c:258:cleanup() Terminating process group
20180116T154504.060527Z mrms-ldmout.ncep.noaa.gov[3901] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.062972Z mrms-ldmout.ncep.noaa.gov[3902] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.064730Z mrms-ldmout.ncep.noaa.gov[3893] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.095231Z mrms-ldmout.ncep.noaa.gov[3898] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.096755Z pqact[3892] NOTE pqact.c:128:cleanup() Behind by 12.479 s
20180116T154504.097370Z mrms-ldmout.ncep.noaa.gov[3896] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.103982Z mrms-ldmout.ncep.noaa.gov[3894] NOTE ldmd.c:306:signal_handler() SIGTERM received
20180116T154504.104973Z mrms-ldmout.ncep.noaa.gov[3895] NOTE ldmd.c:306:signal_handler() SIGTERM received

Then, we have a few of the following entries in the log periodically, until I finally realized that the LDM didn't start up (yes, I know how to get notified, but our emails weren't working at that moment).

20180116T160001.787528Z uldbutil[1865] NOTE uldb.c:1071:sm_setShmId() No such file or directory
20180116T160001.789038Z uldbutil[1865] NOTE uldb.c:1071:sm_setShmId() Couldn't get shared-memory segment identifier
20180116T160001.789059Z uldbutil[1865] NOTE uldb.c:1095:sm_attach() Couldn't get shared-memory segment
20180116T160001.789070Z uldbutil[1865] NOTE uldb.c:1176:sm_init() Couldn't attach shared-memory segment
20180116T160001.789079Z uldbutil[1865] NOTE uldb.c:1970:uldb_open() Couldn't initialize shared-memory component
20180116T160001.789087Z uldbutil[1865] NOTE uldbutil.c:95:main() The upstream LDM database doesn't exist
20180116T160001.789096Z uldbutil[1865] NOTE uldbutil.c:96:main() Is the LDM running?

semmerson commented 6 years ago

The multiple "Terminating process group" messages in the log file indicate that more than one SIGTERM was sent to the top-level LDM server.

The lack of "Exiting" messages from multiple downstream LDM processes indicates that the shutdown procedure didn't wait for the LDM system to terminate.

It's likely that one or both of these behaviors is the cause of the problem you're encountering.
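One way to make this kind of reading less error-prone is to tally, per process, how many "SIGTERM received" and "Exiting" messages appear. The sketch below assumes the log format shown above, where the second whitespace-separated field is `name[pid]`. The function name `summarize_shutdown` is hypothetical.

```shell
# Hypothetical helper: per-process counts of SIGTERM and Exiting messages.
summarize_shutdown() {
    awk '
        / SIGTERM received/ { term[$2]++; seen[$2] = 1 }
        / Exiting/          { done[$2]++; seen[$2] = 1 }
        END {
            # Counters that were never incremented print as 0.
            for (p in seen)
                printf "%s sigterm=%d exiting=%d\n", p, term[p], done[p]
        }
    ' "$@" | sort
}
```

Running `summarize_shutdown` on the LDM log file prints one line per process. A process with `exiting=0` never logged its cleanup, and one with `sigterm` greater than 1 logged the signal more than once.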

WeatherGod commented 6 years ago

What could have sent more than one SIGTERM? Like I said, I am using a bare CentOS 6 VM. It is used just for this one LDM. I don't have anything non-standard (that I am aware of). Is there anything else I can do to help you diagnose/debug this problem?

semmerson commented 6 years ago

The timestamps on the duplicated messages from the top-level LDM server (PID 3890) are identical to the microsecond. It would seem, therefore, that you have a logging or cut-and-paste problem.
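Whether the duplication is in the log file itself or only in the pasted excerpt can be checked by listing lines that occur more than once. This is a minimal sketch using standard tools; the function name `find_duplicate_entries` is hypothetical.

```shell
# Hypothetical helper: print each log line that appears more than once.
find_duplicate_entries() {
    # uniq -d only detects adjacent repeats, so sort first.
    sort "$@" | uniq -d
}
```

Because each entry carries a microsecond timestamp and a PID, any line this prints from the real log file was most likely written twice rather than produced by two separate events.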

I would be more concerned about the lack of "Exiting" message because it indicates that the shutdown procedure didn't wait for the "ldmadmin stop" process to terminate. Terminating that process prematurely would result in the file "ldmd.pid" not being deleted as well as the other problems you've encountered.

I recommend investigating your shutdown procedure.

WeatherGod commented 6 years ago

"investigating your shutdown procedure" --- ????

What do you mean by shutdown procedure? Something wrong with shutdown -r now? Or something wrong with sysvinit (or whatever the system is called nowadays)? Or something wrong with the shutdown script for ldm? There is nothing custom about my system. It is a stock CentOS 6 VM. And I installed LDM from the source tarball available from the FTP site as-is with no modifications.

semmerson commented 6 years ago

If the command "ldmadmin stop" correctly stops the LDM system so that a subsequent "ldmadmin start" correctly restarts it but shutting down the computer results in problems, then the problem is the shutdown procedure and not the LDM. Does your shutdown procedure execute the command "ldmadmin stop"? If not, then it should. If it does, then does it wait for that command to terminate? If not, then it should.

Perhaps you should execute the command "ldmadmin stop" (and wait for it to terminate) before executing "shutdown -r".

semmerson commented 6 years ago

I think I might know your problem. Is something like the boot-time script at https://www.unidata.ucar.edu/software/ldm/ldm-current/basics/configuring.html#boot installed in /etc/init.d and has chkconfig(1) been run on it? If not, then shutdown won't know how to stop the LDM system.
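A quick way to answer that question on a SysV-style system such as CentOS 6 is to look for the init script and for the kill links that chkconfig creates in the halt and reboot runlevels. The sketch below makes several assumptions: the script name `ldmd`, the function name `check_boot_script`, and the optional root prefix are all hypothetical, and the actual name depends on how the boot-time script was installed.

```shell
# Hypothetical check for an installed init script and its shutdown links.
check_boot_script() {
    name=${1:-ldmd}                 # assumed init-script name
    root=${2:-}                     # optional prefix, e.g. a copy of /etc
    script=$root/etc/init.d/$name
    if [ ! -x "$script" ]; then
        echo "missing or not executable: $script"
        return 1
    fi
    rc=0
    for level in 0 6; do            # runlevel 0 is halt, 6 is reboot
        found=no
        for link in "$root/etc/rc$level.d"/K??"$name"; do
            if [ -e "$link" ]; then found=yes; fi
        done
        if [ "$found" = no ]; then
            echo "no kill link for $name in rc$level.d"
            rc=1
        fi
    done
    return $rc
}
```

`chkconfig --list ldmd` reports the same registration from chkconfig's point of view. On Red Hat style systems the rc script also appears to skip kill links for services that have no /var/lock/subsys/<name> file, so the start action of the boot-time script should create that file.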

WeatherGod commented 6 years ago

That would seem to be the case. Haven't done a reboot yet to confirm that all the problems are solved, but this seems to be what was needed.

I was working off of this page: https://www.unidata.ucar.edu/software/ldm/ldm-current/basics/index.html. I see now that the init script and chkconfig steps are under the "Configuring an LDM install" page, which I skipped because I already had my ldmd.conf and other configuration files. I also figured that make install did the job of installing the init script and calling chkconfig because it needed the root password. You can't tell what it is doing in those steps.

Perhaps the "Ensure that the LDM is started at boot-time" page should be pulled out and put into its own header in the table of contents? After reading it, I can see why it would be tricky to automatically do this part in the makefile.

sebenste commented 5 years ago

This was fixed with 6.13.11.