Unidata / LDM

The Unidata Local Data Manager (LDM) system includes network client and server programs designed for event-driven data distribution, and is the fundamental component of the Unidata Internet Data Distribution (IDD) system.
http://www.unidata.ucar.edu/software/ldm

Performing an "ldmadmin restart" in LDM 6.13.13 and earlier can result in queue corruption, data loss #89

Closed sebenste closed 2 years ago

sebenste commented 3 years ago

OS: Centos 7, fully updated

This bug has actually been around for many years, but I was hoping it was vanquished. I guess not...

We were bitten by this bug this evening. After an "ldmadmin restart", the LDM starts normally, but within hours the queue becomes corrupt and data stops being written to physical media. You must then stop the LDM, remake the queue, and start it again, which fixes the issue. If you delete the queue after stopping the LDM, everything is fine on restart, every time. But occasionally, if you only do an "ldmadmin restart", the queue becomes corrupt. In my experience, this is more likely to happen if:

- You are running a high-volume, high file-size count feed (think Level2 radar or CONDUIT)
- You do multiple restarts of the LDM, spaced hours or more apart

This does NOT happen, ever, if the queue is deleted and remade before restarting the LDM, even if you do this:

ldmadmin stop
ldmadmin clean
ldmadmin delqueue
ldmadmin restart

Or this:

ldmadmin stop
ldmadmin delqueue
ldmadmin mkqueue
ldmadmin start

It only happens when doing a straight "ldmadmin restart" command, nothing before or after it. Furthermore, it may not happen until hours after a restart.
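The second sequence above can be wrapped in a small shell function so that the queue is always remade as part of a restart. This is only a sketch of the workaround described in this report, not an official procedure. The function name `safe_restart` and the `LDMADMIN` override variable are hypothetical and exist only so the sequence can be dry-run.

```shell
#!/bin/sh
# Sketch only: wraps the stop / delqueue / mkqueue / start sequence
# from the report. LDMADMIN is a hypothetical override so the function
# can be dry-run (e.g. LDMADMIN=echo); by default it calls ldmadmin.
safe_restart() {
    cmd="${LDMADMIN:-ldmadmin}"
    # Bail out on the first failing step so the LDM is never started
    # against a queue that was not cleanly remade.
    "$cmd" stop     || return 1
    "$cmd" delqueue || return 1
    "$cmd" mkqueue  || return 1
    "$cmd" start
}
```

Setting `LDMADMIN=echo` before calling `safe_restart` prints the four subcommands instead of running them. Keep in mind that deleting the queue discards whatever products it still holds.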

semmerson commented 3 years ago

Are there any relevant messages in the LDM log file?

sebenste commented 3 years ago

Unfortunately, no.

Gilbert

(In reply to Steven Emmerson's email of Saturday, March 27, 2021, 12:27 PM.)

semmerson commented 3 years ago

Is there any evidence of what the problem might be? Does an "ldmadmin restart" indicate anything?

sebenste commented 3 years ago

No. That’s why this has been going on for…well, decades. I can’t reproduce it consistently, and it doesn’t happen often. All I know is, to trip the bug, you have to do an ldmadmin restart. Sometimes it takes one time to do, sometimes a bunch of times, and then sometimes, never. I can’t see any pattern to this.

An ldmadmin restart doesn’t indicate anything, but the queue becomes corrupt minutes to hours (or even a day) after it gets restarted. That’s the crazy thing about this…

(In reply to Steven Emmerson's email of Monday, March 29, 2021, 12:22 PM.)

semmerson commented 3 years ago

How have you determined that the queue becomes corrupt?

sebenste commented 3 years ago

Products stop writing to the disk. Whenever there is queue corruption, that always happens. And, remaking the queue and restarting the LDM always fixes it, without exception.

Gilbert
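Since the only reported symptom is that products stop being written to disk, a simple watchdog can at least flag the condition early. The following is a minimal sketch, assuming that "no recently modified files under the data directory" is an acceptable proxy for the symptom. The function name `has_recent_files` is hypothetical, and the directory and time window are supplied by the caller; no LDM path is assumed.

```shell
#!/bin/sh
# Sketch only: succeeds if any regular file under DIR was modified
# within the last MINUTES minutes, and fails otherwise.
# Usage: has_recent_files DIR MINUTES
has_recent_files() {
    dir="$1"
    minutes="$2"
    # -mmin is a GNU/BSD find extension (available on CentOS 7), not POSIX.
    [ -n "$(find "$dir" -type f -mmin "-$minutes" 2>/dev/null | head -n 1)" ]
}
```

Run from cron with a data directory and a window suited to the feed, a failing check could trigger an alert or a copy of the LDM log before anything is restarted.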

(In reply to Steven Emmerson's email of Monday, March 29, 2021, 1:42 PM.)

semmerson commented 3 years ago

Do you have any processes running outside of the LDM's process group (i.e., not executed by an EXEC entry in the LDM configuration-file) that insert products into the queue?

sebenste commented 3 years ago

It happens on servers that do and do not insert products into the queue. It will even do it on our NOAAport ingester at our dish.

Gilbert

(In reply to Steven Emmerson's email of Tuesday, March 30, 2021, 9:40 AM.)

semmerson commented 3 years ago

> Products stop writing to the disk. Whenever there is queue corruption, that always happens.

If products stop being written to disk, then there should be at least one associated log message (for example, indicating that a pqact(1) process terminated). Would you please check again?

sebenste commented 3 years ago

Unfortunately, the log got wiped out now so I can’t check. But, I will double check when it happens again. Sorry about that…

Gilbert
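To avoid losing the evidence next time, the log can be copied aside as soon as the problem is noticed and before the queue is remade. A minimal sketch follows, assuming the caller knows where the LDM log file lives on their installation. The function name `archive_log` is hypothetical and both paths are passed in as arguments.

```shell
#!/bin/sh
# Sketch only: copy a log file into an archive directory under a
# timestamped name so later rotation or cleanup cannot wipe it.
# Usage: archive_log LOGFILE DESTDIR
archive_log() {
    log="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    # cp -p keeps the original modification time for later correlation.
    cp -p "$log" "$dest/$(basename "$log").$(date +%Y%m%dT%H%M%S)"
}
```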

(In reply to Steven Emmerson's email of Tuesday, March 30, 2021, 12:31 PM.)

sebenste commented 2 years ago

This was fixed in 6.13.14. Closing ticket.