Unidata / LDM

The Unidata Local Data Manager (LDM) system includes network client and server programs designed for event-driven data distribution, and is the fundamental component of the Unidata Internet Data Distribution (IDD) system.
http://www.unidata.ucar.edu/software/ldm
Other
43 stars 27 forks source link

pqact state files should be accurate to the process and not all processes #36

Closed akrherz closed 9 years ago

akrherz commented 9 years ago

With the current release, all pqact processes appear to write the same .state file and not a .state file that is accurate to that pqact process. For example, on a machine that has 9 pqact process, the written out .state file for each at shutdown time has the same timestamp!

Here's the problem. Yesterday, my system feeds NEXRAD2 from idd.unidata.ucar.edu and other feeds from idd.ssec.wisc.edu. The NEXRAD2 feed became very latent, with a feed latency at shutdown time of ~30 minutes. I have 7 pqact processes that only look at that feedtype, so their queue time should have been 30 minutes in the past. The IDS feedtype was current, so this resulted in .state files with the current timestamp . So when I started LDM back up, the 7 pqact processes ignored the incoming new NEXRAD2 data because the pqinsert times were within that recent 30 minute window. After some cussing, I realized what had happened, stopped LDM, removed the state files and started LDM again. Of course, data was missed during this period :(

Please modify pqact to write state files that represent where exactly that pqact process is within the LDM queue so that I don't loose data during restarts. Even in the normal case of a machine with multiple pqact processes, this would result in products being missed.

semmerson commented 9 years ago

Daryl,

The modification times of the ".state" files are a red herring because the files are written when the LDM shuts down.

Each file identifies the last data-product that was successfully processed by the pqact(1) process associated with the similarly-named configuration-file (which is why pqact(1) processes and their configuration-files need to be one-to-one). pqact(1) uses this information to start processing where it left off (assuming the identified data-product is still in the product-queue).

Consequently, you shouldn't have encountered the problem you did unless something else was wrong.

Were "processed oldest product" messages being written to the LDM log file prior to this? This would violate the assumption at the end of the second paragraph.

Regards, Steve Emmerson

On Thu, May 7, 2015 at 6:43 AM, daryl herzmann notifications@github.com wrote:

With the current release, all pqact processes appear to write the same .state file and not a .state file that is accurate to that pqact process. For example, on a machine that has 9 pqact process, the written out .state file for each at shutdown time has the same timestamp!

Here's the problem. Yesterday, my system feeds NEXRAD2 from idd.unidata.ucar.edu and other feeds from idd.ssec.wisc.edu. The NEXRAD2 feed became very latent, with a feed latency at shutdown time of ~30 minutes. I have 7 pqact processes that only look at that feedtype, so their queue time should have been 30 minutes in the past. The IDS feedtype was current, so this resulted in .state files with the current timestamp . So when I started LDM back up, the 7 pqact processes ignored the incoming new NEXRAD2 data because the pqinsert times were within that recent 30 minute window. After some cussing, I realized what had happened, stopped LDM, removed the state files and started LDM again. Of course, data was missed during this period :(

Please modify pqact to write state files that represent where exactly that pqact process is within the LDM queue so that I don't loose data during restarts. Even in the normal case of a machine with multiple pqact processes, this would result in products being missed.

— Reply to this email directly or view it on GitHub https://github.com/Unidata/LDM/issues/36.

akrherz commented 9 years ago

Hi Steve,

I am not referencing the file modification times, I am talking about the actual contents of the file.

[ldm@mtarchive-public etc]$ diff pqact-craft.conf.state pqact.conf.state 
[ldm@mtarchive-public etc]$ $ cat *.state | grep -v '#'
1430952483.480536
1430952483.480536
1430952483.480536
1430952483.480536
1430952483.480536
1430952483.480536
1430952483.480536
1430952483.480536
1430952483.480536
[ldm@mtarchive-public etc]$ ps uax | grep pqact
ldm      21448  1.3 14.5 4035580 3583072 ?     S    May06  13:38 pqact /home/ldm/etc/pqact.conf
ldm      21449  0.4 14.5 4035052 3573024 ?     S    May06   4:54 pqact -p BZIP2/K[A-D] -f CRAFT /home/ldm/etc/pqact-craft.conf
ldm      21450  0.4 13.9 4035052 3434408 ?     S    May06   4:22 pqact -p BZIP2/K[E-H] -f CRAFT /home/ldm/etc/pqact-craft2.conf
ldm      21451  0.3 12.7 4035052 3130216 ?     S    May06   3:11 pqact -p BZIP2/K[I-K] -f CRAFT /home/ldm/etc/pqact-craft3.conf
ldm      21452  0.4 14.2 4035052 3502000 ?     S    May06   4:39 pqact -p BZIP2/K[L-O] -f CRAFT /home/ldm/etc/pqact-craft4.conf
ldm      21453  0.3 12.5 4035052 3085248 ?     S    May06   3:21 pqact -p BZIP2/K[P-R] -f CRAFT /home/ldm/etc/pqact-craft5.conf
ldm      21454  0.4 14.1 4035052 3469616 ?     S    May06   4:12 pqact -p BZIP2/K[S-Z] -f CRAFT /home/ldm/etc/pqact-craft6.conf
ldm      21455  0.2  9.9 4035052 2440172 ?     S    May06   2:39 pqact -p BZIP2/[A-J] -f CRAFT /home/ldm/etc/pqact-craft7.conf
ldm      21456  0.3 12.5 4035052 3096664 ?     S    May06   3:27 pqact -p BZIP2/[L-Z] -f CRAFT /home/ldm/etc/pqact-craft8.conf

Were "processed oldest product" messages being written to the LDM log file prior to this?

no.

semmerson commented 9 years ago

That shouldn't be possible -- so an assumption must be invalid.

Would you please post the output of the following commands -- executed by the LDM user in their etc subdirectory:

ls -l *.state*
ls -l *pqact*.conf
grep -i '^exec.*pqact' ldmd.conf
akrherz commented 9 years ago

Hi Steve, every system I check has this same thing, each .state file has the same content timestamp within. Do your systems not show this?

[ldm@mtarchive-public etc]$ ls -l *.state*
-rw-rw-r-- 1 ldm ldm 183 May  6 17:48 pqact.conf.state
-rw-rw-r-- 1 ldm ldm 183 May  6 17:48 pqact-craft2.conf.state
-rw-rw-r-- 1 ldm ldm 183 May  6 17:48 pqact-craft3.conf.state
-rw-rw-r-- 1 ldm ldm 183 May  6 17:48 pqact-craft4.conf.state
-rw-rw-r-- 1 ldm ldm 183 May  6 17:48 pqact-craft5.conf.state
-rw-rw-r-- 1 ldm ldm 183 May  6 17:48 pqact-craft6.conf.state
-rw-rw-r-- 1 ldm ldm 183 May  6 17:48 pqact-craft7.conf.state
-rw-rw-r-- 1 ldm ldm 183 May  6 17:48 pqact-craft8.conf.state
-rw-rw-r-- 1 ldm ldm 183 May  6 17:48 pqact-craft.conf.state
[ldm@mtarchive-public etc]$ ls -l *pqact*.conf
lrwxrwxrwx 1 ldm ldm 33 Jan 19 14:18 pqact.conf -> ../ldmconfig/mtarchive/pqact.conf
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft2.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft3.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft4.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft5.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft6.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft7.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:20 pqact-craft8.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
lrwxrwxrwx 1 ldm ldm 34 Jan 19 14:21 pqact-craft.conf -> ../ldmconfig/mtarchive/pqact.CRAFT
[ldm@mtarchive-public etc]$ grep -i '^exec.*pqact' ldmd.conf
EXEC    "pqact /home/ldm/etc/pqact.conf"
exec    "pqact -p BZIP2/K[A-D] -f CRAFT /home/ldm/etc/pqact-craft.conf"
exec    "pqact -p BZIP2/K[E-H] -f CRAFT /home/ldm/etc/pqact-craft2.conf"
exec    "pqact -p BZIP2/K[I-K] -f CRAFT /home/ldm/etc/pqact-craft3.conf"
exec    "pqact -p BZIP2/K[L-O] -f CRAFT /home/ldm/etc/pqact-craft4.conf"
exec    "pqact -p BZIP2/K[P-R] -f CRAFT /home/ldm/etc/pqact-craft5.conf"
exec    "pqact -p BZIP2/K[S-Z] -f CRAFT /home/ldm/etc/pqact-craft6.conf"
exec    "pqact -p BZIP2/[A-J] -f CRAFT /home/ldm/etc/pqact-craft7.conf"
exec    "pqact -p BZIP2/[L-Z] -f CRAFT /home/ldm/etc/pqact-craft8.conf"
akrherz commented 9 years ago

Steve, I wonder about pqact/pqact.c, it appears current Cursor gets updated for each product LDM receives, regardless of if it was processed by pqact or not, since the status is always 0 from processProduct. That may explain why currentCursor would be the same for all pqact processes?

semmerson commented 9 years ago

You appear to be right. processProduct() always returns 0, which will cause the last data-product matching the -f feed and -p pattern options of pqact(1) to be used regardless of whether or not it matched any pqact(1) configuration-file entries.

The workaround, therefore, is to set the -f feed and -p pattern options of all pqact(1) entries in the LDM configuration-file to be consonant with the entries in the corresponding pqact(1) configuration-file.

I'll see about fixing this in the next release.

Thanks for sending this in.

akrherz commented 9 years ago

Thanks Steve, I am unsure of your workaround. Do you have any systems that do not exhibit this behavior currently? I added -p .* -f ANY to the pqact process above that currently does not have those options set, but there was no change in the .state file result at restart.

akrherz commented 9 years ago

Steve, maybe there is another bug here. I don't see how setting -p pattern on the pqact process causes it to limit the number of products exposed to its conf file pattern actions? It seems like they all get evaluated by pq_sequence() logic?

EDIT: yes, but prodInClass(clss, info) will avoid the ifMatch logic. Interesting, but the result is that every product ends up bumping the currentCursor regardless of it if matches the command line args

semmerson commented 9 years ago

I've modified processProduct() so that it will only return 0 if the data-product was actually processed. This will prevent currentCursor from being modified by data-products that merely match the -f and -p option arguments.

semmerson commented 9 years ago

I wonder if this is the right thing to do, however. Should the "last processed product" attribute be determined by the -f and -p options or by configuration-file entries?

akrherz commented 9 years ago

Steve, indeed, I am boggling this too. IMO, pqact's state file should contain the queue insert time of the last processed product that matched a pqact.conf entry. This is the point at which is should pick up processing again. If new pqact.conf entries are added that process what would have been skipped products from a previous pqact.conf entry, so be it. I think. I am still boggling pq_sequence() and returning the problematic 0 and not processProduct()

semmerson commented 9 years ago

I am still boggling pq_sequence() and returning the problematic 0 and not processProduct()

I'm afraid I don't know what that means.

akrherz commented 9 years ago

Steve, I believe the bug is located at line 672 of pqact/pqact.c, here pq_sequence() returns 0 regardless of if the product

  1. matched the regex of pqact process
  2. matched the regex of the pattern action file
  3. was successfully processed by a pqact child

any product will then bump currentCursor

akrherz commented 9 years ago

maybe the solution then is to add another cursor to keep track of last-proccessed-product and then persist that value to the .state file?

semmerson commented 9 years ago

It looks like pq_sequence() returns 0 if the data-product doesn't match the data-product class specification.

I think the solution is to move the setting of currentCusor into processProduct() so that it's only modified if the data-product is processed.

akrherz commented 9 years ago

Hi Steve, sounds to be good to me... I am been looking through the code more and it seems that thepqact state file has two purposes.

  1. tell pqact where to move to within the queue to start looking for products
  2. set the timestamp floor on any future products to process found in the queue going forward.

Perhaps an immediate workaround here is to set pqact flags -o and -t on startup. This would effectively decrement the timestamp found in the state file back to allow any new products that arrive from the upstream latent host to be processed. What do you think?

semmerson commented 9 years ago

I'd have to analyze the code to determine if that's a viable workaround -- and if I'm going to do that, then I'd rather just fix the code.

It's worth a try, however.

semmerson commented 9 years ago

Fixed in the next release.

akrherz commented 9 years ago

Hi Steve, I rolled out 6.12.10 to one LDM host today and immediately started seeing lots of logs like this:

May 20 15:58:59 mesonet-director1 journal: pqact[13363] WARN: Processed oldest product in queue: 629.666 s

I reverted back to 6.12.9 and the messages are not happening anymore. I am wondering if this is related to this issue or perhaps the bug mentioned by Gilbert in the CHANGE_LOG.

semmerson commented 9 years ago

Yeah. There seems to be a problem with the logic behind this fix. I'm working on it.

semmerson commented 9 years ago

LDM 6.12.10 has a bug. Wait for version 6.12.11, which should be out today.