Unidata / LDM

The Unidata Local Data Manager (LDM) system includes network client and server programs designed for event-driven data distribution, and is the fundamental component of the Unidata Internet Data Distribution (IDD) system.
http://www.unidata.ucar.edu/software/ldm

6.13.16 scour attempts to `lstat` a file deleted via another thread, race condition #103

Open akrherz opened 2 years ago

akrherz commented 2 years ago

I am using LDM 6.13.16 on CentOS Stream 8 (64-bit). Since upgrading to this release, I sometimes get errors like the following from ldmadmin scour:

20211028T090217.262441Z scour[118662] scour.c:scourFilesAndDirs:291 ERROR lstat("/data/gempak/nexrad/NIDS/LVX/N3K/N3K_20211026_0045") failed: No such file or directory

Out of thousands of files deleted, I only see one or two errors reported on some days, and none on others. I know that you recently updated scour to use C code and not Perl; perhaps there is a threading race condition in how files are deleted?

The /data path is NFS mounted, so perhaps there are troubles there. I verified that I am only running one scour process from cron, and this is my scour.conf:

/mesonet/data/gempak/model      1
/data/gempak/model              10
/data/gempak/nexrad             2
/data/gempak                    8
/data/rcm                       7
/data/text                      14

Thanks.

mustbei commented 2 years ago

Hi Daryl,

This does indeed look like a race condition. Your scour.conf has overlapping directory entries (lines 3 and 4 in this case). The scour program launches a thread for each line. Therefore, by the time one thread reads a file under one directory, that file may already have been seen and deleted by the other thread, leaving the first thread to fail and log the ERROR above. Note that the ages for these two directory entries are 2 and 8 days. Therefore, the missing file under /data/gempak/nexrad must have been age 8 or older. One way of preventing this rare case from happening is to lock the resource.
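
A minimal sketch of how a scourer could tolerate this race, assuming it is acceptable to treat a file that vanished between the directory read and the `lstat` call as already scoured. The helper name `lstat_tolerant` and the message wording are hypothetical and are not taken from scour.c:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of scour.c.
 * Returns 0 if `path` was stat'ed, 1 if it no longer exists (for example,
 * because a concurrent scourer removed it), and -1 on any other error.
 */
static int lstat_tolerant(const char *path, struct stat *sb)
{
    if (lstat(path, sb) == 0)
        return 0;

    if (errno == ENOENT) {
        /* Benign when directory entries overlap: the file is already gone. */
        fprintf(stderr, "INFO  %s vanished before lstat(); skipping\n", path);
        return 1;
    }

    /* Anything else (EACCES, EIO, ...) is still reported as an error. */
    fprintf(stderr, "ERROR lstat(\"%s\") failed: %s\n", path, strerror(errno));
    return -1;
}
```

A caller would skip the file on a return value of 1 and keep its existing error handling for -1, so only the "No such file or directory" case is downgraded.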

Best regards, --Mustapha

akrherz commented 2 years ago

Greetings, thanks for the response. The age of the file is less than 8 days; you can see that from the timestamp in the filename. So yes, the lstat would perhaps be attempting to look up a file that was deleted by the other thread.

sebenste commented 2 years ago

I have seen this occur in the "old" way of scouring as well in LDM 6.13.10 and earlier, but it's now a moot point.

akrherz commented 2 years ago

Perhaps a command line switch could be offered to disable threaded scouring? Or maybe this particular error could be sent to a lower priority log level?

mustbei commented 2 years ago

The new scour program spawns as many threads as there are directory entries in scour.conf. Therefore, to make it single-threaded without a code change, it suffices to provide one directory entry at a time, which ensures non-concurrency. It is also possible to enforce sequential processing with a minor code change and a switch, if warranted. Setting this error to a lower-priority log level is also possible and requires only a minimal code change.

semmerson commented 2 years ago

@akrherz Or one could modify their scour(1) configuration-file to avoid overlapping entries.
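
For illustration, a small sketch of how overlapping entries could be spotted mechanically. It assumes the directory paths are absolute, normalized, and written without a trailing slash; `path_is_under` and `report_overlaps` are hypothetical names and are not part of LDM:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical check, not part of LDM: true if `child` is `parent` itself
 * or lies beneath it. The comparison is purely textual.
 */
static bool path_is_under(const char *parent, const char *child)
{
    size_t n = strlen(parent);

    if (strncmp(parent, child, n) != 0)
        return false;

    /* Require a component boundary so /data/gempak does not match /data/gempak2. */
    return child[n] == '\0' || child[n] == '/';
}

/* Prints every (containing, contained) pair and returns how many were found. */
static int report_overlaps(const char *const dirs[], size_t count)
{
    int found = 0;

    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; j < count; j++) {
            if (i != j && path_is_under(dirs[i], dirs[j])) {
                printf("%s contains %s\n", dirs[i], dirs[j]);
                found++;
            }
        }
    }
    return found;
}
```

Run over the six directories in the scour.conf posted above, this reports two pairs: /data/gempak contains both /data/gempak/model and /data/gempak/nexrad.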

akrherz commented 2 years ago

> @akrherz Or one could modify their scour(1) configuration-file to avoid overlapping entries.

Agreed, but that is brittle, as I may add a new folder and forget to add a custom entry for it, and very annoying, as I would have to add one entry for each sub-folder. Additionally, overlapping entries make total sense in my mind.

I have a blanket policy for anything in /data/gempak being at most 8 days old and then anything in /data/gempak/nexrad being at most 2 days old.
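
To make the intended policy concrete, here is a minimal sketch of a "most specific entry wins" lookup. This is an assumption about the desired semantics, not a description of how scour currently treats overlapping entries; `struct retention` and `retention_days` are hypothetical names:

```c
#include <stddef.h>
#include <string.h>

/* One scour.conf-style entry: a directory and a maximum age in days. */
struct retention {
    const char *dir;
    int days;
};

/*
 * Hypothetical lookup, not part of LDM: returns the age limit of the longest
 * directory entry that contains `path`, or -1 if no entry covers it.
 * Paths are assumed absolute, normalized, and without trailing slashes.
 */
static int retention_days(const struct retention *rules, size_t count,
                          const char *path)
{
    size_t best_len = 0;
    int days = -1;

    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(rules[i].dir);

        /* Match whole path components only, and prefer the longest match. */
        if (strncmp(rules[i].dir, path, n) == 0 &&
            (path[n] == '\0' || path[n] == '/') && n > best_len) {
            best_len = n;
            days = rules[i].days;
        }
    }
    return days;
}
```

With the entries from the scour.conf posted above, a file under /data/gempak/nexrad resolves to 2 days, one under /data/gempak/model to 10, and anything else under /data/gempak to the blanket 8.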