eclipse / paho.golang

Go libraries
Other
330 stars 92 forks source link

TestFileStore fails if files are put on disk too fast #205

Closed XANi closed 8 months ago

XANi commented 9 months ago

tested on c8bbcb21d1c36a2ba5543f38f45f640edd211858

Test fails. Test succeeds when I add a milisecond delay to write function which makes me think OS timestamp size doesn't have enough resolution for

    sort.Slice(ids, func(i, j int) bool {
        return ids[i].modTime.Before(ids[j].modTime)
    })

to be reliable way to keep it in order

Describe the bug A clear and concise description of what the bug is.

--- FAIL: TestFileStore (0.00s)
    store_test.go:73: List returned [2 32300 5890 65535], expected [65535 2 32300 5890]

Adding a milisecond delay after write causes this to not show up It was on XFS filesystem running on NVMe drive.

EDIT: same thing here

=== RUN   TestFileQueue
    queue_test.go:80: expected "Queue entry 0 for testing", got "Queue entry 7 for testing"

adding delay also helps

Debug output

I've redirected test files to /tmp/asd for debug:

-> ᛯ ls -la /tmp/asd
total 24
drwxr-xr-x  2 xani xani   81 Nov 19 16:59 .
drwxrwxrwt 26 root root 4096 Nov 19 17:00 ..
-rw-------  1 xani xani    8 Nov 19 16:59 foo2.ext
-rw-------  1 xani xani   12 Nov 19 16:59 foo32300.ext
-rw-------  1 xani xani   11 Nov 19 16:59 foo5890.ext
-rw-------  1 xani xani   12 Nov 19 16:59 foo65535.ext
-> ᛯ stat /tmp/asd/foo2.ext    
  File: /tmp/asd/foo2.ext
  Size: 8           Blocks: 8          IO Block: 4096   regular file
Device: 254,2   Inode: 100675564   Links: 1
Access: (0600/-rw-------)  Uid: ( 1000/    xani)   Gid: ( 1000/    xani)
Access: 2023-11-19 16:59:03.773243442 +0100
Modify: 2023-11-19 16:59:03.773243442 +0100
Change: 2023-11-19 16:59:03.773243442 +0100
 Birth: 2023-11-19 16:59:03.773243442 +0100
[17:00:27] ^ [~/src/os/paho.golang] {master *|u=}
-> ᛯ stat /tmp/asd/foo32300.ext
  File: /tmp/asd/foo32300.ext
  Size: 12          Blocks: 8          IO Block: 4096   regular file
Device: 254,2   Inode: 100675972   Links: 1
Access: (0600/-rw-------)  Uid: ( 1000/    xani)   Gid: ( 1000/    xani)
Access: 2023-11-19 16:59:03.773243442 +0100
Modify: 2023-11-19 16:59:03.773243442 +0100
Change: 2023-11-19 16:59:03.773243442 +0100
 Birth: 2023-11-19 16:59:03.773243442 +0100

Timestamps are "same" on filesystem which causes wrong sorting.

My assumption here is that because there is no Sync() all of the writes are flushed at once hence same timestamp. This patch below "fixes" it but I'd wager bug would resurface if ever this code is used on filesystem that have lower timestamp accuracy so I'm hesitant to make a pull request calling it "fixed"

diff --git a/paho/store/file/store.go b/paho/store/file/store.go
index ed82547..c740a7a 100644
--- a/paho/store/file/store.go
+++ b/paho/store/file/store.go
@@ -86,6 +86,7 @@ func (s *Store) Put(packetID uint16, packetType byte, w io.WriterTo) error {
        _ = os.Remove(tmpFn)
        return fmt.Errorf("failed to write packet to temp file: %w", err)
    }
+   f.Sync()
    if err = f.Close(); err != nil {
        _ = os.Remove(tmpFn)
        return fmt.Errorf("failed to close temp file: %w", err)
MattBrittan commented 9 months ago

The file system resolution limitation is noted at the top of store.go:

// NOTE: Order is maintained using file ModTime, so there may be issues if the interval between messages is less than
// the file system ModTime resolution.

While there are ways to solve this (i.e. use the time as the filename or similar) this would complicate things (and I'm not sure that this is a real issue).

My assumption here is that because there is no Sync() all of the writes are flushed at once hence same timestamp. This patch below "fixes" it but I'd wager bug would resurface if ever this code is used on filesystem...

It's difficult to know if the sync itself is resolving the issue of if it's the delay introduced by calling f.Sync() that is hiding the problem. Having said that calling Sync probably does make sense as it should improve resilience somewhat (at the cost of performance).

One further improvement we could make here is to sort by timestamp and then message ID. Order is only important on outgoing messages and the message ID is increased with each message so, in most cases, ordering by ID would be correct (I believe this is what the c client does). The issue with ordering solely by message ID is that the max ID is 65535 and this will then rollover and start from 1 again (potentially an oversight in the C client).

XANi commented 9 months ago

While there are ways to solve this (i.e. use the time as the filename or similar) this would complicate things (and I'm not sure that this is a real issue).

I've filed it as "the test is failing" and not "the file store code is wrong" for that reason, I wasn't really sure if it was worth fixing it as in reality there is very little chance it will actually happen and it is just test that is failing.

It's difficult to know if the sync itself is resolving the issue of if it's the delay introduced by calling f.Sync() that is hiding the problem. Having said that calling Sync probably does make sense as it should improve resilience somewhat (at the cost of performance).

Oh, no, it's very simple, just replace t.TempDir() with different path and look at files. With fsync (TestFileQueue):

-> ᛯ for a in `find /tmp/z -type f` ; do stat $a |grep -P 'File|Modify' ; done                                                                         2
  File: /tmp/z/queueTest-393584234.que
Modify: 2023-11-19 23:10:10.511411195 +0100
  File: /tmp/z/queueTest-3221526884.que
Modify: 2023-11-19 23:10:10.515411252 +0100
  File: /tmp/z/queueTest-1687121810.que
Modify: 2023-11-19 23:10:10.519411309 +0100

without:

-> ᛯ for a in `find /tmp/z -type f` ; do stat $a |grep -P 'File|Modify' ; done                                                                         1
  File: /tmp/z/queueTest-2609354175.que
Modify: 2023-11-19 23:13:54.490602089 +0100
  File: /tmp/z/queueTest-2566205542.que
Modify: 2023-11-19 23:13:54.490602089 +0100
  File: /tmp/z/queueTest-4066513764.que
Modify: 2023-11-19 23:13:54.490602089 +0100
  File: /tmp/z/queueTest-3133974357.que
Modify: 2023-11-19 23:13:54.490602089 +0100
  File: /tmp/z/queueTest-2867780127.que
Modify: 2023-11-19 23:13:54.490602089 +0100
  File: /tmp/z/queueTest-2669831688.que
Modify: 2023-11-19 23:13:54.490602089 +0100
  File: /tmp/z/queueTest-2554827075.que
Modify: 2023-11-19 23:13:54.490602089 +0100
  File: /tmp/z/queueTest-420084505.que
Modify: 2023-11-19 23:13:54.490602089 +0100
  File: /tmp/z/queueTest-867296741.que
Modify: 2023-11-19 23:13:54.490602089 +0100

what's more, I can do

@@ -49,7 +49,7 @@ func TestFileQueue(t *testing.T) {
                if err := q.Enqueue(bytes.NewReader([]byte(fmt.Sprintf(entryFormat, i)))); err != nil {
                        t.Fatalf("error adding entry %d: %s", i, err)
                }
-               time.Sleep(time.Nanosecond) // Short delay due to file system time resolution
+               time.Sleep(time.Millisecond) // Short delay due to file system time resolution
        }
        // Remove the initial "This is a test" entry
        if entry, err := q.Peek(); err != nil {

and here is the result:

 for a in `find /tmp/z -type f` ; do stat $a |grep -P 'Modify' ; done 
Modify: 2023-11-19 23:14:43.471299887 +0100
Modify: 2023-11-19 23:14:43.471299887 +0100
Modify: 2023-11-19 23:14:43.475299944 +0100
Modify: 2023-11-19 23:14:43.475299944 +0100
Modify: 2023-11-19 23:14:43.475299944 +0100
Modify: 2023-11-19 23:14:43.479300001 +0100
Modify: 2023-11-19 23:14:43.479300001 +0100
Modify: 2023-11-19 23:14:43.479300001 +0100
Modify: 2023-11-19 23:14:43.479300001 +0100

as you can see even with 1ms latency between calls, the writes still get batched with same timestamp.

I'll make PR with Sync change later; it's probably quicker than having timestamp in filename

MattBrittan commented 8 months ago

Thanks very much for logging this (and your analysis!); as I'm tidying up prior to the next release I have gone ahead and made the change.