ankitects / anki

Anki's shared backend and web components, and the Qt frontend
https://apps.ankiweb.net
Other
18.36k stars 2.1k forks source link

flaky tests: state_application, unicode_normalization #3353

Open martinetd opened 1 month ago

martinetd commented 1 month ago

Hello,

We run tests as part of the nixos build and it would appear some tests are flaky:

---- scheduler::answering::test::state_application stdout ----
thread 'scheduler::answering::test::state_application' panicked at rslib/src/scheduler/answering/mod.rs:593:9:
assertion `left == right` failed
  left: Normal(Learning(LearnState { remaining_steps: 2, scheduled_secs: 60, elapsed_secs: 0, memory_state: None }))
 right: Normal(Learning(LearnState { remaining_steps: 2, scheduled_secs: 60, elapsed_secs: 1, memory_state: None }))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    scheduler::answering::test::state_application
---- media::check::test::unicode_normalization stdout ----
thread 'media::check::test::unicode_normalization' panicked at rslib/src/media/check.rs:723:13:
assertion `left == right` failed
  left: MediaCheckOutput { unused: ["ぱぱ.jpg"], missing: ["foo[.jpg", "normal.jpg"], missing_media_notes: [NoteId(1581236386334)], renamed: {"は\u{309a}は\u{309a}.jp>
 right: MediaCheckOutput { unused: [], missing: ["foo[.jpg", "normal.jpg"], missing_media_notes: [NoteId(1581236386334)], renamed: {"は\u{309a}は\u{309a}.jpg": "ぱぱ.>
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    media::check::test::unicode_normalization

I could reproduce the state_application one, running it in a tight loop (grabbing the exact test command from cargo test -vv to avoid the cargo overhead) reproduces within 10-60s on my laptop (~1/250 - 1/2000 runs)

That can be worked around by masking the elapsed_time explicitly but it's not great, looking at the code self.set_elapsed_secs_equal might interfere with the test if it runs right at a second boundary?

Unfortunately I couldn't reproduce the second one through sheer running it in a loop, either other tests interfere with some global state? or it could be an environment problem, I'm not sure. I also had a quick look and ぱぱ.jpg apparently shouldn't have been unused but I haven't gone far enough to figure what makes things unused and why it would be the case here...

dae commented 1 month ago

The second one is perhaps a race condition between writing the file, and the FS metadata being visible. ext4? We'll probably need to add a small delay, or force a flush to disk.

martinetd commented 1 month ago

File system developer hat on, a file system must make metadata changes visible immediately after the IO returned, even if it's not flushed to disk, or that's a bug in the FS. I'm not aware of such bugs recently so I'd be surprised we ran into one... But that is a valuable hint; if unused is based on file timestamp there are some filesystems with a second granularitywhere it could be easy to imagine that a timestamp didn't change during the duration of the test...

Unfortunately I just tried on one (after fiddling a bit, ext3 works with utf8 and has second granularity), and I can't reproduce with that either; I've checked with strace that the timestamp returned by stat() doesn't change during the duration of the test and the test still passes

$ strace -f -e %filesystem -v ... 2>&1 | grep '\\343'
[pid 3188683] openat(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\257\343\202\232\343\201\257\343\202\232.jpg", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 8
[pid 3188683] statx(10, "\343\201\257\343\202\232\343\201\257\343\202\232.jpg", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_blksize=1024, stx_attributes=0, stx_nlink=1, stx_uid=0, stx_gid=0, stx_mode=S_IFREG|0644, stx_ino=16392, stx_size=12, stx_blocks=4, stx_attributes_mask=STATX_ATTR_COMPRESSED|STATX_ATTR_IMMUTABLE|STATX_ATTR_APPEND|STATX_ATTR_NODUMP|STATX_ATTR_ENCRYPTED|STATX_ATTR_AUTOMOUNT|STATX_ATTR_MOUNT_ROOT|STATX_ATTR_VERITY|STATX_ATTR_DAX, stx_atime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_ctime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_mtime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_rdev_major=0, stx_rdev_minor=0, stx_dev_major=7, stx_dev_minor=1, stx_mnt_id=0x6fa}) = 0
[pid 3188683] openat(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\257\343\202\232\343\201\257\343\202\232.jpg", O_RDONLY|O_CLOEXEC) = 11
[pid 3188683] openat(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\261\343\201\261.jpg", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 3188683] openat(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\261\343\201\261.jpg", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 11
[pid 3188683] statx(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\261\343\201\261.jpg", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_blksize=1024, stx_attributes=0, stx_nlink=1, stx_uid=0, stx_gid=0, stx_mode=S_IFREG|0644, stx_ino=16393, stx_size=12, stx_blocks=4, stx_attributes_mask=STATX_ATTR_COMPRESSED|STATX_ATTR_IMMUTABLE|STATX_ATTR_APPEND|STATX_ATTR_NODUMP|STATX_ATTR_ENCRYPTED|STATX_ATTR_AUTOMOUNT|STATX_ATTR_MOUNT_ROOT|STATX_ATTR_VERITY|STATX_ATTR_DAX, stx_atime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_ctime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_mtime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_rdev_major=0, stx_rdev_minor=0, stx_dev_major=7, stx_dev_minor=1, stx_mnt_id=0x6fa}) = 0
[pid 3188683] unlink("/tmp/.tmpoK4TgP/media/\343\201\257\343\202\232\343\201\257\343\202\232.jpg") = 0
[pid 3188683] statx(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\257\343\202\232\343\201\257\343\202\232.jpg", AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7f8a79bfd210) = -1 ENOENT (No such file or directory)
[pid 3188683] statx(AT_FDCWD, "/tmp/.tmpoK4TgP/media/\343\201\261\343\201\261.jpg", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_blksize=1024, stx_attributes=0, stx_nlink=1, stx_uid=0, stx_gid=0, stx_mode=S_IFREG|0644, stx_ino=16393, stx_size=12, stx_blocks=4, stx_attributes_mask=STATX_ATTR_COMPRESSED|STATX_ATTR_IMMUTABLE|STATX_ATTR_APPEND|STATX_ATTR_NODUMP|STATX_ATTR_ENCRYPTED|STATX_ATTR_AUTOMOUNT|STATX_ATTR_MOUNT_ROOT|STATX_ATTR_VERITY|STATX_ATTR_DAX, stx_atime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_ctime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_mtime={tv_sec=1723362773, tv_nsec=0} /* 2024-08-11T16:52:53+0900 */, stx_rdev_major=0, stx_rdev_minor=0, stx_dev_major=7, stx_dev_minor=1, stx_mnt_id=0x6fa}) = 0
[pid 3188683] unlinkat(4, "\343\201\261\343\201\261.jpg", 0) = 0

Is there another timestamp in the media.db ?

martinetd commented 1 month ago

rslib/src/collection/timestamps.rs apparently inserts times in milliseconds in media.db, so on the very powerful builders I could see the tests running in <1ms -- adding a 1ms delay somewhere could work around the issue if using a more precise timestamp is more work.