Closed acostach closed 2 years ago
@acostach As I understand it rename
is not atomic on FAT file systems in the face of power failure (I'm using this thread as a reference). If the algorithms that Balena uses assume atomicity of rename
, corruption will continue to happen. I think that some sort of user-space journaling would be needed to really prevent this issue. I think something along the lines of:
on boot:
if exists("config.json.<hash>"):
if hash("config.json.<hash>") != hash:
rm("config.json.<hash>") # rollback transaction
else:
copy("config.json.<hash>", "config.json.tmp")
rename("config.json.tmp", "config.json")
fsync()
if hash("config.json") == hash:
rm("config.json.<hash>")
def modify(new_content):
write(format!("config.json.{}", hash(new_content)), new_content)
fsync()
copy("config.json.<hash>", "config.json.tmp")
rename("config.json.tmp", "config.json")
fsync()
if hash("config.json") == hash:
rm("config.json.<hash>")
might be needed to correctly modify a file on a FAT16 file system.
cc @alexgg @mtoman @markcorbinuk
[nazrhom] This issue has attached support thread https://jel.ly.fish/2d88fb06-4650-497d-b99c-300e09267d46
If there is no hard requirement to use FAT16 as the boot partition fs, I think using ext3 or ext4 should make rename
actually have the semantics that we want. It seems that uboot does support ext* filesystems, so I don't see any obvious reasons why FAT16 would be required, but I may be missing something.
This blog post has some info about the semantics of different linux operating systems: https://blogs.gnome.org/alexl/2009/03/16/ext4-vs-fsync-my-take/
@alexgg, thanks for starting work on this! I saw that you started the fat-rename branch, and since journalling is always a huge pain I decided to give it a read. Sorry if I jumped the gun a little on this, but I find this stuff fun :). It looks pretty good, but I have a few concerns:
All these notes are in relation to the os-helpers-fs file.
It looks like L196 assumes that there will only ever be one md5sum file in place. We can't make this assumption because there might be multiple crashes before we get a chance to clean up the extra files. We need to loop over all the md5sum files (preferably, but not crucially in timestamp order) to be correct (just deleting them to roll back won't work because if a md5sum file exists, we might have already corrupted the config.json).
L203 looks wrong to me. Renames are not atomic on FAT, so
if the power dies in the middle of the mv
command we can easily
end up in a state where both the target file and source file have
only partially been copied over. The solution to this is to use
a tmp file between the backup file and the target file like I did
in the psudocode that I posted.
There is an extra sync on L239. If the backup file is left around or currupted, the on-boot journaling process (verify_integrity) should just fix things up for us so there is no need to be careful here.
Hi @ethanpailes, thanks for taking the time to review this. As you can guess the branch is just a proof-of-concept which I worked on to prepare for the internal discussions around this topic. We will develop a centralized journalled write, but it may not be in as done in the branch above.
Anyway, it is good to discuss about the implementation. I was trying to find the right balance between complexity and functionality and I did make some assumptions that I think hold true but will need to be verified.
One assumption is that only the files that are written by the OS need to be protected, and according to our analysis only config.json
is currently being modified. Extending the mechanism to cover all possible accesses seems over-engineering as in the long run we plan to move content that needs to be written to out of the boot partition and leave all the contents in the boot partition as read-only.
Another assumption has to do with the multiple crashes theory. The OS will always read and integrity check/recover any file before any process writes to it. This holds true for config.json
which was the focus of the branch above, so we will not get multiple crashes before recovering.
I think you are right about the mv
in https://github.com/balena-os/meta-balena/blob/66c382468ebbb23c1abd6d3db04eecb1e2340c1a/meta-balena-common/recipes-support/os-helpers/os-helpers/os-helpers-fs#L203, I will need to revisit it.
Glad to keep discussing the details.
@alexgg Oh I see what you mean. So long as the recovery process is run to completion before any of the rest of the application gets to run or try to modify the config.json, we should be safe. :+1: That's a little subtle, so it might be worth adding a note about that invariant to the comment on integrity_check
.
Glad to keep discussing the details.
Awesome! I have a bit of experience writing transactional systems, so hopefully I can be a useful resource.
@majorz This is tracking what we discussed yesterday
[alexgg] This issue has attached support thread https://jel.ly.fish/7a4ac553-fe30-4ae5-aba2-40f32cfb7611
[acostach] This issue has attached support thread https://jel.ly.fish/094dc5d9-6e39-40fa-8e70-4bbe69f49cab
From v2.101.0 onwards the OS using the fatrw
CLI application to perform safe IO to vfat partitions.
config.json is modified by multiple services and a power cut can render this file corrupt, although atomic operations are being used.
List of services and details available in this OS thread.