SpectralSequences / sseq

The root repository for the SpectralSequences project.
Apache License 2.0

More robust saving #75

Closed JoeyBF closed 2 years ago

JoeyBF commented 2 years ago

I was having some issues on the grid due to malformed save files. I submitted the job in a queue where tasks could be preempted, and it messed up the computation when the process was killed while writing to disk. I suggest we ignore save files that have invalid headers, and overwrite them when writing to disk. create_file is only ever called after checking if a save exists, so that's a safe assumption.

dalcde commented 2 years ago

I'm not a huge fan of silently overwriting files. In the case of a wrong header, one possibility is that we are reading valid save files for other kinds of data, and we do not want to overwrite those.

I was having some issues on the grid due to malformed save files. I submitted the job in a queue where tasks could be preempted, and it messed up the computation when the process was killed while writing to disk.

I think this is very unlikely to cause malformed headers anyway? Instead it would reach EOF too early and crash.

create_file is only ever called after checking if a save exists so that's a safe assumption.

This is open to race conditions when two threads try to write to the same file.

I suppose you are encountering this in the case of Nassau, where we write partial progress as we do the computation, so it is prone to having partially written quasi-inverses. To resolve this particular case, I think one reasonable option is to say "if the qi file exists but the differential file does not, delete the qi file".

JoeyBF commented 2 years ago

I think it's not very likely that the file would have correct data but of the wrong type, since they're not even supposed to live in the same directory. That would suggest to me that something went terribly wrong somewhere else, so it might be a good idea to panic.

I don't preempt the Nassau resolution because it causes too many incomplete save files; I'm actually encountering this with secondary. It's true that the headers are not the problem per se, but validate_header still returns an UnexpectedEof, which currently causes a panic because of an unwrap. Maybe we should catch only those errors and panic on the others? I would also like the message to at least contain the filename so I know what to focus on.

I would like to avoid crashes because it requires some manual intervention and it also wastes the ongoing work of all other threads, which can be substantial.

dalcde commented 2 years ago

Are the broken files in fact empty? It seems reasonable to overwrite in this case. There is a race condition where another thread just opened the file and is starting to write to it, but I think if you are deploying incorrectly then it is unlikely that you can avoid all crashes this way.

Good point about other threads.

JoeyBF commented 2 years ago

I didn't check if they were empty actually, I just deleted them and restarted the computation. I think it's fair to assume that disk I/O is thread-safe if we partition the work correctly, e.g. by distributing by homological degree.

dalcde commented 2 years ago

The header is fairly short, so I would be very surprised if you somehow ended up with half a header.

JoeyBF commented 2 years ago

Then you're probably right, they should be empty. So should we only catch UnexpectedEof then?

Another downside to crashing is that it's hard to distinguish between computations finishing and crashing. I could look at the logs, but there would be 100+ of them for large stems, and Slurm job arrays don't return the exit code.

dalcde commented 2 years ago

I think the correct workflow should be:

- open_file: if the file is empty, return None. Only do this for uncompressed files.

- create_file: if create_new fails, check whether the file is empty. If so, run again with create.

In the first case, BufRead has an implementation of has_data_left here: https://github.com/rust-lang/rust/pull/85815/files. In the second case you probably want to read the file metadata directly.

dalcde commented 2 years ago

Or perhaps if open_file finds the file to be empty, delete it directly. Then create_file will not have to check. I think we should print a log to stderr if this happens.

JoeyBF commented 2 years ago

Thanks! I was trying to modify SaveFile::open_file but was having problems with zstd. I see that just modifying save::open_file solves the issue.