ECP-VeloC / AXL

Asynchronous Transfer Library
MIT License
2 stars 8 forks source link

Stress testing resumable transfers #83

Open tonyhutter opened 3 years ago

tonyhutter commented 3 years ago

I've been heavily stress testing axl_cp resumable transfers and (as @adammoody rightly predicted) there are bugs...

Test setup Create 100 random (0-90MB) files per CPU. For each CPU, spawn off a axl_cp -X pthreads <100 files> . killall -9 axl_cp. Resume the transfers and wait for them to finish.

Here's some of the failures I've seen:

  1. Corrupt state_file?

    KVTree 1.0.0 ABORT: butte21: Failed to persist hash wrote 9121 bytes != expected 9127 @ /g/g0/hutter2/KVTree/src/kvtree.c:1132

    (I've only see this error once)

  2. Can't read state_file

    AXL_Create() failed (error -1)
    AXL 0.3.0 ERROR: butte21: Couldn't read state file correctly @ axl_alloc_id /g/g0/hutter2/AXL/src/axl.c:89

    The state_file defiantly existed since I check it with access() first. Later on I added a mutex in axl_write_state_file() thinking that would help, but still got the error. This error is semi-rare (happens roughly every 4-5 times I run my test).

  3. Files are missing

I often notice that files are missing on the destination side after the resume completes. When I look at the state_file after the killall -9 axl_cp, but before before the resume, I see that it may not have all 100 of the files it should be transferring (it may have like 60 or something). This makes me think it's getting killed in the middle of AXL_Add'ing all its files. The solution here would be to always AXL_Add() all your files before doing a resume. That way it would transfer the missing ones along with resuming the existing ones.

It seems to me that for resumes to truly work, we need either:

or

tonyhutter commented 3 years ago

I think not using the state_file for resumes may be the better approach. Consider that even with the state_file, you would still have to re-AXL_Add() all the old files before doing the AXL_Resume() (to protect against crashes while initially adding the files). So if we have to add the files on a resume anyway, and since we can derive all the information to resume from the partially transferred files themselves, we don't need the state_file.

adammoody commented 3 years ago

I've been heavily stress testing axl_cp resumable transfers and (as @adammoody rightly predicted) there are bugs...

Test setup Create 100 random (0-90MB) files per CPU. For each CPU, spawn off a axl_cp -X pthreads <100 files> . killall -9 axl_cp. Resume the transfers and wait for them to finish.

Here's some of the failures I've seen:

1. Corrupt state_file?
KVTree 1.0.0 ABORT: butte21: Failed to persist hash wrote 9121 bytes != expected 9127 @ /g/g0/hutter2/KVTree/src/kvtree.c:1132

(I've only see this error once)

1. Can't read state_file
AXL_Create() failed (error -1)
AXL 0.3.0 ERROR: butte21: Couldn't read state file correctly @ axl_alloc_id /g/g0/hutter2/AXL/src/axl.c:89

Nice work, @tonyhutter ! Regardless of what we decide for AXL state files, you've exposed vulnerabilities in kvtree that we should fix up. SCR uses kvtree files in lots of other places that we need to work. One thing that comes to mind is that we could modify kvtree to write to a temp file and then call rename once it is complete instead of truncating and overwriting an existing file like it does now. That should help minimize the chances of reading back corrupt kvtree files. We'll instead get old kvtree files in that case, but at least it should be valid.

Would you please also open an issue on kvtree and refer back to this?

It'd be nice to include a test like this for kvtree.

tonyhutter commented 3 years ago

@adammoody I've opened https://github.com/ECP-VeloC/KVTree/issues/40 for the KVTree issues.

tonyhutter commented 3 years ago

I think I'm going to proceed with the "resumable transfers with no state_file" option then, since I think it will be less prone to error and easier to implement.