ECP-VeloC / AXL

Asynchronous Transfer Library
MIT License
2 stars 8 forks source link

Add resumeable transfers, update state_file #77

Closed tonyhutter closed 3 years ago

tonyhutter commented 4 years ago
tonyhutter commented 3 years ago

This is still WIP while I'm testing the BB API transfer resume.

tonyhutter commented 3 years ago

I've substantially revamped this code, including adding BB API test cases. It should be ready to review. I'm removing the WIP tag.

adammoody commented 3 years ago

Thanks, @tonyhutter . I've taken a first pass. I feel like this is complicated enough I'll want another pass or two as we work through it.

tonyhutter commented 3 years ago

@adammoody thanks for taking a first look. If I didn't comment on something specifically, then assume I'll fix it in my next push.

tonyhutter commented 3 years ago

@adammoody I think I addressed all your comments in my latest push.

adammoody commented 3 years ago

Thanks, @tonyhutter . After those, I think we're on to the second phase of the resume review which is to consider failures at different points. I was first thinking we would do that now, but it could be a lot of work, so maybe we could merge this (and the delete crc PR) then consider it in a third PR. It's possible it may drive us to change how we have defined the AXL interface, though, so we should do it sooner rather than later if we go ahead and merge now.

Any snags are likely to show up after AXL completes a phase but the process dies just before AXL updates its state file to record the state change. That means the resume will start up in an old state. We need to think through each case in turn, meaing every AXL state x AXL transfer combination to make sure they all behave properly.

As an example, consider what happens if a process dies right here after it has renamed files but just before it marks the transfer as being complete: https://github.com/ECP-VeloC/AXL/blob/d738fb7a973367f57d9480cba9b2b63717920374/src/axl.c#L864

A resumed pthread transfer would start up and see the transfer as DISPATCHED (not COMPLETED). It also will not see the temp files, because they were in fact renamed. So I think it will start over from scratch. That's maybe inefficient, but I believe that might still work. It will just end up writing all of the files as temp files again and then renaming again to clobber the final files. I haven't looked through all of the code to make sure. But that illustrates the kind of analysis we need to do to make sure resume works.

adammoody commented 3 years ago

Consider things if we had already implemented the delete-after-transfer feature: https://github.com/ECP-VeloC/AXL/issues/74

If we deleted the source files before writing COMPLETED to the state file, then the existing pthread resume would definitely no longer work. It would try to restart, but it would be missing its source files. We might need to add a new internal state to capture that distinction. The interactions between everything get complicated and sort of a puzzle we need to think through.

adammoody commented 3 years ago

We are also going to want a way for an application to remove a state file. Right now, only AXL deletes that file and only during AXL_Free(id). That means a restarted application must be able to call AXL_Create in order to call AXL_Free, but there are several new ways AXL_Create might fail. For example the file might be corrupt, or the file might be valid, but the app might have lost track of the xfer type and/or name that was used in the original create call.

For now, we could say that it's the caller's responsibility to unlink the state file in those situations.

I keep wondering if we'll eventually need something akin to O_CREAT | O_EXCL | O_TRUNC in the create call, or perhaps define a new AXL_Open call that opens an existing state file without having to know what the xfer/state values were in advance. I think this will become apparent as we work through more failure situations.

tonyhutter commented 3 years ago

@adammoody thanks for taking another look. My latest push addresses the new fixes you suggested.

As far a failure cases go, I agree that we should document any ways we can think of where resuming the transfer can go wrong. Then we can write test cases to exercise those scenarios to prove to ourselves that we have those bases covered. I'm glad you brought up delete-after-transfer, because that is a place that could bite us if we're not being careful.

adammoody commented 3 years ago

@adammoody thanks for taking another look. My latest push addresses the new fixes you suggested.

As far a failure cases go, I agree that we should document any ways we can think of where resuming the transfer can go wrong. Then we can write test cases to exercise those scenarios to prove to ourselves that we have those bases covered. I'm glad you brought up delete-after-transfer, because that is a place that could bite us if we're not being careful.

Yes, documentation and test cases will be great. To get there, we need to work through the mental exercise to examine every single line of AXL code one by one, including every iteration of each for/while loop, to consider what will happen during a resume (or cancel) if the process had died just before or just after that line of code executes. The vast majority of lines won't make a difference, but some lines will be critical, and it's hard to know which ones are important without careful consideration of every line one by one. Easy. Just saying that we're getting into the "fun" part of fault tolerance now.

tonyhutter commented 3 years ago

Latest round of changes are included in my latest push.

We should also consider adding a style checker into Travis.

tonyhutter commented 3 years ago

@adammoody I just pushed the latest round of fixes

adammoody commented 3 years ago

Latest round of changes are included in my latest push.

We should also consider adding a style checker into Travis.

Yeah, a style checker is a good idea. We could apply a consistent style across all of the ecp-veloc components.

I like tools like astyle/clang-format since they can help implement the changes. We could define an astyle or clang-format file to include in the components. I've seen some good clang-format definitions lately, and here is an example astyle file from mpifileutils, e.g.,

https://github.com/hpc/mpifileutils/blob/master/astyle.options

adammoody commented 3 years ago

@tonyhutter or @CamStan , any idea why the travis job is stuck pending here?

tonyhutter commented 3 years ago

I just re-pushed to kick off a new Travis test. Hopefully that will work

CamStan commented 3 years ago

This was happening with KVTree yesterday too, but a re-push fixed that one.

I ended up messing with some access things, that might have messed it up since some of our components are hosted on travis-ci.com and some are still on the older one, travis-ci.org. I sent another access request that will need to be approved, so hopefully that will get it working again.

adammoody commented 3 years ago

@tonyhutter , can you force a re-build on travis? I just clicked a button to enable things from an email that @CamStan triggered.

adammoody commented 3 years ago

Hmm, same outcome. We could try a bigger hammer with an uninstall / reinstall the travis webhook. I do see the payload URL is pointing at travis-ci.org.

CamStan commented 3 years ago

Might as well. Travis CI's migrate to switch to travis-ci.com is only in beta right now, but we could possibly contact them to migrate if the bigger hammer doesn't work.

Also, if the bigger hammer doesn't work, could disable the "Required status before merging" rule until we get it sorted. I believe the setting is under Settings > Branches.

CamStan commented 3 years ago

Got them migrated and hopefully will work better now for all the components. The addresses in the READMEs, and possibly other places, will need to be updated to point to travis-ci.com now, but I can get to that eventually.

adammoody commented 3 years ago

Thanks @tonyhutter and @CamStan !