Closed tonyhutter closed 3 years ago
This is still WIP while I'm testing the BB API transfer resume.
I've substantially revamped this code, including adding BB API test cases. It should be ready to review. I'm removing the WIP tag.
Thanks, @tonyhutter . I've taken a first pass. I feel like this is complicated enough I'll want another pass or two as we work through it.
@adammoody thanks for taking a first look. If I didn't comment on something specifically, then assume I'll fix it in my next push.
@adammoody I think I addressed all your comments in my latest push.
Thanks, @tonyhutter . After those, I think we're on to the second phase of the resume review which is to consider failures at different points. I was first thinking we would do that now, but it could be a lot of work, so maybe we could merge this (and the delete crc PR) then consider it in a third PR. It's possible it may drive us to change how we have defined the AXL interface, though, so we should do it sooner rather than later if we go ahead and merge now.
Any snags are likely to show up after AXL completes a phase but the process dies just before AXL updates its state file to record the state change. That means the resume will start up in an old state. We need to think through each case in turn, meaing every AXL state x AXL transfer combination to make sure they all behave properly.
As an example, consider what happens if a process dies right here after it has renamed files but just before it marks the transfer as being complete: https://github.com/ECP-VeloC/AXL/blob/d738fb7a973367f57d9480cba9b2b63717920374/src/axl.c#L864
A resumed pthread transfer would start up and see the transfer as DISPATCHED (not COMPLETED). It also will not see the temp files, because they were in fact renamed. So I think it will start over from scratch. That's maybe inefficient, but I believe that might still work. It will just end up writing all of the files as temp files again and then renaming again to clobber the final files. I haven't looked through all of the code to make sure. But that illustrates the kind of analysis we need to do to make sure resume works.
Consider things if we had already implemented the delete-after-transfer feature: https://github.com/ECP-VeloC/AXL/issues/74
If we deleted the source files before writing COMPLETED to the state file, then the existing pthread resume would definitely no longer work. It would try to restart, but it would be missing its source files. We might need to add a new internal state to capture that distinction. The interactions between everything get complicated and sort of a puzzle we need to think through.
We are also going to want a way for an application to remove a state file. Right now, only AXL deletes that file and only during AXL_Free(id). That means a restarted application must be able to call AXL_Create in order to call AXL_Free, but there are several new ways AXL_Create might fail. For example the file might be corrupt, or the file might be valid, but the app might have lost track of the xfer type and/or name that was used in the original create call.
For now, we could say that it's the caller's responsibility to unlink the state file in those situations.
I keep wondering if we'll eventually need something akin to O_CREAT | O_EXCL | O_TRUNC in the create call, or perhaps define a new AXL_Open call that opens an existing state file without having to know what the xfer/state values were in advance. I think this will become apparent as we work through more failure situations.
@adammoody thanks for taking another look. My latest push addresses the new fixes you suggested.
As far a failure cases go, I agree that we should document any ways we can think of where resuming the transfer can go wrong. Then we can write test cases to exercise those scenarios to prove to ourselves that we have those bases covered. I'm glad you brought up delete-after-transfer, because that is a place that could bite us if we're not being careful.
@adammoody thanks for taking another look. My latest push addresses the new fixes you suggested.
As far a failure cases go, I agree that we should document any ways we can think of where resuming the transfer can go wrong. Then we can write test cases to exercise those scenarios to prove to ourselves that we have those bases covered. I'm glad you brought up delete-after-transfer, because that is a place that could bite us if we're not being careful.
Yes, documentation and test cases will be great. To get there, we need to work through the mental exercise to examine every single line of AXL code one by one, including every iteration of each for/while loop, to consider what will happen during a resume (or cancel) if the process had died just before or just after that line of code executes. The vast majority of lines won't make a difference, but some lines will be critical, and it's hard to know which ones are important without careful consideration of every line one by one. Easy. Just saying that we're getting into the "fun" part of fault tolerance now.
Latest round of changes are included in my latest push.
We should also consider adding a style checker into Travis.
@adammoody I just pushed the latest round of fixes
Latest round of changes are included in my latest push.
We should also consider adding a style checker into Travis.
Yeah, a style checker is a good idea. We could apply a consistent style across all of the ecp-veloc components.
I like tools like astyle/clang-format since they can help implement the changes. We could define an astyle or clang-format file to include in the components. I've seen some good clang-format definitions lately, and here is an example astyle file from mpifileutils, e.g.,
https://github.com/hpc/mpifileutils/blob/master/astyle.options
@tonyhutter or @CamStan , any idea why the travis job is stuck pending here?
I just re-pushed to kick off a new Travis test. Hopefully that will work
This was happening with KVTree yesterday too, but a re-push fixed that one.
I ended up messing with some access things, that might have messed it up since some of our components are hosted on travis-ci.com and some are still on the older one, travis-ci.org. I sent another access request that will need to be approved, so hopefully that will get it working again.
@tonyhutter , can you force a re-build on travis? I just clicked a button to enable things from an email that @CamStan triggered.
Hmm, same outcome. We could try a bigger hammer with an uninstall / reinstall the travis webhook. I do see the payload URL is pointing at travis-ci.org.
Might as well. Travis CI's migrate to switch to travis-ci.com is only in beta right now, but we could possibly contact them to migrate if the bigger hammer doesn't work.
Also, if the bigger hammer doesn't work, could disable the "Required status before merging" rule until we get it sorted. I believe the setting is under Settings > Branches.
Got them migrated and hopefully will work better now for all the components. The addresses in the READMEs, and possibly other places, will need to be updated to point to travis-ci.com now, but I can get to that eventually.
Thanks @tonyhutter and @CamStan !
Change the API so that
AXL_Create()
takes in a state_file argument, and remove state_file fromAXL_Init()
. Use macro mangling for backwards compatibility so that this doesn't break applications that call the functions the old way. We can get rid of the macro mangling once the FILO/VeloC are updated.Add
AXL_Resume(id)
to resume a transfer. This works for sync, pthreads, and BBAPI. For BBAPI, there's no way to resume a transfer, so we just keep the existing transfer going.Update
axl_cp
and test cases with all these changes.