Repair stall failures by GET with byte range

leonerd commented 10 years ago

Rather than failing entirely due to GET stall, it should be possible to continue with a GET with a byte range. Need to keep track of how much has been fetched so far.
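
As a rough illustration of the idea (a Python sketch, not code from sfs3 or the Perl modules discussed here), resuming only needs the count of bytes already received. That count becomes the start of an open-ended HTTP `Range` header. The function name `resume_range_header` is hypothetical.

```python
def resume_range_header(bytes_fetched):
    """Return the extra request header needed to resume a GET.

    bytes_fetched is the number of bytes already received and stored,
    so the next byte wanted is at that (zero-based) offset.
    """
    if bytes_fetched < 0:
        raise ValueError("bytes_fetched must be non-negative")
    # "bytes=N-" requests everything from offset N to the end of the object.
    return {"Range": f"bytes={bytes_fetched}-"}
```

A server that honours the header answers with 206 Partial Content. A plain 200 response means the range was ignored and the body starts again from byte 0, so a caller would need to check the status before appending.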

leonerd commented 10 years ago

Before I get into this too much, it would be good to investigate if S3 supports the byte-range request header, and if so put that into NaWS:S3 first.

leonerd commented 10 years ago

Byte ranges turn out to be simple enough - trickier may be getting stall report + progress information out of NaHTTP into NaWS:S3 to drive a reliable retry loop. Will see how this goes.
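
The shape of such a retry loop can be sketched as follows. This is a Python illustration under assumptions, not the NaWS:S3 code: `Stalled`, `fetch_with_resume`, `fetch_from` and `sink` are hypothetical names, and resetting the retry counter whenever progress is made is one possible policy rather than the one actually used.

```python
class Stalled(Exception):
    """Raised by the (hypothetical) fetcher when no data arrives in time."""


def fetch_with_resume(fetch_from, sink, max_retries=5):
    """Fetch a whole object, resuming from the last good offset after a stall.

    fetch_from(offset) must return an iterable of byte chunks starting at
    that offset, and may raise Stalled part-way through.
    sink(chunk) consumes each chunk as it arrives, e.g. by writing to disk.
    Returns the total number of bytes delivered to sink.
    """
    offset = 0   # progress information: bytes safely handed to sink
    retries = 0  # consecutive stalls without any progress in between
    while True:
        try:
            for chunk in fetch_from(offset):
                sink(chunk)
                offset += len(chunk)
                retries = 0  # progress was made, so start counting afresh
            return offset
        except Stalled:
            retries += 1
            if retries > max_retries:
                raise  # give up: repeated stalls with nothing received
```

Only the offset is carried between attempts. Each chunk goes straight to `sink`, so nothing in the loop holds on to the data already fetched.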

leonerd commented 10 years ago

An initial hack at this is now done in NaWS:S3 source. Needs Net::Async::HTTP -r302 and Net::Async::Webservice::S3 -r140. Will bump CPAN modules when this looks like it works.

leonerd commented 10 years ago

Latest attempt on that failed with OOM. I suspect maybe a reference loop or somesuch.. :( Devel::MAT to the rescue...

leonerd commented 10 years ago

Ahah! In my attempt to put in reliable resume after stall I accidentally captured all the file content in in-memory lexicals. Oops. With that now fixed, another attempt is running

leonerd commented 10 years ago

I now have a .pmat file. After much hunting around I have discovered that the stalled files are all waiting on requests from NaHTTP, on NaHTTP:Connection objects that no longer seem alive.

Specifically, those connection objects do have a fileno in their notifier name, indicating they were at one point connected, but they no longer have a read/write handle, and they themselves don't appear in the connections pool of the actual NaHTTP object. I suspect a corner case of connections being closed without the requests queued on them being cancelled.

leonerd commented 10 years ago

Furthermore, the handle has 'handle_closing' => 1, and has no 'loop' member, suggesting it has been ->close'd. It also lacks a 'ready_queue'. Yet for some reason there's a request pending on it. It ought not to be able to get into that state.
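
To make the invariant concrete, here is a small Python model of it. It is only a sketch of the property being described, not the NaHTTP fix itself: the `Connection` class, its `pending` list and the `ConnectionClosed` exception are all hypothetical. Closing a connection fails every request still queued on it, and a closed connection refuses new ones, so no caller is left waiting forever.

```python
from concurrent.futures import Future


class ConnectionClosed(Exception):
    """The connection went away before the request completed."""


class Connection:
    def __init__(self):
        self.closed = False
        self.pending = []  # futures for requests queued on this connection

    def request(self):
        """Queue a request and return a future for its eventual response."""
        if self.closed:
            raise ConnectionClosed("cannot queue a request on a closed connection")
        future = Future()
        self.pending.append(future)
        return future

    def close(self):
        """Close the connection, failing anything still waiting on it."""
        self.closed = True
        # Detach the queue first so the connection never holds a pending
        # request while in the closed state.
        pending, self.pending = self.pending, []
        for future in pending:
            future.set_exception(
                ConnectionClosed("connection closed with request pending"))
```

With the failure delivered to the caller, a retry loop higher up can see the error and reissue the request on a fresh connection instead of stalling.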

leonerd commented 10 years ago

\o/ I have managed to make a unit test case fail giving all the same symptoms in the explorer, so I now have a good case to fix it from.

leonerd commented 10 years ago

I believe that may now be fixed, so trying it again on the restore test box. Will see if it stalls more

leonerd commented 10 years ago

Woo \o/

$ sfs3 cmp --concurrent=4 --only=/snapshots/backup_20140108/ 13 /var/lib/cassandra/data
...
All done - no differences found

I think that's cracked it. Needs bugfixes in NaHTTP and NaWS:S3, so I'll ship them to CPAN now

leonerd commented 10 years ago

Now on their way to CPAN. Will fix up the dep declarations here and close off bugs tomorrow once they're through the mirrors and installed on the test boxes.

leonerd commented 10 years ago

That worked fine. NaHTTP and NaWS:S3 now CPAN'ed. Will update the deps on sfs3 itself and then this will be done

leonerd commented 10 years ago

Fixed and working as of https://github.com/SocialFlowDev/SocialFlow-S3/commit/0f7722336c1384e32717defb9329590ff7464d7a
