Closed by leonerd 10 years ago
Before I get into this too much, it would be good to investigate if S3 supports the byte-range request header, and if so put that into NaWS:S3 first.
Byte ranges turn out to be simple enough - trickier may be getting stall report + progress information out of NaHTTP into NaWS:S3 to drive a reliable retry loop. Will see how this goes.
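For anyone following along, the resume idea can be sketched roughly like this. It is a language-neutral illustration in Python, not the NaWS:S3 code: `fetch`, `sink`, `Stall` and `fetch_with_resume` are all made-up names, and the fetcher is assumed to yield body chunks and raise `Stall` when the transfer stops making progress.

```python
class Stall(Exception):
    """Raised by the (hypothetical) fetcher when a transfer stops making progress."""


def range_header(offset):
    # Open-ended HTTP byte range: "from this offset to the end of the object".
    # No header at all for the first attempt, so a plain GET is issued.
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def fetch_with_resume(fetch, sink, max_retries=5):
    """Drive `fetch` until the whole body has been delivered to `sink`.

    fetch(headers) -> iterable of bytes chunks; may raise Stall part-way (assumed API).
    sink(chunk)    -> consumes each chunk as it arrives.
    Returns the total number of bytes received.
    """
    received = 0   # the only state kept between attempts
    retries = 0
    while True:
        try:
            for chunk in fetch(range_header(received)):
                sink(chunk)
                received += len(chunk)
                retries = 0          # progress was made, so reset the retry budget
            return received
        except Stall:
            retries += 1
            if retries > max_retries:
                raise                # give up after repeated stalls with no progress
```

One caveat with this approach: the resumed request only makes sense if the server answers 206 Partial Content. A plain 200 would mean the full body is being sent again from byte 0, and a real implementation would need to check for that.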
An initial hack at this is now done in NaWS:S3 source. Needs Net::Async::HTTP -r302 and Net::Async::Webservice::S3 -r140. Will bump CPAN modules when this looks like it works.
Latest attempt on that failed with OOM. I suspect maybe a reference loop or somesuch.. :( Devel::MAT to the rescue...
Ahah! In my attempt to put in reliable resume after stall I accidentally captured all the file content in in-memory lexicals. Oops. With that now fixed, another attempt is running.
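To illustrate the shape of that mistake and its fix (again a Python sketch with hypothetical names, not the actual Perl): the resume logic only needs to remember how many bytes have arrived, so the chunk handler should write through to the output and keep just a counter, rather than closing over a growing buffer of content.

```python
def make_resumable_sink(out):
    """Return (on_chunk, offset) for a writable binary stream `out`.

    on_chunk(chunk) writes the chunk straight through; offset() reports how
    many bytes have been written so far. Only an integer is retained between
    chunks, so memory use does not grow with the size of the file.
    """
    state = {"offset": 0}

    def on_chunk(chunk):
        out.write(chunk)                 # hand the data off immediately
        state["offset"] += len(chunk)    # remember only where we got to

    def offset():
        return state["offset"]

    return on_chunk, offset
```

The value from `offset()` is what a retry would use as the start of its byte range.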
I now have a .pmat file. After much hunting around I have discovered that the stalled files are all waiting on requests from NaHTTP, on NaHTTP:Connection objects that no longer seem alive.
Specifically, those connection objects do have a fileno in their notifier name, indicating they were at one point connected, but they no longer have a read/write handle, and they themselves don't appear in the connections pool of the actual NaHTTP object. I suspect a corner case of connections being closed without having requests queued on them cancelled.
Furthermore, the handle has 'handle_closing' => 1, and has no 'loop' member, suggesting it has been ->close'd. It also lacks a 'ready_queue'. Yet for some reason there's a request pending on it. It ought not be able to get in that state.
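The invariant I'd expect, sketched in Python with invented names (this is not NaHTTP's internals; `ready_queue` is borrowed from the dump above purely for illustration): closing a connection should fail every request still queued on it, and a request made after close should fail immediately, so nothing can be left waiting forever on a dead connection.

```python
from collections import deque
from concurrent.futures import Future


class ConnectionClosed(Exception):
    """Delivered to any request that can no longer be served."""


class Connection:
    """Toy connection holding a queue of pending request futures."""

    def __init__(self):
        self.ready_queue = deque()
        self.closed = False

    def request(self):
        f = Future()
        if self.closed:
            # Never queue onto a dead connection; fail straight away.
            f.set_exception(ConnectionClosed("connection already closed"))
        else:
            self.ready_queue.append(f)
        return f

    def close(self):
        self.closed = True
        # Drain the queue so no caller is left with a future that never resolves.
        while self.ready_queue:
            f = self.ready_queue.popleft()
            if not f.done():
                f.set_exception(ConnectionClosed("closed with request pending"))
```

With this shape the caller's retry logic gets a definite failure it can react to, instead of a silent stall.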
\o/ I have managed to make a unit test case fail giving all the same symptoms in the explorer, so I now have a good case to fix it from.
I believe that may now be fixed, so trying it again on the restore test box. Will see if it stalls any more.
Woo \o/
$ sfs3 cmp --concurrent=4 --only=/snapshots/backup_20140108/ 13 /var/lib/cassandra/data
...
All done - no differences found
I think that's cracked it. Needs bugfixes in NaHTTP and NaWS:S3, so I'll ship them to CPAN now.
Now on their way to CPAN. Will fix up the dep declarations here and close off bugs tomorrow once they're through the mirrors and installed on the test boxes.
That worked fine. NaHTTP and NaWS:S3 now CPAN'ed. Will update the deps on sfs3 itself and then this will be done.
Fixed and working as of https://github.com/SocialFlowDev/SocialFlow-S3/commit/0f7722336c1384e32717defb9329590ff7464d7a
Rather than failing entirely due to a GET stall, it should be possible to continue with a GET using a byte range. Need to keep track of how much has been fetched so far.