timer accuracy v one shot timer v expiry time (issue_163.t)

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run memcached/t/issue_163.t on a platform that has a course timer and isn't 
that fast - repeatedly

What is the expected output? What do you see instead?
Test should pass,

test fails sometimes with:
not ok 4 - Objects should be reclaimed
#   Failed test 'Objects should be reclaimed'
#   at t/issue_163.t line 27.
#          got: '0'
#     expected: '1'

What version of the product are you using? On what operating system?

1.7.1.1, built from source on ARM (Panda board) running Linaro 11.09 
3.0.0-1404-linaro-lt-omap kernel

Please provide any additional information below.

I believe the problem here is a combination of factors with memcached's timing 
and the way the test is written and the courseness of the timer on the board.
Bumping up the sleep from 2 to 3 seconds in the test fixes it - but there are 
some underlying issues.

The test stores some data with a 1 second expiry, waits 2 seconds and stores 
data again, testing that the slot got reclaimed after the expiry;

However:
  1) memcached's only stores a 1 second precision timer.
  2) clock_handler uses a 1 shot timer, resetting it every time it's called, and hence clock_handler gets called less regularly than once a second

So I added some debug:

a 1..7
b DAG: set_current_time: current_time=2 timer=1319726830/136748
c DAG: set_current_time: current_time=2 timer=1319726830/236723
d DAG: set_current_time: current_time=2 timer=1319726830/241087
e DAG do_item_alloc fall through: current_time=2 (new)exptime=3 
time=1319726830/247984
f ok 1 - stored key
g ok 2 - We should have requested some memory
h DAG: set_current_time: current_time=3 timer=1319726831/563994
i DAG do_item_alloc tail loop: current_time=3 refcount=0 exptime=3 
time=1319726832/296141
j DAG do_item_alloc fall through: current_time=3 (new)exptime=0 
time=1319726832/296752
k ok 3 - stored key
l not ok 4 - Objects should be reclaimed
m #   Failed test 'Objects should be reclaimed'
n #   at t/issue_163.t line 27.
o #          got: '0'
p #     expected: '1'
q ok 5 - deleted key
r DAG do_item_alloc fall through: current_time=3 (new)exptime=0 
time=1319726832/306609
s ok 6 - stored key
t ok 7 - we've not allocated and freed the same amount
u # Looks like you failed 1 test of 7.

So in this case the first store (on line e) happens when current_time is 2 at 
wallclock ...830.247984 seconds; and the expiry is set to 2+1=3

(line h) The clock_handler loop appears to have been very sluggish, and we get 
back to set_current_time at ...831.563994  ~1.3s later where we set 
current_time to 3 

Then we go and do the second set (line i), current time is still '3' yet the 
wallclock is just over 2 seconds since the original store. And since the expiry 
was 3 and current time is 3, we haven't expired and hence the test fails.

---
OK, so the OS seems to have been very sluggish at the event for clock_handler; 
I think the granularity of the timer is 128Hz, so I'm not quite sure why it 
took 1.3s rather than 1.008 - but the fact it's taking any longer than 1 (which 
is legal) means the 2second limit in the test 163 isn't safe.

Dave

Original issue reported on code.google.com by david.gi...@linaro.org on 27 Oct 2011 at 3:30

GoogleCodeExporter commented 9 years ago

Wow, a thorough report!

I'm sorry it was on a silly test like that. Some of the tests are subject to 
timer failure; on a busy system or even on a fast system they'll fail 
sometimes. I've been bumping the sleep from 2 to 3 when I find them. While 
clock_handler will change the time to the actual amount of seconds passed, it 
won't fire on a second alignment as you said. This means that occassionally 
time bumps forward two seconds on a tick, or whatever.

Dustin has a "time travel" test method that's in uh... the 1.6 tree (I think?) 
which completely removes the need for sleeping here at all.

In the meantime, I'll bump that guy to 3. Eventually we should have the tests 
repaired anyway.

Original comment by dorma...@gmail.com on 27 Oct 2011 at 5:53

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

No problem; I had to nail why it was happening to prove that it wasn't
my compiler
changes that were causing it.

Thanks; but I think maybe  the way your time sampling system works
together with exptime might
cause problems for people more generally.

memcached/doc/protocol.txt defines exptime as:

     'it is guaranteed that clients will not be able to
      retrieve this item after the expiration time arrives (measured by
      server time).'

I don't know if you define server time any where, but I guess people wouldn't
expect the sampled 1 second timing system, and it's effect that it could
actually hang around for almost 2s - although I guess people rarely
use that short/precise expiries in practice - it's just the word
'guaranteed' is pretty
strong.

Dave

Original comment by david.gi...@linaro.org on 28 Oct 2011 at 12:38

GoogleCodeExporter commented 9 years ago

It's not been an issue in practice. I've thought about a few ways to make it 
more accurate (which aren't too hard), but there're too many other important 
things that people do run into in production to care right now.

Having to wait for all those sleeps in the tests is more annoying than seconds 
ticking slightly off from reality :)

Original comment by dorma...@gmail.com on 28 Oct 2011 at 5:39

allen8807 / memcached

timer accuracy v one shot timer v expiry time (issue_163.t) #230