data61 / Mirza

6 stars 2 forks source link

Test fails intermittently #257

Closed sajidanower23 closed 5 years ago

sajidanower23 commented 5 years ago

The test revokePublicKey tests Expired but NOT revoked pub key passes most of the time, but occasionally fails:

uncaught exception: IOException of type UserError
user error (KeyErrorBRE AddedExpiredKey)

This needs investigation.

Original Author: ano002

(Moved with github-migration-0.1.0.0 (package github-migration-0.1.0.0 revision df9f38b))

axman6 commented 5 years ago

Might be related to #234

Original Author: mas17k

a-stacey commented 5 years ago

I have tried to reproduce this on my machine but I can't. I used the following script to automate 100 iterations:

#!/bin/bash
for ((i=1;i<=100;i++));
do
   ./run_tests.sh --ta '-p "Expired but NOT revoked pub key"' >> debug_testout.txt 2>&1
done

Can keep an eye on it with:

watch -n 1 "cat debug_testout.txt | grep -i "fail""
# or
watch -n 1 tail -n 20 debug_testout.txt

Code currently looks like:

    it "Expired but NOT revoked pub key" $ \brContext -> do
      nowish <- getCurrentTime
      let smallDelayInSeconds = 1
          nearExpiry = addUTCTime (fromInteger smallDelayInSeconds) nowish
      Just pubKey <- goodRsaPublicKey
      keyId <- testAppM brContext $ do
        user <- insertDummies
        keyId <- addPublicKey user pubKey (Just . ExpirationTime $ nearExpiry)
        pure keyId
      threadDelay $ fromIntegral $ secondsToMicroseconds smallDelayInSeconds
      myKeyState <- testAppM brContext $ do
        keyInfo <- getPublicKeyInfo keyId
        pure (keyInfoState keyInfo)
      myKeyState `shouldBe` Expired

So I'm guessing that it might be that the tolerance is to tight particularly with the file read in there if you machine is a little slower, so I would guess that maybe the following fixes:

-- Move the file read before obtaining the time (incase your machine disk read time is a little slower then mine).

    it "Expired but NOT revoked pub key" $ \brContext -> do
      Just pubKey <- goodRsaPublicKey
      nowish <- getCurrentTime
      let smallDelayInSeconds = 1
          nearExpiry = addUTCTime (fromInteger smallDelayInSeconds) nowish
      keyId <- testAppM brContext $ do
        user <- insertDummies
        keyId <- addPublicKey user pubKey (Just . ExpirationTime $ nearExpiry)
        pure keyId
      threadDelay $ fromIntegral $ secondsToMicroseconds smallDelayInSeconds
      myKeyState <- testAppM brContext $ do
        keyInfo <- getPublicKeyInfo keyId
        pure (keyInfoState keyInfo)
      myKeyState `shouldBe` Expired

I really think that the above is likely to be the issue, but in the advent that it doesn't work, would like to see the impact of the following:

-- Move the file read before obtaining the time (incase your machine disk read time is a little slower then mine) and give a little more tolerance on the test executing the line of code.
    it "Expired but NOT revoked pub key" $ \brContext -> do
      Just pubKey <- goodRsaPublicKey
      nowish <- getCurrentTime
      let smallDelayInSeconds = 2
          nearExpiry = addUTCTime (fromInteger smallDelayInSeconds) nowish
      keyId <- testAppM brContext $ do
        user <- insertDummies
        keyId <- addPublicKey user pubKey (Just . ExpirationTime $ nearExpiry)
        pure keyId
      threadDelay $ fromIntegral $ secondsToMicroseconds smallDelayInSeconds
      myKeyState <- testAppM brContext $ do
        keyInfo <- getPublicKeyInfo keyId
        pure (keyInfoState keyInfo)
      myKeyState `shouldBe` Expired

@ano002 would you mind running that script on your machine against the current head of master, first to make sure that you can reproduce the issue repeatedly and then against the head of as_257 (currently 0228259f2e1cc87e683a531342c8e0abd0698793). I've made a PR for this which is https://github.csiro.au/Blockchain/supplyChainServer/pull/284, if this works then you should just be able to merge and job done. I would suggest running the tests over lunch or in a separate checkout so that it doesn't interfere with your development work and changing files.

For reference I tested my grep by inserting the following line at the top of the test case:

False `shouldBe` True

Original Author: sta358

sajidanower23 commented 5 years ago

In master, I ran this:

#!/usr/bin/env bash

for i in `seq 100`
do
  echo $i
  ./run_tests.sh --ta '-p "Expired but NOT revoked pub key"' 2>&1 | egrep -A 2 -B 2 'FAIL'
done

I ran it in 278-add-users-transact-event, and it failed 1 time. I ran it in master afterwards: The test failed 3 times. It is interesting to note that the three times were very close to each other.

20
  HSpec
    revokePublicKey tests
      Expired but NOT revoked pub key: (Logging will be to: /tmp/businessRegistryTests18287-0.log) FAIL (1.38s)
        uncaught exception: IOException of type UserError
        user error (BRKeyErrorBRE AddedExpiredKeyBRKE)
21
22
23
  HSpec
    revokePublicKey tests
      Expired but NOT revoked pub key: (Logging will be to: /tmp/businessRegistryTests18552-0.log) FAIL (1.50s)
        uncaught exception: IOException of type UserError
        user error (BRKeyErrorBRE AddedExpiredKeyBRKE)
24
  HSpec
    revokePublicKey tests
      Expired but NOT revoked pub key: (Logging will be to: /tmp/businessRegistryTests18643-0.log) FAIL (1.37s)
        uncaught exception: IOException of type UserError
        user error (BRKeyErrorBRE AddedExpiredKeyBRKE)

I have also ran it 300 times (total) against the HEAD of as_257. The first 100 runs yielded 1 failure in the 85th run:

85
  HSpec
    revokePublicKey tests
      Expired but NOT revoked pub key: (Logging will be to: /tmp/businessRegistryTests27484-0.log) FAIL (1.26s)
        uncaught exception: IOException of type UserError
        user error (BRKeyErrorBRE AddedExpiredKeyBRKE)

The next 200 runs yielded quite a few more:

18
  HSpec
    revokePublicKey tests
      Expired but NOT revoked pub key: (Logging will be to: /tmp/businessRegistryTests29226-0.log) FAIL (1.24s)
        uncaught exception: IOException of type UserError
        user error (BRKeyErrorBRE AddedExpiredKeyBRKE)
19
  HSpec
    revokePublicKey tests
      Expired but NOT revoked pub key: (Logging will be to: /tmp/businessRegistryTests29277-0.log) FAIL (1.14s)
        uncaught exception: IOException of type UserError
        user error (BRKeyErrorBRE AddedExpiredKeyBRKE)
20
  HSpec
    revokePublicKey tests
      Expired but NOT revoked pub key: (Logging will be to: /tmp/businessRegistryTests29329-0.log) FAIL (1.37s)
        uncaught exception: IOException of type UserError
        user error (BRKeyErrorBRE AddedExpiredKeyBRKE)
21
  HSpec
    revokePublicKey tests
      Expired but NOT revoked pub key: (Logging will be to: /tmp/businessRegistryTests29381-0.log) FAIL (1.34s)
        uncaught exception: IOException of type UserError
        user error (BRKeyErrorBRE AddedExpiredKeyBRKE)

## more runs which passed
142
  HSpec
    revokePublicKey tests
      Expired but NOT revoked pub key: (Logging will be to: /tmp/businessRegistryTests3275-0.log) FAIL (1.17s)
        uncaught exception: IOException of type UserError
        user error (BRKeyErrorBRE AddedExpiredKeyBRKE)

I can confirm that the public key is being read before everything else (which is a change that we should have made anyway, as it makes more sense since we are doing time-sensitive stuff.

However, the results of my test runs tells me that it wasn't an issue with the read, as there isn't an appreciable difference between how the tests did in master and as_257. I would also find it unlikely that reading a file of 733 bytes would take more than a second.

$ du -b ./test/Mirza/Common/TestData/testKeys/goodJWKs/4096bit_rsa_pub.json
733 ./test/Mirza/Common/TestData/testKeys/goodJWKs/4096bit_rsa_pub.json

My prime suspicion is that it is perhaps a concurrency issue in the test, or some issue in the tests and not the source code, but I would want to investigate this further.

Original Author: ano002

sarafalamaki commented 5 years ago

There wont be a timing issue if you don't rely on timing... try doing #234, it's really not that hard, and see if it the tests still fail.

Original Author: fal05c

a-stacey commented 5 years ago

I don't follow your logic @fal05c, I don't understand how that is relevant in this context?

The time here should be nearly 1 second from now, so I don't understand how the discrepancy between pico and micro seconds is relevant?

To be clear, there is definitely a discrepancy in my understanding here because otherwise I would be able to explain why the test is failing but I can't, but that reasoning doesn't complete my understanding so need more explanation on why you think this will work?

Original Author: sta358

sajidanower23 commented 5 years ago

Results of investigation:

The main suspicion is that the function insertDummies takes too long to run, and 1s delay is not quite enough. This is the original function:

-- | Adds the dummy business and user and returns the user id and auth user.
insertDummies :: AppM BRContext BRError AuthUser
insertDummies = do
  businessPfx <- addBusiness dummyBusiness
  uid <- addUser dummyNewUser {newUserCompany=businessPfx}
  tableUser <- runDb $ getUserByIdQuery uid
  pure (tableToAuthUser . fromJust $ tableUser)

To check how long these queries take to run, the function was modified to add some logs:

-- | Adds the dummy business and user and returns the user id and auth user.
insertDummies :: AppM BRContext BRError AuthUser
insertDummies = do
  mark1 <- liftIO getCurrentTime
  businessPfx <- addBusiness dummyBusiness
  uid <- addUser dummyNewUser {newUserCompany=businessPfx}
  tableUser <- runDb $ getUserByIdQuery uid
  mark2 <- liftIO getCurrentTime
  --liftIO $ putStrLn $ (show  mark1) <> "   " <> (show mark2) <> "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  liftIO $ putStrLn $ (show  (diffUTCTime mark2 mark1)) <> "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  pure (tableToAuthUser . fromJust $ tableUser)

The results of running it:

$ for i in `seq 20`; do ./run_tests.sh --ta '-p "Expired but NOT revoked pub key"' 2>&1 | egrep 'XXXXX' | egrep -o '[0-9]\.[0-9]+s'; done
0.742247237s
0.725090403s
0.72489233s
0.727080162s
0.728179514s
0.722488442s
0.743047154s
0.72144989s
0.726389517s
0.770271299s
0.743890961s
0.74579874s
0.757211076s
0.734675037s
0.973582399s
0.735253066s
0.745946722s
0.753573776s
0.757537138s
0.725509096s

The queries take around 0.75s on average, with the max being 0.97 seconds in my machine. This is why the test fails sometimes: insertDummies takes too much time and occasionally a 1 second delay is not enough.

The delay is now being made 2s. The test should not fail again, at least not intermittently because of insertDummies taking too long.

286 has been opened with regards to optimising the queries.

Original Author: ano002