GiraffaFS / giraffa

Giraffa FileSystem (Slack: giraffa-fs.slack.com)
https://giraffa.ci.cloudbees.com
Apache License 2.0
17 stars 6 forks source link

Fix intermittent failure from testLeaseRecovery because of race conditon #196

Closed weilintsaiWand closed 9 years ago

weilintsaiWand commented 9 years ago

This is a sub issue from #184. There are 3 types of failure in #184, the first one of them are clarified while the other two are still under investigation. Hence, we create this issue to focus on the first one so we can solve #184 step by step.

Failure Log

Running org.apache.giraffa.TestLeaseManagement
Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 31.169 sec <<< FAILURE! - in org.apache.giraffa.TestLeaseManagement
testLeaseRecovery(org.apache.giraffa.TestLeaseManagement)  Time elapsed: 5.042 sec  <<< FAILURE!
java.lang.AssertionError:
Expected: is null
     got: <FileLease[holder=DFSClient_NONMAPREDUCE_369262095_1, path=/testLeaseRecovery, lastUpdate=1435994039179]>

        at org.junit.Assert.assertThat(Assert.java:778)
        at org.junit.Assert.assertThat(Assert.java:736)
        at org.apache.giraffa.TestLeaseManagement.testLeaseRecovery(TestLeaseManagement.java:180)

Root Cause

It's a race condition between renewLease() and internalReleaseLease(). Whenever a new file is created, it will create a block and then call renewLease() to set lease. However, we have another running process to check if the lease expired periodically. Since the lease expiration time is set to be super short in our test. The file can be expired and can be closed by internalReleaseLease() before renewLease() taking place. Therefore, the lease is set to non-null by renewLease() after the file is closed. However, the lease is expected to be null since we think the file is closed. As a result, the assert report a failure because the lease is not null

weilintsaiWand commented 9 years ago

First stable version for review is ready #199 . Will run regression tonight. Can commit tomorrow if regression pass.

Update

Btw, this patch pass 200 times regression. Although it can not finish 2000 times regression due to unknown reasons.

shvachko commented 9 years ago

Sounds like the right solution. Couple of suggestions on code organisation:

  1. Rename checkIfRacing() to checkFileClosed()
  2. The put parameter should be just the key as you don't use anything else from the put itself.
  3. We can remove isFileStateEquals and just add its contents into checkFileClosed(). The latter method is small enough, and you actually do not isFileStateEquals anywhere else.
  4. Should we check (kvs.size() == 1). I think it would be more generic just to look for FileField.LEASE. We may later need to update other fields while updating the lease.
weilintsaiWand commented 9 years ago

Update is ready for second review.

One thing is worth mentioning:

In checkFileClosed(), we check if someone tries to update lease while the file is closed. There are two situations:

  1. kvs.size() == 1. Which means it from updateINodeLease(). We discard the update this case
  2. Otherwise, we leave a warning here since we haven't decided how to handle it.
shvachko commented 9 years ago

Great. Looks good. We should also add a LOG.warn() in case when the lease update comes after close: Attempt to update lease for a file that has been already closed. Ignoring.