enbility / eebus-go

EEBUS protocol implementation in go
https://enbility.net
MIT License
52 stars 14 forks source link

Potential RaceCondition in CS-LPC and CS-LPP in WriteApprovalCallbacks #104

Closed sthelen-enqs closed 2 weeks ago

sthelen-enqs commented 4 weeks ago

When a Feature for a LocalEntity has multiple WriteApprovalCallbacks (e.g. LPC+LPP) and a command that requires write approval is sent for one of these features, depending on the order in which the WriteApprovalCallbacks are executed a nil pointer dereference can occur causing a segmentation fault.

In general the segmentation fault occurs when the ApproveOrDenyWrite method is called on a ship-go FeatureLocalInterface, the passed Message is not complete (i.e. contains only a valid MsgCounter) and "completes" the number of required approvals such that processWrite is then called with an incomplete Message.

One potential fix for this issue would be splitting the public ApproveOrDenyProductionLimit messages into a public and private part where the private part directly takes a *shipapi.Message which it can pass to ApproveOrDenyWrite and can be called from the WriteApprovalCallback. Another potential fix would be to adapt ApproveOrDenyWrite in ship-go to cache the "most valid" message and pass that through to processWrite instead of passing the last message.

If we decide on the first option, we should definitely document the API contract for ApproveOrDenyWrite that it only accepts "valid" messages and may crash when passed incomplete Messages.

Backtrace

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x8ca891]

goroutine 181 [running]:
github.com/enbility/spine-go/spine.(*FeatureLocal).processWrite(0xc000000300, 0xc000632008)
        /home/simon/projects/intern/02_eebus/eebus-servers/vendor/github.com/enbility/spine-go/spine/feature_local.go:674 +0xb1
github.com/enbility/spine-go/spine.(*FeatureLocal).ApproveOrDenyWrite(0xc000000300, 0xc000632008, {0x0, 0x0})
        /home/simon/projects/intern/02_eebus/eebus-servers/vendor/github.com/enbility/spine-go/spine/feature_local.go:236 +0x3e5
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).ApproveOrDenyProductionLimit(0xc0002b2cc0, 0x128, 0x1, {0x0, 0x0})
        /home/simon/projects/intern/02_eebus/eebus-servers/vendor/github.com/enbility/eebus-go/usecases/cs/lpp/public.go:151 +0x246
created by github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB in goroutine 122
        /home/simon/projects/intern/02_eebus/eebus-servers/vendor/github.com/enbility/eebus-go/usecases/cs/lpp/usecase.go:148 +0x325

Illustrative Example

  1. A device registers the usecases cs/lpc and cs/lpp which both register their loadControlWriteCBs.
  2. The device receives an LPC command and the WriteApprovalCallbacks for both use cases are entered in separate goroutines "simultaneously" (LPC, LPP)
  3. The LPC handler proceeds through line 140 and calls the EventCB registered on the LPC use-case.
  4. The LPC EventCB calls ApproveOrDenyConsumptionLimit which passes through to line 159 and calls ApproveOrDenyWrite
  5. ApproveOrDenyWrite runs through until line 215 where since it's the first handler out of two handlers total it sets r.writeApprovalReceived[*msg.RequestHeader.MsgCounter] = 1 and returns (1 < 2 line 222)
  6. The LPP handler then proceeds through to line 148 because limitId != item.LimitId for all entries in the LoadControlLimitData and calls ApproveOrDenyProductionLimit with the MsgCounter (and only the MsgCounter of the message)
  7. ApproveOrDenyProductionLimit continues until line 143 where it enters the !ok branch because there are no pending LPP limits.
  8. ApproveOrDenyProductionLimit then creates a new spineapi.Message which only has RequestHeader.MsgCounter set (all other fields are nil) and calls ApproveOrDenyWrite
  9. ApproveOrDenyWrite continues until line 215 sets r.writeApprovalReceived[*msg.RequestHeader.MsgCounter] = 1 + 1 passes through the check on line 222 (2 >= 2) and continues until line 236 to call processWrite
  10. processWrite calls executeWrite which calls msg.Cmd.Data() which returns errors.New("Data not found in Cmd") because Cmd is zero-initialized and all its fields are nil.
  11. executeWrite then returns model.NewErrorType(model.ErrorNumberTypeCommandNotSupported, err.Error()) and execution continues in processWrite on line 674 with a call to msg.FeatureRemote.Device().Sender().ResultError(msg.RequestHeader, r.Address(), err) which proceeds to segfault on calling the Device() method of the FeatureRemote interface which was zero-initialized to nil.

This segfault also occurs when the LPP handler completes first for an LPP command (then segfaulting during the LPC-handler). The segfault does not occur when the LPC handler completes second for an LPC command or the LPP handler completes second for an LPP command as then the actual message is retrieved from the pendingLimits map and passed to ApproveOrDenyWrite.

sthelen-enqs commented 4 weeks ago

The simplest way to reproduce this is to replace lpp/usecase.go line 148 with

    go func() {
        time.Sleep(1 * time.Second)
        e.ApproveOrDenyProductionLimit(*msg.RequestHeader.MsgCounter, true, "")
    }()

And then send an LPC command while both LPC and LPP are active. This will guarantee that the LPP-Handler is called after the LPC-Handler for all commands (which will segfault for LPC commands).

sthelen-enqs commented 4 weeks ago

I've dropped an implementation of the first option in #105

DerAndereAndi commented 3 weeks ago

Thank you @sthelen-enqs for this tremendous bug report and the PR! This looks awesome.

I'll try to look into it as soon as possible.

DAMEK86 commented 3 weeks ago

@sthelen-enqs, can you provide the affected eebus-go and ship revisions? I don't see this behavior. If you like we can also have a debug session to check the differences between our implementations. ;)

Greetings from Ulm

sthelen-enqs commented 3 weeks ago

I could reproduce the issue with at least the current versions (github.com/enbility/eebus-go v0.6.1, github.com/enbility/ship-go v0.5.3, github.com/enbility/spine-go v0.6.2). I can try creating a minimal reproducer for the issue

sthelen-enqs commented 3 weeks ago

I threw up a branch with a minimal reproducer for the segmentation fault here: https://github.com/ConsolinnoEnergy/eebus-go/tree/test/issue-104-race-condition-reproducer on top of cmd/hems. To reproduce the crash you can run this (slightly) modified hems runner twice on the same machine (or the same network) and both nodes should crash ~2s after the nodes connect to each other (the LPC-client can also be swapped out with any other compliant LPC-client).

The crashing command output should then look something like this:

% go run cmd/hems/main.go 2345 $(<scratch/client/ski) scratch/server/cert.pem scratch/server/key.pem
2024-08-19 09:53:34 INFO  Local SKI:  7fb83ba42d6b4bdf66c76bb7e4191c01389a2e3b
2024/08/19 09:53:36 OnLPCEvent: cs-lpc-UseCaseSupportUpdate
2024/08/19 09:53:36 OnLPCEvent: cs-lpc-DataUpdateHeartbeat
2024/08/19 09:53:37 OnLPCEvent: cs-lpc-WriteApprovalRequired
Approving LPC write with msgCounter 64 and limit 5000 W
2024/08/19 09:53:37 Sending limit of 5000W to d:_n:Demo_HEMS-123456789:[1]:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x857391]

goroutine 209 [running]:
github.com/enbility/spine-go/spine.(*FeatureLocal).processWrite(0xc000216240, 0xc00062d608)
        /home/simon/projects/intern/02_eebus/eebus-go/vendor/github.com/enbility/spine-go/spine/feature_local.go:674 +0xb1
github.com/enbility/spine-go/spine.(*FeatureLocal).ApproveOrDenyWrite(0xc000216240, 0xc00062d608, {0x0, 0x0})
        /home/simon/projects/intern/02_eebus/eebus-go/vendor/github.com/enbility/spine-go/spine/feature_local.go:236 +0x3e5
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).ApproveOrDenyProductionLimit(0xc000254b70, 0x40, 0x1, {0x0, 0x0})
        /home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/public.go:151 +0x246
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB.func1()
        /home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/usecase.go:151 +0x4c
created by github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB in goroutine 208
        /home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/usecase.go:149 +0x318
exit status 2
go run cmd/hems/main.go 5432 $(<scratch/server/ski) scratch/client/cert.pem scratch/client/key.pem
2024-08-19 09:53:34 INFO  Local SKI:  84fee6b0109a940e6ae24ff18ee0c8bfa6810514
2024/08/19 09:53:36 OnLPCEvent: cs-lpc-UseCaseSupportUpdate
2024/08/19 09:53:37 Sending limit of 5000W to d:_n:Demo_HEMS-123456789:[1]:
2024/08/19 09:53:37 OnLPCEvent: cs-lpc-WriteApprovalRequired
Approving LPC write with msgCounter 63 and limit 5000 W
2024/08/19 09:53:38 OnLPCEvent: cs-lpc-DataUpdateHeartbeat
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x857391]

goroutine 292 [running]:
github.com/enbility/spine-go/spine.(*FeatureLocal).processWrite(0xc000216240, 0xc0003d2588)
        /home/simon/projects/intern/02_eebus/eebus-go/vendor/github.com/enbility/spine-go/spine/feature_local.go:674 +0xb1
github.com/enbility/spine-go/spine.(*FeatureLocal).ApproveOrDenyWrite(0xc000216240, 0xc0003d2588, {0x0, 0x0})
        /home/simon/projects/intern/02_eebus/eebus-go/vendor/github.com/enbility/spine-go/spine/feature_local.go:236 +0x3e5
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).ApproveOrDenyProductionLimit(0xc000254b70, 0x3f, 0x1, {0x0, 0x0})
        /home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/public.go:151 +0x246
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB.func1()
        /home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/usecase.go:151 +0x4c
created by github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB in goroutine 291
        /home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/usecase.go:149 +0x318
exit status 2
DAMEK86 commented 3 weeks ago

hey,

we were able to reproduce this for LPP as well, not however for LPC so far

 panic: runtime error: invalid memory address or nil pointer dereference
 [signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x49c6c0]

 goroutine 359 [running]:
 github.com/enbility/spine-go/spine.(*FeatureLocal).processWrite(0x4000136480, 0x400015a008)
         github.com/enbility/spine-go@v0.6.1/spine/feature_local.go:674 +0xa0
 github.com/enbility/spine-go/spine.(*FeatureLocal).ApproveOrDenyWrite(0x4000136480, 0x400015a008, {0x0, 0x0})
         github.com/enbility/spine-go@v0.6.1/spine/feature_local.go:236 +0x478
 github.com/enbility/eebus-go/usecases/cs/lpc.(*LPC).ApproveOrDenyConsumptionLimit(0x400017bb60, 0xc3f5, 0x1, {0x0, 0x0})
         github.com/enbility/eebus-go@v0.6.2-0.20240731125756-80409c98e64f/usecases/cs/lpc/public.go:151 +0x260
 created by github.com/enbility/eebus-go/usecases/cs/lpc.(*LPC).loadControlWriteCB in goroutine 314
         github.com/enbility/eebus-go@v0.6.2-0.20240731125756-80409c98e64f/usecases/cs/lpc/usecase.go:148 +0x344

EDIT: The error appears only after LPC has been triggered. it's super easy to reproduce

DerAndereAndi commented 2 weeks ago

Thanks @DAMEK86. Where you also able to test the suggested PR ?

DAMEK86 commented 2 weeks ago

ups, I tried it a few minutes ago with the latest keo ctrlbox implementation using replace github.com/enbility/eebus-go => github.com/ConsolinnoEnergy/eebus-go v0.0.0-20240815114518-5c7fc4dc655b.

No crash with the changes so far -> nice work @sthelen-enqs!

DerAndereAndi commented 2 weeks ago

Thanks for all this @sthelen-enqs ! Much appreciated