RevolutionPi / piControl

Kernel module for data exchange with RevPi I/O-Modules and Gateways
81 stars 24 forks source link

Revpi 2467/fix concurrent reset #71

Closed linosanfilippo-kunbus closed 2 years ago

iluminat23 commented 2 years ago

https://github.com/RevolutionPi/piControl/issues/70

l1k commented 2 years ago

The first three commits, which serialize multiple concurrent resets, look reasonable. But I'm struggling to understand why read() and write() would need to be serialized with resets? Access to the process image is serialized by lockPI. I think adding another lock on top of that doesn't provide any benefit but rather invites trouble. Am I missing something?

linosanfilippo-kunbus commented 2 years ago

Its the call to isRunning() which returns the internal state of the piBridge which changes during a reset. Right now the "reset lock" locks multiple things that may change during a reset. Not a very beautiful solution admittedly (it would be nicer to have finer grained locking but that would also mean more more locks and I doubt that this is worth the effort given that we are about to reimplement the whole piControl).

l1k commented 2 years ago

Okay, so I guess the idea of isRunning() was to prevent access to the process image while PiBridge communication is stopped. Perhaps the idea was that data in the process image is stale and shouldn't be visible to the user. However, access to the process image is locked by lockPI, so perhaps piControlReset() or one of its callees should acquire that lock.

The call to isRunning() is racy because piCore_g.eBridgeState may change immediately after the check. I think that can be fixed by briefly acquiring lockPI whenever piCore_g.eBridgeState is changed. And piControlRead() / piControlWrite() / KB_SET_VALUE / KB_GET_VALUE may only call isRunning() while holding lockPI.

Fine grained locking is definitely better and we need to invest some thought here to get it right. That will also help us with the reimplementation / refactoring because we get a better understanding how the existing, historically grown machinery works.

l1k commented 2 years ago

Okay and why do we need to serialize KB_GET_DEVICE_INFO / KB_GET_DEVICE_INFO_LIST with resets? I guess because a reset may cause addition or removal of devices, right?

Hm, if this is refactored the way we've discussed, I think the proper approach would be to use refcounting. We'd search the pibridge with device_for_each_child() and when finding a matching device, we'd call get_device() to bump the device's refcount and work with it regardless whether a concurrent reset is ongoing.

I'm trying to anticipate what this may look like in the future and what kind of locking we could add to make steps towards that future.

l1k commented 2 years ago

Regarding KB_FIND_VARIABLE, I think we need a lock to protect access to the configuration variable store piDev_g.ent.

The for-loop which iterates over the variable store and searches for a particular variable should be moved to a helper in piConfig.c so that the inner workings of the variable store are not exposed. Taking a lock to protect access to the variable store would then likewise be internal to piConfig.c.

A quick git grep 'for.*piDev_g.ent->i16uNumEntries' reveals another for-loop over the variable store in PiBridgeMaster_setDefaults() which initializes the process image from the variables' defaults. I think the process image and the variable store are two separate things and shouldn't be intermixed. Maybe we could provide a for_each_variable() helper to iterate over the variable store.

l1k commented 2 years ago

It looks like KB_SET_EXPORTED_OUTPUTS needs the same lock as KB_FIND_VARIABLE: It iterates over the piCopylist, which is a subset of the variable store. The piCopylist is likewise generated in piConfig.c, so access to it should probably be implemented with helpers in piConfig.c, similar to the variable store.

l1k commented 2 years ago

Regarding KB_DIO_RESET_COUNTER, I think the isRunning() check is only necessary to ensure that the pibridge bus is switched to I/O protocol. The DIOs may not be able to parse the telegram if the bus is switched to ModGateRS485 protocol. That's only indirectly related to resets. (The bus may be switched to ModGateRS485 protocol for other reasons than a reset, e.g. because a firmware update is performed.)

The other issue here is that the code checks whether there's actually a DIO at the address specified by the user and to do that, it iterates over the device list. The device list may change during a reset.

l1k commented 2 years ago

The same applies to KB_AIO_CALIBRATE.

Oddly enough, the ioctl refers to the AIO even though it is only applicable to the MIO. How did this ever pass review?

linosanfilippo-kunbus commented 2 years ago

I think the "AIO" in the ioctl name does not refer to the AIO device but to the analog input/output channels of the MIO which can be calibrated with this ioctl...

linosanfilippo-kunbus commented 2 years ago

The other issue here is that the code checks whether there's actually a DIO at the address specified by the user and to do that, it iterates over the device list. The device list may change during a reset.

Right. To make it clear, the reset lock is only supposed to protect data which may change during a reset and be accessed in one of the ioctls, like:

This is not super fine grained but its also not that coarse.

Also the locking with the resetLock does not include other variables/structs which also suffer from possible races but are also altered at other occasions then a reset:

These are other issues that we have to address separately.

l1k commented 2 years ago

Hm, KB_STOP_IO sets piDev_g.stopIO, which is honored by DIO, AIO and Gateway code, but not by MIO code. Looks like a bug.

l1k commented 2 years ago

Regarding KB_CONFIG_STOP, KB_CONFIG_SEND, KB_CONFIG_START, the reset lock is not sufficient because apparently, the user is supposed to issue the three ioctls in exactly this order (I guess with the possibility of multiple KB_CONFIG_SEND in-between).

If a KB_RESET ioctl is intermixed, the PiBridge will be in an unexpected state.

To get this 100% correct, we'd need a lock for the bridge state which is acquired on KB_CONFIG_STOP and released on KB_CONFIG_START. The ioctls do check the bridge state, so they can cope with an intermixed reset, the question is whether applications issuing the ioctls can cope with it. I'm not even sure why these ioctls are needed and which applications use them. End-of-line tester for gateways maybe?

l1k commented 2 years ago

So I think a lot of the commits in this series make sense but lockReset is a bit of a misnomer. What you're really doing is serializing certain user space accesses. Maybe lockIoctl or something like that?

linosanfilippo-kunbus commented 2 years ago

Hm, KB_STOP_IO sets piDev_g.stopIO, which is honored by DIO, AIO and Gateway code, but not by MIO code. Looks like a bug.

Also the DIO, AIO and gateway code are buggy: we cannot set a variable in one thread (e.g the one that executes the ioctl from userspace) and then expect the new value to be perceived by another thread if no synchronization primitives are involved. We could declare piDev_g.stopIO as volatile but that is frawned upon in kernel code. So we should rather use a dedicated lock for this (we could also use memory barriers but this is harder to get right and also there is no significant advantage over using a lock like e.g a mutex).

linosanfilippo-kunbus commented 2 years ago

So I think a lot of the commits in this series make sense but lockReset is a bit of a misnomer. What you're really doing is serializing certain user space accesses. Maybe lockIoctl or something like that?

Sure we can rename that lock. The easiest way would have been to simply take the lock before any ioctl is executed and then release it afterwards. Actually that was the case in earlier kernel versions, when the ioctl() function was protected by the BKL (big kernel lock - which is also the reason for the name "unlocked_ioctl" for the ioctl we use nowadays...)

l1k commented 2 years ago

Some ioctl types are fine to be submitted multiple times in parallel, e.g. GET_VALUE and SET_VALUE. So a global lock for all ioctls doesn't seem right. We need to apply it selectively only to certain ioctl types.

linosanfilippo-kunbus commented 2 years ago

I just pushed a new series of patches. This series is much shorter than the one before because it only fixes the reported issue, i.e. it protects the piControl module from multiple concurrent running ioctls that might end up calling PiBridgeMaster_Stop() in parallel and as a consequence stopping the revpi_gate_rcv_thread more than once.

l1k commented 2 years ago

In the KB_UPDATE_DEVICE_FIRMWARE commit, you need to add rt_mutex_unlock() calls in the two places where -EFAULT is returned in the Too many modules and No module to update checks.

linosanfilippo-kunbus commented 2 years ago

In the KB_UPDATE_DEVICE_FIRMWARE commit, you need to add rt_mutex_unlock() calls in the two places where -EFAULT is returned in the Too many modules and No module to update checks.

This code is commented out. Should not we simply remove this?