Spurious pipe errors crash the command pipe thread in process.py

hogwell commented 2 days ago

Description

Sometimes, commands are no longer processed. For example, the "Take Image" button does nothing.

Rebooting the computer to reload sometimes works.

Steps to Reproduce

Hard to reproduce because it seems to be intermittent.

Expected Behavior

Even though the image is properly refreshed, piped commands should continue to work.

Additional Information (Optional)

The following error is logged: ERROR: FIFO pipe busy. [Errno 11] Resource temporarily unavailable

This seems to end the command thread.

Pi Model

Zero

Operating System

Raspberry Pi OS Bullseye (11.x)

Checks

[X] I have not found any related issues
[X] I have updated to the latest version
[X] I want to contribute a fix

hogwell commented 2 days ago

I modified the code for setup_fifo() and read_pipe() to catch the BlockingIOError's and ignore them.

So far, this has fixed the problem for me.

kaihokori commented 2 days ago

@NhanDotJS will need to look into this and try recreate the issue

Chen-Loi commented 2 days ago

I think there may be 2 issues at play here. The first happens when there's something written in the pipe when the program launches, that one causes it to terminate immediately (which might be sidestepped by ignoring the BlockingIOError).

On the other hand, in regards to the issue of "Take Image" or sending pipe commands sometimes doing nothing- that might be a separate issue and here's my guess about what may be going on.

There's meant to be a configurable amount of delay that happens between reading commands from the pipe - this is meant to be controlled the "fifo_interval setting in the configuration file and was a feature from the original RaspiMJPEG (can't remember whether RPi Cam Interface also let you set the value from the frontend).

This means it ends up being possible for consecutive commands to end up queued in the pipe itself as, for example, something like "im\nim\nim", if they are all sent during this delay period, before being checked and read by the program.

But when the program finally reads from the pipe, each read from the pipe done by the program happens as a 256 character (256 bytes) block at a time. And I suspect this bug happens because when it reads this kind of queued up block of commands from the pipe, the parsing logic is reading that "im\nim\nim" all as a single "im" command. And then subsequently queues only one command for the read (instead of 3, in this case).

This behaviour something I noticed myself during testing, but never had the time to address during development. Fixing this will probably require adding some extra logic to check whether content being read from the pipe has any newlines and split it on the newlines so they can be parsed individually.

.........Or then again, it's possible the problem might also be caused by something else entirely.

NhanDotJS commented 1 day ago

I too have encountered this error during several testing. And I did applied a similar code to the PR as a temporary fix. However, I did not find the root cause of the situation, and by ignoring the error, we may run into further error. I will find out if this is the case, otherwise, your code could be good to merge.

Chen-Loi commented 1 day ago

From my understanding, the root cause of that particular issue is due to the pipe operating in non-blocking mode.

As far as I can tell, the old RaspiMJPEG program opens the pipe in non-blocking mode, so that the whole program doesn't just immediately stall until something is written into it - but then immediately switches it back into blocking mode right afterwards by clearing the nonblock flag using fcntl(fd[i], F_SETFL, 0).

Due to an oversight, that switch back to blocking mode was neglected to be implemented here.

This means the program currently keeps the pipe open for reading in non-blocking mode the whole time. And when you try to read from a non-blocking pipe while there is nothing inside it to read, that causes Errno 11 (also known as EAGAIN or EWOULDBLOCK). It was missed, because having "nothing to read" only happens if an empty pipe is simultaneously opened for writing at the same time as attempting to read (if it's not being opened for writing when empty, the reader just gets given an EOF to read instead, which doesn't trigger the error). So if the RPi Cam frontend opens the pipe and takes too long to write and close it, it would also trigger this error (at least in theory).

As for why this wasn't discovered earlier.... the detailed excuse is that because the RPi Cam Web Interface frontend conducts opening, writing and closing the pipe so quickly, during testing it never held the pipe open long enough to coincide with a read and cause the error. Testing did trigger the error when commands were sent to the pipe prior to launch/restart, but at the time it was believed that the error was caused by there being something remaining in the pipe rather than the exact opposite. Turns out, the error is caused not because of anything remaining in the pipe, but because the writer blocks and keeps the pipe open for writing, until the program opens the read-end up again on restart/launch. And at this point, the read operation used to flush the pipe on program launch reads nothing and throws the error (again, typically, if there was no writer holding the empty pipe open for writing, it would get an EOF and be happy).

Though more simply, I guess you could boil it down to spotty testing and unfamiliarity among the team with the detail workings of FIFO pipes and/or incomplete reverse-engineering of the RaspiMJPEG logic.

So, if my understanding of the root cause is right, fixing this problem can be done adding: import fcntl Into the app/core/process.py file and then inserting the line: fcntl.fcntl(CameraCoreModel.fifo_fd, fcntl.F_SETFL, 0) Just after the line in the setup_fifo() method that goes CameraCoreModel.fifo_fd = os.open(path, os.O_RDONLY, 0o666), which should be Line 129 in the current version of that file.

Adding that line should implement the switch back to blocking mode for the pipe missed during initial development.

hogwell commented 1 day ago

Interesting analysis and hypotheses...

I'm glad this particular error might be prevented or mitigated by better blocking mode management on the pipe.

Despite this, though I don't see any reason to silently exit the command thread when a pipe error occurs (for whatever reason) , or when garbage is received, etc.

Keeping the thread alive for reliability in processing commands should be the priority, even if other race conditions are present or future os updates cause a regression back to this problem again.

At least, that's my opinion - ignoring these errors (perhaps logging them) creates a more robust situation overall for command processing by the back end.

Chen-Loi commented 1 day ago

It's not an unreasonable take- in which case, it might also be worth adding in a try-except for the (only) other pipe os.read operation in the program, the one in process.py's read_pipe() method.

Unlike the one in setup_fifo(), the operation in read_pipe() isn't currently guarded by a try clause at all. So an error in reading the pipe there - which happens every event loop - will also make the pipe-reading thread invisibly crash (i.e. what would currently happen in prior-mentioned hypothetical scenario where the frontend holds writing-end of the nonblocking pipe open for too long) even though the command-execution thread keeps going. So for prioritizing robustness, you'd want to catch/ignore errors on that one as well.

Visibly logging errors like these in the schedule log file is certainly a good idea, but in the particular case for pipe reading would require some amount of further tweaking to enable. This is because schedule log files/filepaths are associated with a given camera (since their filepaths are specified by the configuration file associated with a camera) and printing to a schedule log requires calling a camera object's print_to_logfile() instance method.

However, neither setup_fifo() not read_pipe() currently get passed any camera objects.

For them to print to schedule log would therefore require either exposing/passing camera objects to them or else exposing/creating an exposed version of the print_to_logfile() method. The latter would require some further modification as the existing print_to_logfile() method references the camera index number and schedule log filepath associated with its particular camera instance. You'd have to set up some other way for an non-instance exposed version to fetch those values.

hogwell commented 9 hours ago

Re: logging from the command thread... Perhaps the "main camera" object could be used for the logging, or logging to all cameras??

Re: try-except in read_pipe... I did this in my fork and FWIW everything is running fine for me.

NhanDotJS commented 5 hours ago

The code was merged together with #7 , thanks to @hogwell . Current resolution: Ignore BlockingError, keeps command thread alive despite error. This issue will be closed, please re-open if the error persist or any error occurred by the fix!

Windermere-Technology / RasPyCam