[bug]: MixMonitor audio missing start of audio

wphowell commented 6 months ago

Severity

Trivial

Versions

18.18.1

Components/Modules

app_mixmonitor

Operating Environment

FreePBX 16/CentOS 7

Frequency of Occurrence

Occasional

Issue Description

When the SAN is under high load, recordings are missing the WAV header and the start of the audio. Stereo recordings also don't match up and can't be merged into a single 2-channel audio recording.

We were able to patch app_mixmonitor.c by adding fflush(mixmonitor->mixmonitor_ds->fs_write->f); after the loop that writes frames in mixmonitor_thread -- this appears to ensure that the writes get to the disk and no audio data is being lost.
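As a rough illustration of the shape of that change (not the actual Asterisk code), the sketch below writes a batch of frames and then calls fflush() once after the loop. The struct demo_frame type and the write_batch_and_flush() function are hypothetical stand-ins invented for this example; only fwrite() and fflush() are real library calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for one queued audio frame; not an Asterisk type. */
struct demo_frame {
    const unsigned char *data;
    size_t len;
};

/*
 * Write a batch of frames, then push stdio's user-space buffer to the kernel.
 * Returns 0 on success, -1 on a short write or a failed flush.
 */
int write_batch_and_flush(FILE *out, const struct demo_frame *frames, size_t count)
{
    size_t i;

    assert(out != NULL);
    for (i = 0; i < count; i++) {
        if (fwrite(frames[i].data, 1, frames[i].len, out) != frames[i].len) {
            return -1;
        }
    }
    /* The reported patch adds the equivalent of this call after the write loop. */
    return fflush(out) == 0 ? 0 : -1;
}
```

After a successful call, the bytes written so far are visible to any other reader of the same file, even though the stream is still open.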

Relevant log output

No response

Asterisk Issue Guidelines

[X] Yes, I have read the Asterisk Issue Guidelines

jcolp commented 6 months ago

Do you plan on providing a pull request for including such a change?

seanbright commented 6 months ago

Is Asterisk crashing? The buffers should be flushed implicitly when the FILE *s are closed so it’s not clear how an fflush() makes a difference if Asterisk isn’t crashing.
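The implicit flush on close can be checked with a small standalone sketch. Both function names below are made up for this example, and the 64 KiB buffer size is an arbitrary assumption chosen so that a small write stays in the stdio buffer until fclose() runs.

```c
#include <assert.h>
#include <stdio.h>

/* Size of the file as seen through a fresh handle, or -1 on error. */
long visible_size(const char *path)
{
    long size;
    FILE *f = fopen(path, "rb");

    if (f == NULL) {
        return -1;
    }
    if (fseek(f, 0, SEEK_END) != 0) {
        fclose(f);
        return -1;
    }
    size = ftell(f);
    fclose(f);
    return size;
}

/*
 * Write len bytes with full buffering and no explicit fflush(), then close.
 * *before receives the size visible just before fclose(), *after the size
 * visible once fclose() has returned.
 */
int write_without_flush(const char *path, const void *data, size_t len,
                        long *before, long *after)
{
    FILE *f;

    assert(path != NULL && before != NULL && after != NULL);
    f = fopen(path, "wb");
    if (f == NULL) {
        return -1;
    }
    setvbuf(f, NULL, _IOFBF, 65536);  /* must precede any other I/O on f */
    if (fwrite(data, 1, len, f) != len) {
        fclose(f);
        return -1;
    }
    *before = visible_size(path);     /* data may still sit in the stdio buffer */
    if (fclose(f) != 0) {             /* fclose() writes out whatever is left */
        return -1;
    }
    *after = visible_size(path);
    return 0;
}
```

On common C libraries the size seen before the close is typically smaller than the size seen afterwards, which then matches the number of bytes written.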

wphowell commented 6 months ago

Asterisk does not crash. I'm not sure of the nuance of the filesystem behaviour when the SAN disks are under high load. We've seen that it seems to behave worse under XFS, but still see the issue on EXT4 as well.

Yes, I can submit a pull request with this change.

seanbright commented 6 months ago

The patch will need to make this new behavior optional (a new application argument) that defaults to off.

Flushing should be happening when the files are closed so this seems like the wrong approach to me.

wphowell commented 6 months ago

This is about the start of the audio file being lost, and only the start. The file is missing the WAV header and a good bit of audio at the beginning; the rest of the audio is there and has no issues.
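One way to find affected recordings in bulk is to test each file for the RIFF/WAVE signature that opens a well-formed WAV file. The sketch below is a minimal check under that assumption; has_riff_wave_header() is a hypothetical helper, not part of Asterisk, and it inspects only the first 12 bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if the file starts with "RIFF" at offset 0 and "WAVE" at
 * offset 8, 0 if it does not (or is shorter than 12 bytes), and -1 if
 * the file cannot be opened.
 */
int has_riff_wave_header(const char *path)
{
    unsigned char hdr[12];
    size_t got;
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "rb");
    if (f == NULL) {
        return -1;
    }
    got = fread(hdr, 1, sizeof hdr, f);
    fclose(f);
    if (got != sizeof hdr) {
        return 0;
    }
    return memcmp(hdr, "RIFF", 4) == 0 && memcmp(hdr + 8, "WAVE", 4) == 0;
}
```

A file that fails this check has lost at least its header; the check says nothing about how much audio after the header is missing.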

I think making it optional is good since this is only occurring in a self-hosted environment where there is high IOPS on the SAN (would probably not occur on a public cloud), but when there is, the problem occurs frequently.

seanbright commented 6 months ago

The more you describe this the less inclined I am to accept a PR that does an fflush(...) - optional or not. Can you provide instructions (config, dialplan, etc.) to reproduce this?

wphowell commented 6 months ago

The underlying issue is a hot SAN, as I've said earlier. The start of the recording doesn't get stored on the disk, most likely because the initial data being sent is buffered and lost before it gets written to the disk -- it's only the initial data, as we have not seen loss of audio anywhere else in the file.

It should be reproducible if you limit the IOPS on your storage device to a very low number and have at least two competing resources with about 100 simultaneous calls doing mono and stereo (3 recordings). We find that if we're only doing mono, the loss doesn't seem to occur; it only loses the start of the receive and/or transmit recordings, and those cannot be merged because one side has loss and the other does not.

What is the reason for not at least allowing this as an option? The patch seems to be working well for us.

seanbright commented 6 months ago

What is the reason for not at least allowing this as an option?

Because this…

The start of the recording doesn't get stored on the disk most likely because the initial data being sent is buffered and lost before it gets written to the disk

… does not make any sense to me. Why would only the initial part of the file have issues getting flushed? If you copy a large file to your SAN when it is heavily loaded is the initial part missing?

This needs further analysis and I’m not sure how we do that without being able to replicate your environment. This seems like a very niche issue.

We used to restart Asterisk every night to reduce the chance of a crash. That worked well for us too, but it didn’t actually fix anything.

wphowell commented 6 months ago

It is certainly a niche issue and it would be difficult to replicate elsewhere, but this environment hosts about 20 FreePBX systems, so it has an impact across numerous systems. Even with it occurring regularly in this environment, I'm not sure how to trace a low-level issue like this -- it's definitely much, much worse on XFS though (about 5 times more prevalent on that filesystem). We did manage things before the patch by writing the recordings to tmpfs and then moving them afterwards, but this was not a viable long-term solution.

Having this as an option would be useful for our client at any rate. We can keep applying this patch manually to every system, but that's a lot of work on our side.

seanbright commented 6 months ago

Can you provide instructions (config, dialplan, etc.) to reproduce this?

If you copy a large file to your SAN when it is heavily loaded is the initial part missing?

Edit: Also, do you have an example set of 3 files (combined, send, recv) that you can send me privately to review?

jcolp commented 6 months ago

Something else to note is that fflush doesn't actually guarantee a flush to disk. It only flushes the user-space buffers. So: is it possible that the buffer normally does not get flushed to kernel space (because of some filesystem issue or disk slowness) and gets overwritten?
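To make the distinction concrete, here is a minimal sketch for a POSIX system. fflush() hands the stdio buffer to the kernel, and a separate fsync() on the underlying descriptor asks the kernel to push the data to the storage device. The function name flush_to_disk() is an assumption for this example, and nothing here implies that Asterisk does or should do this.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose fileno() and fsync() in strict modes */
#endif
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Two-step flush: stdio buffer to kernel, then kernel to storage.
 * Returns 0 on success, -1 if either step fails.
 */
int flush_to_disk(FILE *f)
{
    assert(f != NULL);
    if (fflush(f) != 0) {          /* user-space buffer -> kernel page cache */
        return -1;
    }
    if (fsync(fileno(f)) != 0) {   /* page cache -> storage device */
        return -1;
    }
    return 0;
}
```

Calling only fflush() performs the first step and leaves the timing of the second to the kernel, which is the gap described in the comment above.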

jcolp commented 6 months ago

Additionally, until an analysis and explanation beyond "adding fflush solves this" is available and makes sense, the change will not be merged. If other individuals are experiencing this issue and can provide additional information to narrow it down, then all the better as well.

jcolp commented 5 months ago

Is any further investigation on this going to be done to understand why fflush resolves it in your environment?

wphowell commented 5 months ago

We haven't come up with a good way to do that. I'd suggest that we close this until we can come up with a way to get to the root of the issue.
