Occasional core dump when using the continuous recording API

gonzalocasas commented 4 years ago

I am seeing a very strange issue, which -I assume- is related to an incorrect usage of SoundCard on my part, but I cannot really pinpoint the problem.

I use the recorder() & player() methods of soundcard 0.3.3 to playback and record a sweep signal using a Scarlett 2i4 USB Analog Stereo soundcard on a linux machine with Ubuntu 16.04 (kernel 4.15.0-70-generic). The python script that does this is running as a systemd service, although that should make not difference.

The following is a simplified version of the code I use, the function validate is not included because it's unrelated, but basically, if the recording matches some criteria we continue, if not, we throw an exception:

    RECORDING_OFFSET_IN_MS = 60
    SOUNDCARD_BLOCKSIZE = 2048
    SAMPLE_RATE = 96000
    WAVE_SUBTYPE = 'FLOAT'

    sweep_data,  sweep_sample_rate  = soundfile.read(os.path.join(HERE, filename), dtype='float32')
    speaker = soundcard.get_speaker(device_name)
    mic = soundcard.get_microphone(device_name)

    samples_per_ms = SAMPLE_RATE // 1000
    samples_to_record = len(sweep_data) + (RECORDING_OFFSET_IN_MS * samples_per_ms)

    with speaker.player(sweep_sample_rate, channels=speaker.channels, blocksize=SOUNDCARD_BLOCKSIZE) as player:

        with mic.recorder(SAMPLE_RATE, channels=[0, 1], blocksize=SOUNDCARD_BLOCKSIZE) as recorder:
            player.play(sweep_data)

            record = recorder.record(samples_to_record)

            if not validate():
                raise Exception()

    soundfile.write(recording_filename, record, SAMPLE_RATE, subtype=WAVE_SUBTYPE)

The error occurs from time to time, we run this continuously for several hours, and this happens perhaps every 10 or 15 min. Strangely enough, sometimes this does not happen for many hours, and then suddenly it starts happening every multiple times in a row until it somehow goes back to normal.

Since the core dump cannot be caught, the only error logs I see are on syslog from the failing systemd service:

1814125-Dec 17 16:01:08 gkracousticslab-ThinkCentre-M910t sweep-signal[14888]: Assertion 's' failed at pulse/stream.c:1412, function pa_stream_connect_record(). Aborting.
1814288-Dec 17 16:01:08 gkracousticslab-ThinkCentre-M910t systemd[1245]: sweep-signal.service: Main process exited, code=dumped, status=6/ABRT
1814423-Dec 17 16:01:08 gkracousticslab-ThinkCentre-M910t systemd[1245]: sweep-signal.service: Unit entered failed state.
1814537:Dec 17 16:01:08 gkracousticslab-ThinkCentre-M910t systemd[1245]: sweep-signal.service: Failed with result 'core-dump'.
1814656-Dec 17 16:01:14 gkracousticslab-ThinkCentre-M910t systemd[1245]: sweep-signal.service: Service hold-off time over, scheduling restart.
1814791-Dec 17 16:01:14 gkracousticslab-ThinkCentre-M910t systemd[1245]: Stopped Sweep Signal playback/record service.
1814902-Dec 17 16:01:14 gkracousticslab-ThinkCentre-M910t systemd[1245]: Started Sweep Signal playback/record service.

Is there anything I am doing wrong in this snippet of code that could be causing this behavior? Are there more ways to debug this situation? Alternatively, are there ways to mitigate it, say, if I keep very long lived speaker.player and mic.recorder instances instead of using a context manager all the time, would that minimize the chances of failure, or will it make it worse?

Thanks!

bastibe commented 4 years ago

I have seen similar issues in other contexts as well. As far as I can tell, it happens only on Linux, very infrequently, only when recording, and not on my machine. Which is why I haven't been able to debug it, yet.

Since you seem to have a reproducible setup, there. Would you be able to help me track down the cause of this issue?

It seems that the error happens in _Recorder._connect_stream, when pa_stream_connect_record is called. It seems that for some reason, self.stream is not a valid stream at this point. self.stream is pretty much directly handed to connect_stream from pa_stream_new, though, so I don't see how that could be a problem.

The help me debug this, could you log the value of self.stream before connect_stream is called, and see if it is something different in the crashing cases? Then, we could possibly retry pa_steam_new, or at least crash gracefully from within Python.

gonzalocasas commented 4 years ago

Sure, I can add some log. But not sure exactly what it should be...logging the value of self.stream itself is probably useless because it looks like <cdata 'pa_stream *' 0x56362c22bd20>, maybe something like _pulse._pa_stream_get_state(self.stream)? or is there something else that makes more sense?

-- EDIT: Indeed, I have a very reproducible setup. EDIT2: Following the trace til here seems to suggest it's a locking issue, because reference count is >= 1, right?

bastibe commented 4 years ago

Fascinating!

Up until now I had assumed that the error happened on the preceding line (pa_assert(s)), which is why I had asked for printing self.stream, assuming that it was NULL sometimes for some reason.

But you are right! This is probably a reference counting issue. Do we have to pa_stream_ref the stream ourselves? The source code seems to do that on line 202.

rikvanriel commented 4 years ago

Does commit be44b6557d68147a5789f7174cf4880f2efe6901 fix this issue?

That commit was done to fix the pa_assert(s) problem, but the same root cause might be underlying both, depending on which memory contents get overwritten first after a reallocation?

bastibe / SoundCard

Occasional core dump when using the continuous recording API #69