diegoroux / haiku

The Haiku operating system. (Pull requests will be ignored; patches may be sent to https://review.haiku-os.org).
Other
0 stars 0 forks source link

Page fault in media_server_addon and media_server #3

Closed diegoroux closed 1 month ago

diegoroux commented 1 month ago

media_server_addon and media_server sometimes stops due to a page_fault.

vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x24, ip 0x21f8ce74391, write 0, user 1, exec 0, thread 0x13a
debug_server: Thread 314 entered the debugger: Segment violation
stack trace, current PC 0x21f8ce74391  </boot/system/lib/libmedia.so> _ZN11BTimeSource11RealTimeForEll + 0x11:
  (0x7fbba958c3e0)  0x21f8ce5dde3  </boot/system/lib/libmedia.so> _ZN17BMediaEventLooper11ControlLoopEv + 0x183
  (0x7fbba958c4c0)  0x21f8ce5d9ef  </boot/system/lib/libmedia.so> _ZN17BMediaEventLooper19_ControlThreadStartEPv + 0x1f
  (0x7fbba958c4e0)  0x72a3668dd9  </boot/system/lib/libroot.so> thread_entry + 0x19

debug_server: Thread 329 entered the debugger: General protection fault
stack trace, current PC 0x46aa7cf11b  </boot/system/lib/libmedia.so> _ZN12BMediaRoster12GetLiveNodesEP14live_node_infoPiPK12media_formatS5_PKcm + 0x17b:
  (0x7f87769c41e0)  0x161b8c9378c  </boot/system/servers/media_server> _ZN14DefaultManager13_FindPhysicalEPVijb10media_type + 0x18c
  (0x7f87769c4f50)  0x161b8c93e64  </boot/system/servers/media_server> _ZN14DefaultManager13_RescanThreadEv + 0x1c4
  (0x7f87769c5000)  0x161b8c93f09  </boot/system/servers/media_server> _ZN14DefaultManager13rescan_threadEPv + 0x9
  (0x7f87769c5010)  0x15d8b45add9  </boot/system/lib/libroot.so> thread_entry + 0x19

Thread 329 seems to fail with this error message:

SendToPort: write_port failed, msg code 0x703, port 233: Bad port ID
NodeManager::_AcquireNodeReference: node 1 not found
NodeManager::GetCloneForID: couldn't increment ref count, node 1 team 213
NodeManager::_AcquireNodeReference: node 1 not found
diegoroux commented 1 month ago

Thread 329 seems to crash as a result of thread 314's segment violation.

diegoroux commented 1 month ago

Screenshot 2024-07-16 at 23 53 31

diegoroux commented 1 month ago

BMediaEventLooper::ControlLoop calls BTimeSource::RealTimeFor here:

/* virtual */ void
BMediaEventLooper::ControlLoop()
{
        // [...]
        hasRealtime = fRealTimeQueue.HasEvents();
        hasEvent = fEventQueue.HasEvents();

        if (hasEvent) {
            waitUntil = TimeSource()->RealTimeFor(
                fEventQueue.FirstEventTime(),
                fEventLatency + fSchedulingLatency);
        } 
        // [...]
}

A BTimeSource instance is obtained with TimeSource:

BTimeSource*
BMediaNode::TimeSource() const
{
    // [...]
    BMediaNode* self = const_cast<BMediaNode*>(this);
    self->fTimeSource = MediaRosterEx(
        BMediaRoster::Roster())->MakeTimeSourceObject(fTimeSourceID);

    ASSERT(fTimeSource == self->fTimeSource);

    if (fTimeSource == 0) {
        ERROR("BMediaNode::TimeSource: MakeTimeSourceFor failed\n");
    } else {
        ASSERT(fTimeSourceID == fTimeSource->ID());
        fTimeSource->AddMe(self);
    }

    TRACE("BMediaNode::TimeSource node %" B_PRId32 " leave\n", ID());

    return fTimeSource;
}

TimeSource() allows for a null pointer to be returned, allowing RealTimeFor method to possibly be called with this == NULL.

diegoroux commented 1 month ago

BMediaRosterEx::MakeTimeSourceObject: GetNodeFor failed: Operation timed out

diegoroux commented 1 month ago
309: DEBUGGER: BufferCache::GetBuffer: IDs mismatch
debug_server: Thread 309 entered the debugger: Debugger call: `BufferCache::GetBuffer: IDs mismatch'
stack trace, current PC 0x10078bd70a9  </boot/system/lib/libroot.so> _kern_debugger + 0x9:
  (0x7f47be767160)  0x17491d8b16f  </boot/system/lib/libmedia.so> _ZN15BBufferConsumer13HandleMessageEiPKvm + 0x37f
  (0x7f47be767360)  0x17491d97bc3  </boot/system/lib/libmedia.so> _ZN10BMediaNode14WaitForMessageEljPv + 0x163
  (0x7f47be76b3c0)  0x17491d93e25  </boot/system/lib/libmedia.so> _ZN17BMediaEventLooper11ControlLoopEv + 0x1c5
  (0x7f47be76b4a0)  0x17491d939ef  </boot/system/lib/libmedia.so> _ZN17BMediaEventLooper19_ControlThreadStartEPv + 0x1f
  (0x7f47be76b4c0)  0x10078bd5dd9  </boot/system/lib/libroot.so> thread_entry + 0x19
diegoroux commented 1 month ago

Weeks went into fixing this, but after quite some testing, it seems like I've managed to fix this.

Matched with unreliable appearance of this page fault, it seems to be related to the recorded real_time and speed of the buffer_exchange returns.

korli commented 1 month ago

I'd recommend against using threads, it's not needed. It can be directly implemented directly in buffer_exchange(), something like. Having the next timestamp in each stream would make it easier to wait on the next one. The usedLength might be needed for the record stream.

    void* cookie;
    uint32 usedLength;
    do {
        snooze_until(stream->real_time + 1000000L / stream->rate * FRAMES_PER_BUFFER, CLOCK_REALTIME);
        acquire_sem(info->txSem);
        if (info->virtio->queue_dequeue(info->txQueue, &cookie, &usedLength)) {
            stream->buffer_cycle = (stream->buffer_cycle + 1) % BUFFERS;
            stream->real_time = system_time();
            stream->frames_count += FRAMES_PER_BUFFER;
        }

        status = send_playback_buffer(info, stream);
    } while (status == B_BUSY);
    if (status != B_OK) {
        ERROR("playback failed (%s)\n", strerror(status));
        return status;
    }

    buf_info.playback_buffer_cycle = stream->buffer_cycle;
    buf_info.played_real_time = stream->real_time;
    buf_info.played_frames_count = stream->frames_count;
diegoroux commented 1 month ago

I was trying to maintain a thread-free approach in the buffer-exchange, #3 proved to be quite the tricky bug to catch, I had to restart qemu 5000 times, to only have the bug appear 100 times. A nightmare.

If I did solve this issue, it seemed related two to things. One being I shouldn't have gone playback-first then recording, one of the issues stemmed from returning:

buf_info.recorded_real_time = 0; // or whatever garbage value we had there.
buf_info.recorded_frames_count = 0; // same goes here.

even though we were signalling: buf_info.flags = B_MULTI_BUFFER_PLAYBACK; exclusively.

The second part, seemed to be related to buffer_exchange being a bit slow at return times.

We differed, compared to other audio drivers, in not using interrupts nor threads to update:

stream->buffer_cycle = (stream->buffer_cycle + 1) % BUFFERS;
stream->real_time = system_time();
stream->frames_count += FRAMES_PER_BUFFER;

at 3AM trying this approach did not sound farfetched, and showed effects issue #3 seemed to be fixed and could finally get back into the desktop.

I decided to push this commit, with two goals in mind: to see if issue #3 wasn't just solved in my machine, and to show progress in fixing this issue (it did slow me down, quite a lot).

But I do agree with you, I'd rather have this in buffer_exchange(), which is what I'll be working up to as of now.

While maybe I shouldn't have pushed a half-fix, it worked but its probably not the best approach, I appreciate the feedback.

korli commented 1 month ago

Well, if you wait in buffer_exchange() then update buf_info only when queue_dequeue succeeds with a status OK, everything should line up.

diegoroux commented 1 month ago

Thanks for all your help.