genodelabs/genode (Genode OS Framework, https://genode.org/)

[mixer/audio_drv] can get stuck when fed by OSS #4727

Closed ttcoder closed 6 months ago

ttcoder commented 1 year ago

Started using OSS (vfs_oss.lib.so) in my radio-station apps, but sometimes (every couple of hours or so) audio stops. After various attempts at understanding the problem, I noticed something important: if I open an mp3 in another (different) application, that new audio playback "unsticks" the first application's playback. That hints at a problem server-side (in the mixer), rather than client-side. So I went on to instrument the mixer.

If I add tracing in the mixer, e.g. in the _handle() hook, the bug occurs after a few seconds (instead of several hours!) and shows that the _handle() hook stops being called. If I re-attempt to play the mp3 file from the beginning, playback restarts (the _handle() hook resumes being called as it should) and again stalls after a few seconds. Easily reproducible here on my R61 ThinkPad.

So I guess that points the finger even further upstream, at pci_audio_drv itself rather than the mixer? Now I'm going to add tracing to pci_audio_drv, find out which part of it sends the signal, and add a Genode::log() call there.

ttcoder commented 1 year ago

Turns out that pci_audio_drv does send the signal, but said signal is not received and/or processed by the mixer.

My dde_bsd instrumentation:

extern "C" void notify_play()
{
    if (_play_sigh.valid())
    {
        GENODE_LOG_TSC( 100 );
        //Genode::log( "    notify_play()" );
        Genode::Signal_transmitter(_play_sigh).submit();
    }
    else
    {
        GENODE_LOG_TSC_NAMED( 1, "notify_play: INVALID sigh......." );
        //Genode::log( "    notify_play() : INVALID sigh" );
    }
}

I do get an uninterrupted flow of the first GENODE_LOG_TSC logging, even after the mixer stops logging.

How is it possible for the signal to be sent but not received? Presumably we can be confident it is sent, and sent to the right "target", since the code gets past the _play_sigh.valid() check.

(Am I looking at the right sigh, by the way? That's the one set in handle_announce_session().)

Anyway, that would mean the audio driver is doing its job, and it is in fact the mixer that is unresponsive to signals. Could it be stuck in an infinite loop? Well, not "infinite", since the mixer can be "unstuck" by starting a new session. Maybe that's where I should dig next: what happens when a new mixer session is created that can "unstick" it and make it receive signals again...
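(For reference, the receiving side normally looks something like the sketch below: a component binds a Genode::Signal_handler to its entrypoint and hands the resulting capability to the peer, which then targets it with a Signal_transmitter. This is not the actual mixer code, just a minimal illustration of the mechanism that apparently stops firing; the names are made up.)

#include <base/component.h>
#include <base/log.h>
#include <base/signal.h>

struct Receiver
{
    Genode::Env &_env;

    /* executed by the entrypoint whenever the registered signal arrives */
    void _handle()
    {
        Genode::log("progress signal received");
        /* the real mixer would mix() and advance() here */
    }

    Genode::Signal_handler<Receiver> _handler {
        _env.ep(), *this, &Receiver::_handle };

    Receiver(Genode::Env &env) : _env(env) { }

    /* this capability is what the sender's Signal_transmitter targets */
    Genode::Signal_context_capability sigh() { return _handler; }
};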

EDIT:

Okay, with the logging cleaned up, I now see a big hint: at the very moment the mixer stops processing signals, there is a client-side warning (from the application connected to the mixer) like this:

warning: vfs_oss: underrun(2)

So I guess the enigma now looks like this: what is it that happens to the mixer that makes it...

cnuke commented 1 year ago

Thanks for investigating the problem.

Without having the overall picture in mind, I would suggest instrumenting the mixer with regard to how it advances each session. It should only get "stuck" when all sessions are stopped, because I would expect the mixer to send a progress signal otherwise. Maybe the underrun condition somehow leads to a situation where, from the mixer's perspective, no session is active any more.

The audio driver, on the other hand, should only send a progress signal when it has played an audio packet (at least that is the behaviour on master, where the driver does not progress if the current packet is in fact invalid). Since this indeed appears to be the case, i.e. the mixer still submits packets to the driver, I am wondering if only the clients of the mixer are "stuck".

I encountered a similar problem when using a client that uses the audio session in "streaming mode", i.e., it calls alloc_packet, submits some packets, and relies on the progress signal to submit the next packets in a continuous fashion. It does not manage the audio packets directly, like for example the mixer does(*). This component also appears to get stuck, and only if a client using the vfs_oss plugin, like cmus, is started does the first client start to play as well.

(*) I fear that one of the puzzle pieces is how the mixer manages the stream directly, so that would be one point of interest to check out.

ttcoder commented 1 year ago

It's reassuring that someone else is seeing the bug like I do :-)

My latest findings: I'm seeing 2 different bugs (or at least 2 symptoms of one bug):

1) sometimes there's a "soft" blocking of the client, i.e. it's possible to "nudge" the mixer by creating a new audio session, and the first audio client gets unblocked thanks to the second one.

2) sometimes there's a "hard" blocking, meaning that the first client remains stuck, no matter how much I "nudge" the mixer with new client sessions.

During Thursday and Friday I managed to reproduce the bug 3 times (I'm back to "multi-hour sessions required", unfortunately), twice of type 1) and once of type 2). The on-screen logging was not super useful, as it consisted of TSC-style logs with high divisors, so not "fresh/latest" event logging.

On the mixer side, I guess the mixer's _handle() is only client-driven, not pci_audio_drv-driven, so I should not focus on _handle() not being called by the audio driver. It's a two-liner that just calls mix() and advance(), and adding some gentle tracing (GENODE_LOG_TSC(50), so as to have slow-moving log lines in a separate terminal) makes it quite obvious that when the bug occurs, _handle() is indeed no longer executed; the mixer is inert, resulting in silent audio and a blocked client. EDIT: maybe pci_audio_drv chooses not to send a signal to the mixer when in play_silence() mode; in a way, that would make sense, but "further study needed".

So the theory would be: the client sometimes fails to "wake up" (nudge) the mixer, and the mixer is not designed to wake itself up; it's not going to call its own _handle() without external intervention.

But to validate that, I need to know which line is executed last in the client. I'm tempted to use some sort of "ring buffer" logging mechanism, with a small buffer size, to capture the last second or so of activity and determine which part of vfs_oss is called last. Could it be e.g. submit(left/right_packet) that fails to return? etc. It seems Genode::trace() would fit the bill, based on my reading of release notes and Genodians articles, but I can't get it to work: it asks for a "null" ROM, so I create an empty file with that name and serve it as a ROM, which silences the error message, but then no virtual subfolders are created for threads, despite what the doc describes. If I can't get it to work, I'll roll my own quick'n'dirty logging mechanism and report back in this ticket.

Depending on how that turns out, maybe the mixer will be to blame instead of the client; I'll know once I collect more hints. This whole thing's quite a struggle (I'm not a ninja any more these days, lol) but I'll hang in there and persevere.

cnuke commented 1 year ago

So the theory would be: the client sometimes fails to "wake up" (nudge) the mixer, and the mixer is not designed to wake itself up; it's not going to call its own _handle() without external intervention.

That is the most likely case, as the signal handler is used as progress_sigh by the Audio_out connection going to the driver. You could revert the commit from #4609, and that will execute pci_audio_drv and the mixer in tandem again.

But as the current driver behaviour is the intended one, the mixer should be made self-aware ☺.

ttcoder commented 1 year ago

Applied the "patches" and moving forward:

Yesterday the stoppage occurred after 9 hours of playback (with both changes applied).

The day before, it had taken only 2h30 (I had only applied the first change, i.e. commented out the return statement before play_silence()). With that baked in, the mixer keeps checking for data (even in the play_silence() case, as I can observe from its continuous logging, so the patch behaves as expected), but the client does not respond nor push anything -- so it's probably the culprit indeed.

I hope to find out more soon: I wrote a 100-line "ring logging" class I'm quite happy with. It should act as a black box of sorts, for a post-mortem. In both cases cited above, the client seemed to have gone into an infinite loop (high CPU fan spin rate, hot CPU exhaust), so I added a lot of (ring) logging in vfs_oss's write(). If it's stuck on any given line I'll know it, and if it's looping endlessly that should be apparent as well.
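Roughly, the idea behind the ring logger is something along these lines (simplified sketch, not the actual class; the names, sizes, and use of Genode::copy_cstring/Genode::log are illustrative):

#include <base/log.h>
#include <util/string.h>

class Ring_logger
{
    private:

        enum { SLOTS = 64, SLOT_SIZE = 128 };

        char     _slot[SLOTS][SLOT_SIZE] { };
        unsigned _count = 0;   /* total number of messages recorded so far */

    public:

        /* record one message, overwriting the oldest entry once the ring is full */
        void add(char const *msg)
        {
            Genode::copy_cstring(_slot[_count % SLOTS], msg, SLOT_SIZE);
            _count++;
        }

        /* post-mortem: print the retained messages, oldest first */
        void dump() const
        {
            unsigned const first = (_count > SLOTS) ? _count - SLOTS : 0;
            for (unsigned i = first; i < _count; i++)
                Genode::log("ring[", i, "] ", Genode::Cstring(_slot[i % SLOTS]));
        }
};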

EDIT:

Just got it to occur now, after 5 hours; the ring log mentions:

handle_out_progress handle_out_progress handle_out_progress handle_out_progress
handle_out_progress handle_out_progress handle_out_progress handle_out_progress

Looks like I should weed out the logging in that callback hook, as it "flushes" the more interesting contents from the write() method (and maybe I should increase the ring buffer size too). I wonder how the signal even gets processed if the thread is already busy-looping... Or maybe signal handling happens in a separate, dedicated thread, but then it should be blocked on a mutex or something... Anyway, that's a question for another day.

EDIT 2:

There are no timestamps in the output, but I believe (from peculiarities in the layout of my ring logger's output, etc.) that handle_out_progress() is not being called continuously; the above is in fact a snapshot of the last calls that were made before it stopped being called.

So, the theory: maybe it didn't get called when it should have been, i.e. vfs_oss is missing "one last call", the one extra call that would have unblocked it? I will check the code, but I think handle_out_progress() resets the blocked variable to false. So there could be a race condition (the hook is called to reset the variable a millisecond before the variable is assigned true) or some such. Either that, or there's an infinite loop elsewhere, logged in the (unfortunately) "flushed" part of the log.

ttcoder commented 1 year ago

A quick note about a discrepancy I found in the OSS code (probably harmless): https://github.com/genodelabs/genode/blob/25eac6b9e6fe3d0164e010cabc76a82fa7901601/repos/libports/src/lib/vfs/oss/vfs_oss.cc#L372

The comment says "This is usually the progress signal for the first packet after 'start()', which is invalid." but with my newfound logger I can observe the opposite: this if() gets triggered regularly all along audio playback, from the beginning to the end of the mp3 file being played. It seems to always follow a completed write().

(and fifo_samples_new always has a value like ca. 115000 or 116000)

EDIT:

Looking at the code online, I see a few differences from my 22.10 Genode repo, e.g. the usage of _vfs_user.wakeup_vfs_user(), which I recall is a recent Genode API change. Probably irrelevant to this discussion anyway.

ttcoder commented 1 year ago

OK, now I'm confused. In the online (recent) revision of vfs_oss.cc, the blocked variable is only written to, never read.

It's declared here: bool blocked = false;

And then it's assigned the value true, but no one ever reads that value back (unless Firefox's search is failing me).

Maybe there was some clean-up involved and the variable was forgotten during that clean-up?

cproc commented 1 year ago

The comment says "This is usually the progress signal for the first packet after 'start()', which is invalid." but with my newfound logger I can observe the opposite: this if() gets triggered regularly all along audio playback, from the beginning to the end of the mp3 file being played. It seems to always follow a completed write().

A possible explanation could be that the stream gets stopped because of buffer underrun at

https://github.com/genodelabs/genode/blob/25eac6b9e6fe3d0164e010cabc76a82fa7901601/repos/libports/src/lib/vfs/oss/vfs_oss.cc#L387

and then restarted on the next write(). But it's just a guess.

ttcoder commented 1 year ago

Well, in normal usage (middle of the audio stream) the stats show a nominal 90% of the audio buffer queue filled up, circa 2 or 3 seconds' worth I think, and I never get "under-run" warnings. So it does seem that the comment is misleading. I also instrumented the value of fifo_samples_new and it's consistently 115000 or 116000 (contrary to when the bug occurs, see below).

**

Got a 'clean' bug reproduction and logging today!

The last logged events are about halt_output() (which we were discussing). The value of fifo_samples_new is 0 (empty). That's after starting a new track, which is where the bug always seems to happen. At that point the program is dead; nothing happens past that. So here's what seems to be happening:

There might be something else at play though, explaining the high CPU fan spin rate: maybe the VFS layer (sitting between the application's ::write() and the plugin's vfs_oss write()) has gone into an infinite loop? Seems unlikely though. Probably not important.

Anyway, the basic problem seems to be that the client lives and dies by server signals and client writes, but sometimes it ends up in a state where neither of those occurs, typically after starting a new track.

ttcoder commented 1 year ago

Here's the log I captured, edited for clarity. If the bug occurred during "overlap" transition time as I suspect, it means there were two tracks playing (each with its own vfs_oss file descriptor and thread calling write()); not sure.

Anyway, it shows the plugin's write() being called and completing (presumably returning control to the application's side, since it's non-blocking?), but then the callback is called, detects an under-run, and halts output.

I'm going to review the code to understand how we go from fsn==116k to fsn==0 all of a sudden without exhausting 3 seconds of audio first. Or maybe there are 2 tracks playing and one track is confusing the other -- but I don't see any "static" variables that would be shared between 2 instances of the plug-in (whereby one plug-in's code execution would corrupt the state of the other instance), so that's probably not the problem.

Haop (handle audio output):
  special-case for First Packet (with fsn=115200)

[oss write] Write: non-blocking case
  alloc.. pos 196 submit..
  alloc.. pos 197 submit..
  alloc.. pos 198
  update_ofrag..
  Return (exit from write())
Haop (handle audio output):
  special-case for First Packet (with fsn=116224)
(..)
Haop (handle audio output):
  !Under-Run! with fsn=0  Audio.stopping+updatefrag.. Stop done!

EDIT:

Today's test: on a whim, I commented out the code that handles under-runs, specifically these 2 lines: halt_output(); _write_sample_offset = 0; After a couple of hours the bug occurred, and now the logging reports these, in a cycle:

Haop: under-run! with fsn= 130048
Haop: under-run! with fsn= 0   Special-case for first packet fsn=0
Haop: under-run! with fsn= 130048
Haop: under-run! with fsn= 0   Special-case for first packet fsn=0
Haop: under-run! with fsn= 130048
Haop: under-run! with fsn= 0   Special-case for first packet fsn=0

There is also a bunch of other cyclic messages: every 3 seconds I get the "warning: underrun" and an error 22 on DSP_RESET.

The above looks like two tracks with two distinct states, alternating lines... except that there is an instance of my logger object in each of the two vfs_oss plugin instances, and the above comes from a single logger instance. So it seems one vfs_oss instance can get "alternating" fsn values.

cnuke commented 1 year ago

Let me briefly outline the flow of the Audio_out session without the return statement in play_silence() (mind you, by now your recollection of the inner workings of the vfs_oss plugin is fresher than mine, so I am probably not of much help there):

The driver will loop through its out stream and send a progress signal for every packet - even if the packet is invalid. The mixer in turn will respond to that signal, advance its clients, and try to mix the packet at this position in the client's stream. This behaviour is the reason for some of the complexity in the vfs_oss plugin (calling reset() etc.). The cyclic message could correlate with Audio_out::QUEUE_SIZE, which covers around ~2.7s, so the mixer loops through the whole stream and that gets detected by the plugin. But all in all, the processing should never stop (the client might have to reset its stream and drop packets, though).

On the other hand, with the return statement, the driver and in turn the mixer will only make progress when one of the clients has nudged it (look for start()/stop() in the mixer, and _mix(), which is also triggered by config updates and new sessions). So it is likely the corner case you observed: the plugin has been stopped and waits for the mixer, while the mixer also waits for some client to make progress.

ttcoder commented 1 year ago

I finally replicated the (?) bug in Qemu, where it's much easier to read through the LOG history. That allowed me to spot an "unresolvable exception" message, whereby a critical thread is dead. The problem seems unrelated to OSS or media-related code, though (which would make this ticket invalid / user error, not for the first time in my case!):

[init -> AutoCast]   BTrack.CurrentTime...
[init -> AutoCast]   bmediafile.main.progress
(..snip..)
[init -> AutoCast] activeevents.Tick()
Warning: unresolvable exception 13, pd 'init -> AutoCast', thread 'pthread.20', cpu 0, ip=0x1779c84 sp=0x41efd540 bp=0x41efd570 no signal handler

The instruction pointer resolves to log():

> pc
0x1779c84 - 0x1319000
             4590724  0x00460c84         char: .....F..
> /usr/local/genode/tool/21.05/bin/genode-x86-addr2line -e /GenCharlie/_nimble_Master/dev_chiselapp.org/hog/genode-haiku/haiku-on-genode/system/kinda-kernel/haiku.lib.so
0x00460c84
/GenCharlie/_nimble_Master/dev_github.com/genode/repos/base/include/base/log.h:71
# that's a call to _release() inside log()

I no longer use signal handlers connected to the mixer in this week's code trial (my new code is entirely push-driven). Gotta sleep on it and think that over.

EDIT:

11feb23: in today's test so far, 120+ songs have been played without a hitch. Didn't change much, just adjusted the polling frequency of my "active media" thread.

EDIT 2: Looks like I spoke too soon, it just takes longer to lock up, hmmmm

ttcoder commented 1 year ago

So I'm back in business.

I've been working on two different paths for audio, lately focusing on the second one:

With all I've learned about the audio session, I re-wrote my code to be quite sweet and elegant (I don't use signals at all; I just check the value of queued() to decide whether to sleep or queue up more packets), but there remained an issue with occasional lock-ups every few hours -- a similar story to the one above. Long story short, it turns out my playlist-parser code accesses symbolic links, and sometimes that triggers a "packet operation 0 failed" error, which then locks up most everything, including audio. I haven't gotten to the bottom of it yet (might be due to my ntfs driver variant?), but at least I can now comment out that part of the code, and I'm rock solid again. Between yesterday and today I've played 50+180 songs and accumulated almost 20 hours of playback without a hitch, so it sure looks like I'm good. Later on I'll re-enable the faulty code, get to the bottom of the symlink stuff, and fix it for good.

I don't know if my vfs_oss troubles (in this ticket) were caused by the same root problem; I guess it's more likely than not.

Leaving this ticket open, as I hope to return to testing vfs_oss in a few weeks (in fact, my app should probably provide a GUI allowing the user to select the output mode that works best for them: either 'OSS' or 'native') and confirm that this ticket can be closed as "user error" or such.

cnuke commented 1 year ago

I am glad you finally found a workaround to the problem.

Since it seems you spent a substantial amount of time looking into the various components (driver, mixer, and OSS VFS plugin), are your notes (if you have any, that is) by chance available somewhere? As you may have stumbled upon one or the other "subtlety" of the current Audio_out session's implementation, these could help smooth things out.

ttcoder commented 1 year ago

The one "biggie" I should share, is about the queued() accessor : sometimes (during under-runs ?) it goes wildly off, it returns "random" values, up to 2500 (i.e. ten times more than the theoritical ceiling of 255).

So I've changed my original code (pseudo-code ahead, I'm not on my dev laptop)...

const int threshold = 32;  // low latency -- will use 64 or even 128 for high-robustness/no-underrun-silence
if( stream->queued() < threshold )
{
    decode_and_queue_up();
}

... to add this warning:

...
if( stream->queued() < threshold )
{
    decode_and_queue_up();
}
else if( stream->queued() == threshold )
{
    // do nothing -- stop adding to the pipeline, we have enough
}
else
{
    Genode::warning( "mixer queue is above threshold, which is theoretically impossible; we're probably in under-run and playing silence for the next second or so: ", stream->queued(), " -- please consider increasing latency/threshold!" );  // the value can be anything between 33 and 2500...
}
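
For context, here is roughly how that check sits in my feeder loop (again pseudo-code: 'stream', 'threshold', and decode_and_queue_up() are the placeholder names from above, 'timer' stands for a Timer::Connection, and the sleep period is an arbitrary example value):

void feeder_loop()
{
    for (;;) {

        /* top up the packet queue until the desired latency threshold is reached */
        while (stream->queued() < threshold)
            decode_and_queue_up();

        /* anything above the threshold should be impossible -- see the warning above */
        if (stream->queued() > threshold)
            Genode::warning("impossible queue count: ", stream->queued());

        /* sleep roughly one packet duration before topping up again */
        timer.msleep(10);
    }
}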

With all I've learned and all the infrastructure in place, I'm going to fine-tune this into a sweet, low-latency, high-reliability audio module :-)

I most definitely plan to commit my audio code to my Fossil repository, probably soon after you guys release 23.02; but even so, I'll probably revisit my notes and try to distill any remaining nuggets of info of interest to you, though the one above seems more important than any other.

PS - in the other ticket I'll also mention that I definitely have to comment out the "return" in pci_audio_drv, otherwise the audio driver does not work on my ThinkPads here. It's trivial enough that I can maintain that one-line "patch" on my side; no need for any change or reversal of direction in upstream Genode, I think.

Edit (5mar23)

I'm stuck debugging a vfs plug-in memory leak; hopefully it will be fixed in a few days, then I can resume my commits of media code as promised.

Edit (12mar23)

Looks like I'm good to go again. Resuming commits. They will make it possible to replicate the queued() bug on 22.10, and then I'll make commits that bring my code up to 23.02.

ttcoder commented 1 year ago

I've pushed my audio-out code at last! Here's the commit:

http://chiselapp.com/user/ttcoder/repository/genode-haiku/info/2dd1f3902da2ada7

Here's one of the lines that refers to the queued() problems:

https://chiselapp.com/user/ttcoder/repository/genode-haiku/file?ci=2dd1f3902da2ada7fc52a49a3d71bdd81f357d3b&name=genode-haiku/haiku-on-genode/kits/media/_ge_AudioOut_native.cpp&ln=173-174

Here's a (trimmed down) "grep FIXME":

_bmediafile_Temp.cpp:210:       ///FIXME-2: nope, makes no difference, still getting "impossible queue count"... seems harmless though ?
_ge_AudioOut_native.cpp:135:///FIXME-2: strange, I can't get this message to disappear, even if I enlarge buffer sizes and the queue threshold and decrease the snooze time... I get this situation at the beginning of each mp3, for a second or two, even after configuring to queue up 64 packets... File a ticket ? Or am I using the Audio_out session wrong ?
_ge_AudioOut_native.cpp:136:///FIXME-2: also, why can't I play two mp3s simultaneously? That's the very thing that the mixer ought to support :-P  Yet starting a second mp3 in eg ArmyKnife yields choppy sound or no sound, even as the first mp3 in AC continues to play... What am I doing wrong ?
_ge_AudioOut_native.cpp:169:///FIXME-2: how come queued() returns 1 (instead of 0) the very first time we're called, before we even start to queue buffers ?

Plus something not mentioned in the source, only in my notes:

When a client disconnects, the mixer stops playback immediately (as long as that connection was the only/last one)... but then later resumes it, i.e. it "flushes" the remainder of the old client's packets the next time I create a new connection to play another mp3 file. I suppose the fact that the mixer has "left-over" data is unavoidable -- I cannot influence it from my end, those packets are out of the client's reach once submitted -- but it's not my intention that the mixer should resume playback of them later. It might make sense to discard any left-over data from a client once that client disconnects.

All of that is a bit off-topic for this ticket, but maybe it's best to use this ticket as a "work area" for problems related to Audio_out..!

Regarding the most important bug I experience (the lock-ups) specifically, it seems they are related to the VFS rather than to Audio_out. I wonder if they could be due to a missing "read_ready()" call somewhere in the vfs stack that only impacts my use case (a thought inspired by watching a ticket exchange between nfeske and trimpin in #4785 today).

nfeske commented 1 year ago

Thanks @ttcoder for sharing your findings and thoughts. I'm unfortunately not well versed in the audio path (mixer, audio sessions, oss plugin), so I cannot be of much help with diagnosing the driver/mixer/session-related parts.

I wonder if they could be due to a missing "read_ready()" call somewhere in the vfs stack

The said read-ready mechanism is needed in situations where input happens sporadically. It was added to accommodate, e.g., a shell waiting for user input from a terminal. In essence, it allows a VFS plugin to influence the readability state of a file descriptor in an application's select fd set. In other words, when the user presses a key in the terminal, the select call of the shell returns and can process the key.

I don't know if the fds of the oss device are typically managed via select. Is your program using select? What I see is that the vfs_oss code implements read_ready by evaluating a condition of some sort. That means, under some circumstances it may indicate to select that blocking is in order. But I see no call of read_ready_response anywhere nearby. So the unblocking - once the condition becomes satisfied - may never happen.
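For illustration, the client-side pattern this mechanism serves would be something along these lines (hypothetical snippet; 'dsp_fd' is an assumed file descriptor of the OSS device):

#include <sys/select.h>

/* block in select() until the OSS fd is reported readable by the VFS plugin */
static void wait_until_readable(int dsp_fd)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(dsp_fd, &rfds);

    /* returns once the plugin signals read readiness for this fd */
    select(dsp_fd + 1, &rfds, nullptr, nullptr, nullptr);
}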

To test your hypothesis, you could try changing the read_ready implementation(s) of the VFS plugin to always return true. This way, such an fd always fires when it is part of a select fd set, so no blocking happens. Naturally, this may cause a busy select loop, which is probably not what you want in the end. But for testing, it is worth trying.
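A minimal sketch of that test change could look as follows (illustrative only - the exact read_ready signature differs between Genode releases, so adapt it to the one in your tree):

/* testing aid, not a fix: report the handle as always readable so that
   select() never blocks on this fd (signature may differ per release) */
bool read_ready(Vfs::Vfs_handle *) override
{
    return true;
}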

nfeske commented 1 year ago

@ttcoder In the scenario where you observe the strange blocking, do you have the vfs_oss plugin mounted in your application's local VFS or in a separate VFS server?

cproc commented 1 year ago

What I see is that the vfs_oss code implements read_ready by evaluating a condition of some sort. That means, under some circumstances it may indicate to select that blocking is in order. But I see no call of read_ready_response anywhere nearby. So the unblocking - once the condition becomes satisfied - may never happen.

I changed the read_ready() implementation from always returning true to the condition evaluation in commit 33402e4. If this change required the addition of a call of read_ready_response(), that call is indeed missing. I think the location to add it would be in Vfs::Oss_file_system::Data_file_system::_handle_audio_in_progress(), and based on the implementation in terminal_file_system.h it should probably look like this?

_handle_registry.for_each([] (Registered_handle &handle) {
    if (handle.notifying) {
        handle.notifying = false;
        handle.read_ready_response();
    }
});
ttcoder commented 1 year ago

@ttcoder In the scenario where you observe the strange blocking, do you have the vfs_oss plugin mounted in your application's local VFS or in a separate VFS server?

Currently each application (AutoCast, ArmyKnife, etc.) has its own vfs_oss mounted at its /dev/dsp... which, incidentally, means the first app (e.g. AK) gets "Audio-in" for itself, and the next in line (e.g. AC) gets a message like "Audio input busy", unless audio input is redirected to a black hole. No such problem for Audio-out; the mixer of course does not have such a limitation. It had never occurred to me to have a single vfs_oss in a VFS server, though it's an intriguing prospect :-) That might help with the audio-in part (a feature which I might need next year), though I'm worried it would lead to increased latency; it always feels nicer to the user when clicking a 'start' or 'stop' button reacts instantly, with very low latency.

I'm going to file a ticket specific to my "main problematic bug" soon, now that I've gathered data. It will be titled something like "packet operation=0 failed", since that happens to be the error() I see a lot when my apps run, and I suspect it's related to the "lock-up" that I see every few hours, since it seems to follow on the heels of such an error when it does occur. Most of the time the error seems harmless, so I hadn't been digging very deep into it yet; I have looked just enough to understand that the error message is printed whenever my vfs_ntfs code fails to find an xattr and returns READ_ERR_INVALID in this file to signify that the xattr does not exist... So I assume that sometimes the xattr-not-found situation occurs at the same time as "something else" and triggers a corner case. More soon.

To get back on topic: I can't promise to experiment on vfs_oss soon, as I'm going full steam with the alternate (audio-out_Native) code right now, since it works 'good enough' for sending an alpha build to the rest of my crew soon. When that's done I'll get a breather :-)

EDIT: re. the previous post: I don't use select() in my audio-out_Oss code. But from what I remember when using that alternate path, it had its own blockage (different from the one I experience now?) when calling write() to write the audio samples to the OSS plug-in, and I think this might be what Christian is referring to: the OSS plug-in sometimes cannot handle a write() in one go; it has to wait for the consumer to consume buffers and "make room" for more data before actually writing the data and unblocking the client (or something like that).

EDIT (20mar23):

I've got to amend what I said previously about the need to patch the driver to get audio output: today I upgraded to Genode 22.11, did a clean build of everything (patching only for xattr as usual), and the unmodified pci_audio_drv and unmodified mixer do produce audio output on my T410, so my earlier problem was probably just a fluke. That's one less item on the list.

Also, it's starting to look like a few of the FIXMEs mentioned above are "resolved" by doing that clean build; I'm not seeing many "impossible queue count" warnings any more now.

EDIT (21mar)

THe "impossible queue counts" are less frequent but I'm still seeing them. However the most interesting thing is, I'm now seeing "stalls" (soft stops), sometimes, when starting a new mp3. Those are not lock-ups : if I click the Start button of AutoCast, it can start a new mp3. This should be easier to debug than the previous bug, by virtue of being non blocking, and they seem to occur after just a dozen songs.

nfeske commented 1 year ago

It had never occurred to me to have a single vfs_oss in a VFS server, though it's an intriguing prospect

I did not mean to suggest that, just wanted to get a clearer picture. ;-)

ttcoder commented 1 year ago

OK, I've found an even easier way to trigger a bug than the one described in the EDITs above: I just click the Start button a few times in AutoCast, and then I end up in that stuck state.

This is all with a fresh build of Genode 22.11 and unpatched media code, so this should be my most relevant observation yet. (Edit: this is using my "native" back-end rather than the vfs_oss-based back-end, BTW.)

Maybe I should in fact patch pci_audio_drv after all and remove the return statement before play_silence()? Or maybe this time the problem is in the mixer rather than in the driver? I'll dig and report back here.

EDIT: that was it indeed

All the above symptoms are negated by removing the 'return' statement that prevents advancement in pci_audio_drv. I've now witnessed 4 hours of straight playback. Gotta mention that in the wiki of my repository: "please apply such and such patch to Genode before building HoG for maximum reliability" ;)

Now with some luck, the old bug ("hard lock-up/stuck application") will be gone too ? Let's test for a few days

EDIT (27mar23)

Yup, the Genode upgrade combined with the return tweak was the trick indeed; cruisin' along and preparing to do an actual release (!) soon-ish.

ttcoder commented 1 year ago

Upgraded to 23.02, same thing: audio stops regularly until I apply my 'return' patch (i.e. remove the return statement before play_silence()); then it's flawless.

My plan is to upgrade to 23.05 next and then close this ticket (either as "fixed", if by chance the return patch is no longer needed, or as an "archived" ticket for a problem that affects just me but should be kept in mind if others are affected in the future).

nfeske commented 1 year ago

It looks like your patch should best be applied upstream, but I'm not really in the position to make this assessment. @cnuke would you consider preparing a patch, or give advice on how to proceed?

cnuke commented 1 year ago

The return statement was added back in intentionally by 003247d (I touched on the implications in https://github.com/genodelabs/genode/issues/4727#issuecomment-1416133073).

The way forward would be to adapt the mixer to the changed behavior - I'll create an issue documenting the current situation.

nfeske commented 6 months ago

As the mixer has been superseded by the record_play_mixer, this issue has become stale.

@ttcoder would you agree to close this one and discuss further audio troubles in new issues?

ttcoder commented 6 months ago

Indeed, closing!

P.S.: I could still be on track to migrate nicely to the new mixer via OSS: it turns out the "LD" linker problem I thought I had was in fact yet another mistake I made in mis-routing ROMs, meaning the binary ROM of vfs_oss.lib.so was not read correctly. With that in mind, I'll return to the new mixer soon, once I sort out my driver_manager work.

nfeske commented 6 months ago

@ttcoder thanks for responding so quickly! :)

ttcoder commented 6 months ago

Well I saw the open ticket count getting closer to 100 and thought I'd give it a nudge to help it get below the threshold 😁 (down to 98 as of this writing)