Issue in pcm_dsnoop.c in alsa-lib

TE-N-ShengjiuWang commented 2 years ago

Hi Takashi Iwai, Jaroslav Kysela

We encountered an issue in the pcm_dsnoop use case, could you please help to have a look?

Issue description: With two instances for dsnoop type device running in parallel, after suspend/resume, one of the instances will be hung in memcpy because the very large copy size is obtained.

#3 0x0000ffffa78d5098 in snd_pcm_dsnoop_sync_ptr (pcm=0xaaab06563da0)
at pcm_dsnoop.c:158
dsnoop = 0xaaab06563c20
slave_hw_ptr = 64
old_slave_hw_ptr = 533120
avail = 187651522444320

Reason analysis: The root cause that I analysis is that after suspend/resume, one instance will get the SND_PCM_STATE_SUSPENDED state from slave pcm device, then it will do snd_pcm_prepare() and snd_pcm_start(), which will reset the dsnoop->slave_hw_ptr and the hw_ptr of slave pcm device, then the state of this instance is correct. But another instance may not get the SND_PCM_STATE_SUSPENDED state from slave pcm device because slave device may have been recovered by first instance, so the dsnoop->slave_hw_ptr is not reset. but because hw_ptr of slave pcm device has been reset, so there will be a very large "avail" size.

Solution: I didn't come up with a fix for this issue, seems there is no easy way to let another instance know this case and reset the dsnoop->slave_hw_ptr, could you please help?

perexg commented 2 years ago

https://lore.kernel.org/alsa-devel/s5hfsnzt4fd.wl-tiwai@suse.de/T/#t

TE-N-ShengjiuWang commented 2 years ago

Thanks for the fix. But it doesn't solve the issue totally,
I do two modification base on your fix, after test, the issue can't be reproduced. could you please review?

diff --git a/src/pcm/pcm_direct.c b/src/pcm/pcm_direct.c
index 78716f229a14..5bff4e0836e0 100644
--- a/src/pcm/pcm_direct.c
+++ b/src/pcm/pcm_direct.c
@@ -659,6 +659,14 @@ int snd_pcm_direct_slave_recover(snd_pcm_direct_t *direct)
  */
 int snd_pcm_direct_client_chk_xrun(snd_pcm_direct_t *direct, snd_pcm_t *pcm)
 {
+
+       /*
+        * Add semaphore protection for getting the 'shmptr->s.recoveries'.
+        * As another instance maybe in snd_pcm_direct_slave_recover() but
+        * haven't update the 'shmptr->s.recoveries', then this instance
+        * can't return correct state.
+        */
+       snd_pcm_direct_semaphore_down(direct, DIRECT_IPC_SEM_CLIENT);
        if (direct->shmptr->s.recoveries != direct->recoveries) {
                /* no matter how many xruns we missed -
                 * so don't increment but just update to actual counter
@@ -676,6 +684,7 @@ int snd_pcm_direct_client_chk_xrun(snd_pcm_direct_t *direct, snd_pcm_t *pcm)
                else
                        direct->state = SND_PCM_STATE_XRUN;
        }
+       snd_pcm_direct_semaphore_up(direct, DIRECT_IPC_SEM_CLIENT);
        if (direct->state == SND_PCM_STATE_XRUN)
                return -EPIPE;
        else if (direct->state == SND_PCM_STATE_SUSPENDED)
diff --git a/src/pcm/pcm_dsnoop.c b/src/pcm/pcm_dsnoop.c
index 1653f6fba86c..ee7506bff442 100644
--- a/src/pcm/pcm_dsnoop.c
+++ b/src/pcm/pcm_dsnoop.c
@@ -146,14 +146,22 @@ static int snd_pcm_dsnoop_sync_ptr(snd_pcm_t *pcm)
        default:
                break;
        }
-       err = snd_pcm_direct_client_chk_xrun(dsnoop, pcm);
-       if (err < 0)
-               return err;
        if (dsnoop->slowptr)
                snd_pcm_hwsync(dsnoop->spcm);
        old_slave_hw_ptr = dsnoop->slave_hw_ptr;
        snoop_timestamp(pcm);
        slave_hw_ptr = dsnoop->slave_hw_ptr;
+
+       /*
+        * FIXME: Move snd_pcm_direct_client_chk_xrun after getting the
+        * dsnoop->spcm->hw.ptr. If the snd_pcm_direct_slave_recover()
+        * of another instance happening before dsnoop->spcm->hw.ptr
+        * is got, then a wrong spcm->hw.ptr is got which cause a wrong
+        * 'diff' data later.
+        */
+       err = snd_pcm_direct_client_chk_xrun(dsnoop, pcm);
+       if (err < 0)
+               return err;
        diff = pcm_frame_diff(slave_hw_ptr, old_slave_hw_ptr, dsnoop->slave_boundary);
        if (diff == 0)          /* fast path */
                return 0;

By the way, the RECOVERIES_FLAG_SUSPENDED is in BIT(31), do we need to consider the overflow of recovery counter? if it overflow to BT(31), then the FLAG_SUSPEND should be wrongly set.

Best regards Wang shengjiu

tiwai commented 2 years ago

On Fri, 04 Mar 2022 09:29:54 +0100, Shengjiu Wang wrote:

Thanks for the fix. But it doesn't solve the issue totally, I do two modification base on your fix, after test, the issue can't be reproduced. could you please review?

Could you rather submit a proper patch?

I hesitated to apply the semaphore around snd_pcm_direct_client_chk_xrun() as it's a hot path. But it seems that we have to do it in anyway, and if any, this could be another patch (the function is already present and used for xrun checks). And, we can implement a fast-path there as well; if direct->state is XRUN or SUSPENDED at the beginning of the function, returns immediately.

Or, we may move the update of recoveries count in snd_pcm_direct_slave_recover() earlier, almost at the beginning of the procedure. Then other in parallel can catch the change, practically works even without semaphore.

By the way, the RECOVERIES_FLAG_SUSPENDED is in BIT(31), do we need to consider the overflow of recovery counter? if it overflow to BT(31), then the FLAG_SUSPEND should be wrongly set.

It's masked with RECOVERIES_MASK.

Takashi

TE-N-ShengjiuWang commented 2 years ago

Thanks

On Fri, 04 Mar 2022 09:29:54 +0100, Shengjiu Wang wrote: Thanks for the fix. But it doesn't solve the issue totally, I do two modification base on your fix, after test, the issue can't be reproduced. could you please review? Could you rather submit a proper patch? I hesitated to apply the semaphore around snd_pcm_direct_client_chk_xrun() as it's a hot path. But it seems that we have to do it in anyway, and if any, this could be another patch (the function is already present and used for xrun checks). And, we can implement a fast-path there as well; if direct->state is XRUN or SUSPENDED at the beginning of the function, returns immediately. Or, we may move the update of recoveries count in snd_pcm_direct_slave_recover() earlier, almost at the beginning of the procedure. Then other in parallel can catch the change, practically works even without semaphore.

I saw you have update the origin/topic/pcm-direct-resume branch, I test your latest change, it is more stable than before, but still meet once of the issue after overnight test, it it very very low possibility.

So I suggest if we need to do below change, shall we?

diff --git a/src/pcm/pcm_dsnoop.c b/src/pcm/pcm_dsnoop.c
index 729ff447b41f..cc333b3f4384 100644
--- a/src/pcm/pcm_dsnoop.c
+++ b/src/pcm/pcm_dsnoop.c
@@ -134,14 +134,21 @@ static int snd_pcm_dsnoop_sync_ptr(snd_pcm_t *pcm)
        snd_pcm_sframes_t diff;
        int err;

-       err = snd_pcm_direct_check_xrun(dsnoop, pcm);
-       if (err < 0)
-               return err;
        if (dsnoop->slowptr)
                snd_pcm_hwsync(dsnoop->spcm);
        old_slave_hw_ptr = dsnoop->slave_hw_ptr;
        snoop_timestamp(pcm);
        slave_hw_ptr = dsnoop->slave_hw_ptr;
+       /*
+        * FIXME: Move snd_pcm_direct_client_chk_xrun after getting the
+        * dsnoop->spcm->hw.ptr. If the snd_pcm_direct_slave_recover()
+        * of another instance happening before dsnoop->spcm->hw.ptr
+        * is got, then a wrong spcm->hw.ptr is got which cause a wrong
+        * 'diff' data later.
+        */
+       err = snd_pcm_direct_check_xrun(dsnoop, pcm);
+       if (err < 0)
+               return err;
        diff = pcm_frame_diff(slave_hw_ptr, old_slave_hw_ptr, dsnoop->slave_boundary);

By the way, the RECOVERIES_FLAG_SUSPENDED is in BIT(31), do we need to consider the overflow of recovery counter? if it overflow to BT(31), then the FLAG_SUSPEND should be wrongly set. It's masked with RECOVERIES_MASK. Takashi

perexg commented 2 years ago

For reference, this issue was fixed in bb9f258e192980bb6cbef74ac57e348553805a0c .

alsa-project / alsa-lib

Issue in pcm_dsnoop.c in alsa-lib #213