HidekiKawahara / legacy_STRAIGHT

A vocoder framework that has been widely used in the research community since 1999.
Apache License 2.0

Discrepancy in multanalytFineCSPB.m and Eurospeech 1999 paper #6

Status: Closed (closed by tikuma-lsuhsc 1 month ago)

tikuma-lsuhsc commented 1 month ago

Kawahara-san,

Thank you for making STRAIGHT available on GitHub. Having seen its F0 detection performance in a 2022 paper by Vaysse et al. [2], I'm currently porting the exstraightsource() function to Python so I can run a more in-depth analysis with pathological voice data. In doing so, I found a likely discrepancy in the generation of the filter impulse response in multanalytFineCSPB(), lines 41-49. In the Eurospeech 1999 paper [1], you have a trio of equations that correspond to these lines:

$$ \begin{align} w_s(t,\lambda) &= w(t,\lambda) ⋆ h(t,\lambda) \\ w(t,\lambda) &= \exp\left(-\frac{\lambda^2 t^2}{4\pi \eta^2}\right) \exp(j\lambda t) \\ h(t,\lambda) &= \begin{cases} 1-\left| \frac{\lambda t}{2\pi\eta} \right| & \text{if } \left| t \right| < \frac{2\pi\eta}{\lambda} \\ 0 & \text{otherwise} \end{cases} \end{align} $$

However, the MATLAB code appears to implement:

$$ w_s(t,\lambda) = \left[ \exp\left(-\frac{\lambda^2 t^2}{4\pi \eta^2}\right) ⋆ h(t,\lambda) \right] \exp(j m \lambda t) $$

where $m$ is the harmonic number under assessment.

Is this an evolution of the algorithm from '99 to '18, a bug, or am I overlooking something?

Thank you, Kesh Ikuma

[1] H. Kawahara, H. Katayose, A. de Cheveigné, and R. D. Patterson, “Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity,” in 6th European Conference on Speech Communication and Technology (Eurospeech 1999), ISCA, Sep. 1999, pp. 2781–2784. doi: 10.21437/Eurospeech.1999-613.

[2] R. Vaysse, C. Astésano, and J. Farinas, “Performance analysis of various fundamental frequency estimation algorithms in the context of pathological speech,” Journal of the Acoustical Society of America, vol. 152, no. 5, pp. 3091–3101, 2022, doi: 10.1121/10.0015143.
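The two constructions compared in this comment can be checked numerically. The following is a minimal sketch, not the STRAIGHT code: it assumes the paper's $w$ is a Gabor function (Gaussian envelope multiplied by the complex carrier) and that $h$ is a triangular window supported on $|t| < 2\pi\eta/\lambda$. The function names, the sampling rate, and the values of $\lambda$ and $\eta$ are hypothetical choices made only for illustration.

```python
import numpy as np

def gaussian(t, lam, eta):
    """Gaussian envelope exp(-lam^2 t^2 / (4 pi eta^2))."""
    return np.exp(-(lam * t) ** 2 / (4.0 * np.pi * eta ** 2))

def triangle(t, lam, eta):
    """Triangular window h(t, lam); zero for |t| >= 2 pi eta / lam."""
    return np.maximum(0.0, 1.0 - np.abs(lam * t / (2.0 * np.pi * eta)))

def paper_filter(t, lam, eta):
    """Modulate first, then smooth: (Gaussian x carrier) convolved with h."""
    dt = t[1] - t[0]
    gabor = gaussian(t, lam, eta) * np.exp(1j * lam * t)
    return np.convolve(gabor, triangle(t, lam, eta), mode="same") * dt

def code_filter(t, lam, eta, m=1):
    """Smooth first, then modulate: (Gaussian convolved with h) x carrier at m*lam."""
    dt = t[1] - t[0]
    envelope = np.convolve(gaussian(t, lam, eta), triangle(t, lam, eta), mode="same") * dt
    return envelope * np.exp(1j * m * lam * t)

# Hypothetical settings: 100 Hz channel, eta = 1, 8 kHz sampling (all assumptions).
fs = 8000.0
lam = 2.0 * np.pi * 100.0
eta = 1.0
half = int(round(4.0 * (2.0 * np.pi * eta / lam) * fs))  # cover four window half-widths
t = (np.arange(2 * half + 1) - half) / fs                # odd length, t = 0 at the centre

ws_paper = paper_filter(t, lam, eta)
ws_code = code_filter(t, lam, eta, m=1)
print("max |difference|:", np.max(np.abs(ws_paper - ws_code)))
```

Under these assumptions the two impulse responses are not equal even for $m = 1$, because convolution and modulation do not commute. In the frequency domain, the first form multiplies the Gaussian response shifted to $\lambda$ by the unshifted response of $h$, while the second shifts the product of both responses to $m\lambda$.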

HidekiKawahara commented 1 month ago

Dear Ikuma-san,

Thank you for your mail. In short, the GitHub implementation is correct, and the article you cited is obsolete. The GitHub implementation corresponds to the following article, although that article does not describe it in detail.

Kawahara, H., de Cheveigné, A., Banno, H., Takahashi, T., & Irino, T. (2005). Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT. Proc. Interspeech, 537–540.

Thank you again.

Best regards, Hideki

Hideki Kawahara, Emeritus Professor, Wakayama University, Japan https://web.wakayama-u.ac.jp/~kawahara/index_e.html

tikuma-lsuhsc commented 1 month ago

Thank you for your quick reply.

> the following article (Interspeech2005), although it does not have detailed descriptions.

Yes, this is the paper that I first looked up, but unfortunately it lacks the description of the modification.

Nonetheless, thank you for the information, and I'll continue porting the algorithm to Python as implemented. By the way, I'm planning to release it publicly on PyPI (if you're not familiar with Python, it's a rough equivalent of MATLAB File Exchange) unless you have an objection.

HidekiKawahara commented 1 month ago

Dear Ikuma-san,

Thank you again. I do not have any objection to your port. FYI, Morise-san and I are porting all the assets built on STRAIGHT (legacy and TANDEM) to Morise-san’s WORLD. Please see the following articles and the repository. I think the most usable F0 extractor for vocoders is Morise-san’s Harvest.

Hideki Kawahara, Masanori Morise, Interactive tools for making temporally variable, multiple-attributes, and multiple-instances morphing accessible: Flexible manipulation of divergent speech instances for explorational research and education, Acoustical Science and Technology, Article ID e24.43, Advance online publication June 13, 2024, Online ISSN 1347-5177, Print ISSN 1346-3969, https://doi.org/10.1250/ast.e24.43, https://www.jstage.jst.go.jp/article/ast/advpub/0/advpub_e24.43/_article/-char/en

Hideki Kawahara, Masanori Morise, Interactive tools for making vocoder-based signal processing accessible: Flexible manipulation of speech attributes for explorational research and education, Acoustical Science and Technology, 2024, Volume 45, Issue 1, Pages 48-51, Released on J-STAGE January 01, 2024, Online ISSN 1347-5177, Print ISSN 1346-3969, https://doi.org/10.1250/ast.e23.52, https://www.jstage.jst.go.jp/article/ast/45/1/45_e23.52/_article/-char/en

GitHub https://github.com/HidekiKawahara/worldGUItools

Best regards, Hideki

PS: For the F0 extractor comparison, please see the following movie and reference.

Movie: https://www.youtube.com/watch?v=iXnP1tIuVic

Kawahara, H., Yatabe, K., Sakakibara, K.-I., Kitamura, T., Banno, H., Morise, M. (2022) An objective test tool for pitch extractors' response attributes. Proc. Interspeech 2022, 659-663, doi: 10.21437/Interspeech.2022-800 https://www.isca-archive.org/interspeech_2022/kawahara22_interspeech.html
