xieyuankun opened this issue 3 weeks ago
I'm also curious about the first question. It looks like no real (bonafide) samples are used for training the attribute classifiers. In fact, the attribute classifiers are constrained to predict on the probability simplex of each attribute set, meaning that in phase 2 the real samples will be implicitly assigned (soft) attributes of the attacks. Maybe the detection task works because the real samples are assigned attribute combinations that are not encountered for any of the actual attacks. But I would like the authors to confirm this hypothesis.
Relatedly, it's also not obvious to me why the detection performance is not correlated at all with the attribute characterization performance: e.g., SSL-AASIST works best for detection (Table I), but worst for attribute characterization (Figure 3).
Regarding
- Calculation for Spoofing Attack Attribution
For this task, I don't think they use the attribute labels, but the attack categories A01, A02, A03, etc. Recall that their model is x → e → p → a, where x is the input utterance, e the countermeasure (CM) embedding, p the probabilistic attribute embedding, and a the predicted attack category.
Regarding
- Training and Testing Protocol
For the evaluation of attack attribution on the test set, they mention in the caption of Table II that they have used only attacks A16 (which is the same as A04) and A19 (which is the same as A06).
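This equivalence can be verified directly from the 25-dimensional attribute vectors quoted elsewhere in this thread; copying just the relevant entries (variable name mine):

```python
# Attribute vectors copied from the opening post.
attrs = {
    "A04": [1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0],
    "A06": [0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1],
    "A16": [1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0],
    "A19": [0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1],
}

assert attrs["A16"] == attrs["A04"]  # A16 has exactly the same attributes as A04
assert attrs["A19"] == attrs["A06"]  # A19 has exactly the same attributes as A06
```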
But I think what you are alluding to, and what would be very interesting, is to use the attributes to describe the unseen attacks (A07, A08, A09, etc.) and carry out the evaluation at the attribute level. This would answer the question of whether we can describe an unseen attack in terms of its attributes.
Hi @xieyuankun, thank you for looking into our work. To answer your questions,
- Regarding Task Definition and Labels: In the first stage, you extract this 25-dimensional vector to be used for decision-tree attribution in the second stage. I was wondering if, in this setup, the category for genuine speech is also included within this 25-dimensional vector in the first stage? If not, I’d be curious to understand how this is addressed in the Spoofing Detection task in the second stage.
The objective here is to characterize spoofed speech based on known attributes related to its generation, which enhances explainability compared to raw countermeasure (CM) embeddings. In this work, we achieve this by using an MLP to generate probability scores (continuous values between 0 and 1), referred to as probabilistic attribute embeddings.
Since the focus is on characterizing spoofed speech based on generation attributes, genuine speech and any corresponding attributes are excluded during training.
Although the probabilistic attribute embeddings are designed to reflect generation-related attributes of spoofed speech, the embeddings for genuine speech are not expected to be sharply peaked at 0 or 1, because genuine speech does not follow the attributes used to describe spoofed-speech generation. As a result, these embeddings still contribute to effective spoofing detection in stage 2.
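To illustrate what such attribute heads can look like, here is a minimal sketch (not our exact implementation; the embedding size and hidden width are placeholders, and the attribute-set sizes are read off the 25-dimensional label vectors):

```python
import torch
import torch.nn as nn

EMB_DIM = 160                         # placeholder CM-embedding size
GROUP_SIZES = [2, 3, 3, 5, 3, 5, 4]   # attribute-set sizes (sum to 25)

class AttributeHeads(nn.Module):
    """One small MLP head per attribute set; each head outputs a softmax
    distribution over that set's options (a point on its probability simplex)."""
    def __init__(self, emb_dim=EMB_DIM, group_sizes=GROUP_SIZES):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(), nn.Linear(64, g))
            for g in group_sizes
        )

    def forward(self, e):
        # Concatenate the per-set probabilities into a 25-dim embedding.
        return torch.cat([head(e).softmax(dim=-1) for head in self.heads], dim=-1)

p = AttributeHeads()(torch.randn(1, EMB_DIM))
assert p.shape == (1, 25)
```

For a genuine utterance, nothing forces these per-set distributions to be peaked, which is why the resulting embeddings remain informative for detection.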
@danoneata, you are right. Thanks for sharing your understanding here.
- Calculation for Spoofing Attack Attribution: I noticed that each audio sample, like A01, is associated with multiple subcategories indicated by 1s in a vector like [1,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0]. For an attribution to be considered correct in your accuracy and F1 score metrics (in the Spoofing Attack Attribution task), does the model need to predict all the 1s and 0s accurately for a given sample?
As I mentioned above, the MLPs’ predicted softmax values need not be exactly 0 or 1. Also, as @danoneata pointed out, attribution is evaluated on the attack category rather than on the individual attribute labels.
- Training and Testing Protocol: I noticed that the results for Spoofing Attack Attribution are promising. I was curious whether these results were achieved by training on A01-A06 and then testing on A07-A16 within the 19LA dataset, or if there was another testing protocol applied here.
As rightly noticed by @danoneata, the evaluation of attribution is performed only with the A16 (same as A04 in training) and A19 (same as A06 in training) attacks.
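For concreteness, the stage-2 attribution protocol can be sketched as follows (random vectors stand in for the stage-1 probabilistic attribute embeddings; this illustrates the bookkeeping only, not our actual scripts):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-ins for stage-1 outputs: one 25-dim probabilistic embedding per utterance.
X_train = rng.random((600, 25))
y_train = rng.choice(["A01", "A02", "A03", "A04", "A05", "A06"], size=600)
X_eval = rng.random((200, 25))
y_eval = rng.choice(["A16", "A19"], size=200)

# The evaluation attacks map onto their training-set equivalents.
relabel = {"A16": "A04", "A19": "A06"}
y_eval = np.array([relabel[a] for a in y_eval])

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_eval)
print("accuracy:", accuracy_score(y_eval, pred))
print("macro F1:", f1_score(y_eval, pred, average="macro"))
```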
@xieyuankun we hope this settles your doubts. 🙂
Hi @danoneata, thank you for your interest and insights. As you mention,
Relatedly, it's also not obvious to me why the detection performance is not correlated at all with the attribute characterization performance: e.g., SSL-AASIST works best for detection (Table I), but worst for attribute characterization (Figure 3).
This is because CM systems are designed for spoofing detection rather than spoofing attack attribution. In the process, the AASIST system retains traces of attack types within its CM embeddings, while the SSL-AASIST system generalizes across different attack types.
You can observe from the figure that both systems achieve good separation between genuine and spoofed speech. However, across attack types, AASIST embeddings form distinct clusters, whereas SSL-AASIST embeddings exhibit significant overlap. This overlap may be due to the SSL model being pre-trained on a different dataset, allowing it to generalize across attack types when fine-tuned on ASVspoof2019 for spoofing detection.
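If you want to reproduce this kind of embedding plot, a minimal t-SNE sketch follows (random arrays stand in for the extracted embeddings and labels; our figure may use different settings):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 160))      # stand-in for extracted CM embeddings
labels = rng.choice(["bonafide", "A01", "A02", "A03"], size=1000)

pts = TSNE(n_components=2, random_state=0).fit_transform(emb)
for lab in np.unique(labels):
    m = labels == lab
    plt.scatter(pts[m, 0], pts[m, 1], s=4, label=lab)
plt.legend()
plt.title("CM embeddings by attack type")
plt.show()
```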
Regarding,
But I think what you are alluding to, and what would be very interesting, is to use the attributes to describe the unseen attacks (A07, A08, A09, etc.) and carry out the evaluation at the attribute level. This would answer the question of whether we can describe an unseen attack in terms of its attributes.
It would indeed be worthwhile to try describing the unseen attacks as a combination of known features. In this regard, Section 5.3 of my Master’s thesis, titled ‘Opening the Black Box for Attribution of Spoofed Speech’, might be an interesting read for you.
We are currently focusing on unseen attacks, and you may find some intriguing observations in our upcoming publications. 🙂
Thanks for sharing your thesis, @Manasi2001! That certainly looks very interesting and I'll give it a careful read.
Just wanted to mention that I've done a different analysis from yours, but in the same direction of analyzing the unseen systems. Concretely, I've looked at how well your method predicts each attribute set for the unseen attacks. Using the provided attribute probabilities, we can easily check whether the correct attribute was predicted even for an unseen attack (as long as that attribute was encountered in the training attacks). My code is here.
The results show the accuracy per attribute set for each attack. In short, we see that the attribute classifiers are often inaccurate on the unseen attacks.
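In case the link goes stale, the core of the check is roughly the following (the attribute-set boundaries are inferred from the one-hot patterns of A01–A06, and the probabilities here are random placeholders):

```python
import numpy as np

GROUP_SIZES = [2, 3, 3, 5, 3, 5, 4]  # attribute-set sizes inferred from A01-A06

def per_set_correct(p, y):
    """p: predicted 25-dim attribute probabilities; y: 25-dim 0/1 ground truth.
    Returns one entry per attribute set: True/False if the argmax matches the
    true option, or None when the set does not apply (all zeros for this attack)."""
    out, i = [], 0
    for g in GROUP_SIZES:
        p_g, y_g = np.asarray(p[i:i + g]), np.asarray(y[i:i + g])
        out.append(None if y_g.sum() == 0 else bool(p_g.argmax() == y_g.argmax()))
        i += g
    return out

# Example on the unseen attack A07 with placeholder predictions.
y_a07 = [1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0]
p_hat = np.random.default_rng(0).random(25)
print(per_set_correct(p_hat, y_a07))
```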
I don't want to sidetrack this thread even further, but your observation about "describing the unseen attacks as a combination of known features" and the mango–orange example in your thesis reminded me of work done in the computer vision community in the early 2010s [1, 2, 3]: all of these methods used attributes to build object classifiers with no training data. I was thinking that a similar idea could be used here in the context of deepfake detection. Unfortunately, it seems to me (per the previous comment) that the attribute classifiers are not robust enough on unseen attacks to make such a claim.
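To make the connection concrete, a direct-attribute-prediction (DAP) style scorer in the spirit of [1, 3] could look like the sketch below; this only illustrates the idea and is not something I have validated on this task:

```python
import numpy as np

# Placeholder attribute probabilities predicted for one utterance.
p = np.random.default_rng(0).random(25)

# Binary attribute signatures of candidate (possibly unseen) attacks,
# taken from the 25-dim vectors quoted in this thread.
signatures = {
    "A04": np.array([1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0]),
    "A07": np.array([1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0]),
}

def dap_log_score(p, sig):
    # Log-likelihood of a signature under independent attribute predictions:
    # use p where the attribute is active and 1 - p where it is not.
    return float(np.sum(np.log(np.where(sig == 1, p, 1.0 - p) + 1e-12)))

best = max(signatures, key=lambda a: dap_log_score(p, signatures[a]))
print({a: round(dap_log_score(p, signatures[a]), 3) for a in signatures}, "->", best)
```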
[1] Lampert, Christoph H., Hannes Nickisch, and Stefan Harmeling. "Learning to detect unseen object classes by between-class attribute transfer." CVPR, 2009.
[2] Farhadi, Ali, et al. "Describing objects by their attributes." CVPR, 2009.
[3] Lampert, Christoph H., Hannes Nickisch, and Stefan Harmeling. "Attribute-based classification for zero-shot visual object categorization." PAMI 36.3 (2013): 453-465.
Hi, thank you for your nice paper and code. Your work is very solid. While going through it, I had a few questions and would appreciate it if you could clarify them when you have a moment.

Task Definition and Labels: I noticed that the attribute labels are defined as:
```python
data = {
    "A01": [1,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0],
    "A02": [1,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0],
    "A03": [1,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0],
    "A04": [1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0],
    "A05": [0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0],
    "A06": [0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1],
    "A07": [1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0],
    "A08": [1,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0],
    "A09": [1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0],
    "A10": [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    "A11": [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    "A12": [1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0],
    "A13": [0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0],
    "A14": [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0],
    "A15": [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0],
    "A16": [1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0],
    "A17": [0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0],
    "A18": [0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    "A19": [0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1],
}
```
which refers to this table in the official 19LA paper. In the first stage, you extract this 25-dimensional vector to be used for decision-tree attribution in the second stage. I was wondering if, in this setup, the category for genuine speech is also included within this 25-dimensional vector in the first stage? If not, I’d be curious to understand how this is addressed in the Spoofing Detection task in the second stage.
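As an aside, the 25 dimensions above appear to decompose into seven attribute sets that are one-hot for the training attacks A01–A06; here is a quick check using the `data` dict above (the group sizes are inferred from the patterns, so treat them as my assumption):

```python
import numpy as np

GROUP_SIZES = [2, 3, 3, 5, 3, 5, 4]           # inferred attribute-set sizes (sum = 25)
splits = np.cumsum(GROUP_SIZES)[:-1]          # split points between the sets

for attack in ["A01", "A02", "A03", "A04", "A05", "A06"]:
    groups = np.split(np.array(data[attack]), splits)
    assert all(g.sum() == 1 for g in groups)  # exactly one active option per set
```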
Calculation for Spoofing Attack Attribution: I noticed that each audio sample, like A01, is associated with multiple subcategories indicated by 1s in a vector like [1,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0]. For an attribution to be considered correct in your accuracy and F1 score metrics (in the Spoofing Attack Attribution task), does the model need to predict all the 1s and 0s accurately for a given sample?
Training and Testing Protocol: I noticed that the results for Spoofing Attack Attribution are promising. I was curious whether these results were achieved by training on A01-A06 and then testing on A07-A16 within the 19LA dataset, or if there was another testing protocol applied here.
Thank you very much for your time in addressing these questions.