Hello, I followed the example of extracting features from speech with the AST model and adapted it to extract features from new speech using my own model. Every output I get has shape [1, 1214, 768], but I want a single feature vector per clip, with shape [1, 768]. Is [1, 1214, 768] the expected shape of the features from the final layer of AST, or have I made a mistake somewhere? Thank you for your assistance, and I look forward to your reply.
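To make the goal concrete, here is a minimal sketch of the kind of reduction I have in mind, going from [1, 1214, 768] to [1, 768]. It uses a random NumPy array as a stand-in for the final-layer output, so the values are not real AST features. The helper names `mean_pool` and `prefix_token_pool` are hypothetical, and the idea that the first two positions are special (CLS-style) tokens is an assumption about the model that may not hold for every checkpoint.

```python
import numpy as np

# Stand-in for the final-layer output: [batch, sequence, hidden].
# Random values only; a real run would use the model's output here.
rng = np.random.default_rng(0)
last_hidden_state = rng.standard_normal((1, 1214, 768)).astype(np.float32)


def mean_pool(hidden):
    """Average over the sequence axis: [batch, seq, dim] -> [batch, dim]."""
    return hidden.mean(axis=1)


def prefix_token_pool(hidden, num_prefix_tokens=2):
    """Average only the first few tokens: [batch, seq, dim] -> [batch, dim].

    Assumes the leading positions hold special summary tokens.
    """
    return hidden[:, :num_prefix_tokens, :].mean(axis=1)


mean_features = mean_pool(last_hidden_state)
prefix_features = prefix_token_pool(last_hidden_state)

print(mean_features.shape)    # (1, 768)
print(prefix_features.shape)  # (1, 768)
```

Either option gives one 768-dimensional vector per clip. Which one is more appropriate probably depends on how the model was trained, so I would appreciate confirmation on that point as well.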