X-LANCE / UniCATS-CTX-txt2vec

[AAAI 2024] CTX-txt2vec, the acoustic model in UniCATS
https://cpdu.github.io/unicats

Which vq-wav2vec checkpoint was used for data preprocessing? #10

Open bestasoff opened 5 months ago

bestasoff commented 5 months ago

Hello @cantabile-kwok! Thanks for this amazing project, and congratulations on the acceptance at AAAI.

I have a question. What vq-wav2vec checkpoint was used for tokenizing the speech data?

I'm reproducing the data preprocessing and find that some of the resulting labels for the same LibriTTS files do not match yours.

Thank you again for this project!

cantabile-kwok commented 5 months ago

Thanks for your interest in our work! We use the k-means version of vq-wav2vec trained on LibriSpeech, as provided by fairseq; it is the second row of the table there. I should make that clearer in the README, though.

Please tell me if that solves your problem : )

bestasoff commented 5 months ago

Hey @cantabile-kwok

Thank you for your response. Yep, it helped. Almost all labels match now.

I have one more question about the data preprocessing. What steps did you take to get the text and duration files? Which MFA models did you use for that?

If possible, can you share that code please?

Thank you!

cantabile-kwok commented 5 months ago

@bestasoff We are not using a pretrained MFA model. Instead, we first train an alignment model with a 10 ms frame shift in Kaldi (I believe MFA is quite similar), which gives us the phoneme transcription of each utterance (i.e. the text) and the duration of each phoneme. We then split the silence labels into groups according to the following duration thresholds:

SIL1: dur <= 3
SIL2: 3 < dur <= 5
SIL3: 5 < dur <= 9
SIL4: 10 < dur <= 15
SIL5: 16 < dur <= 25
SIL6: dur > 25
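As a rough illustration, the grouping above could be sketched in Python as follows. Two things here are assumptions rather than the authors' code: that `dur` is an integer count of 10 ms frames, and that the buckets are contiguous. As written, the thresholds leave `dur = 10` and `dur = 16` unassigned, so this sketch places them in SIL4 and SIL5 respectively. The function name `silence_label` is hypothetical.

```python
import bisect

# Inclusive upper bounds (in frames) for SIL1..SIL5; longer silences are SIL6.
# Assumption: buckets are contiguous, so dur = 10 falls in SIL4 and dur = 16 in SIL5.
SIL_UPPER_BOUNDS = [3, 5, 9, 15, 25]


def silence_label(dur: int) -> str:
    """Map a silence duration (in frames) to a SIL1..SIL6 label."""
    # bisect_left returns the index of the first bound that is >= dur,
    # which is exactly the bucket whose inclusive upper bound covers dur.
    idx = bisect.bisect_left(SIL_UPPER_BOUNDS, dur)
    return f"SIL{idx + 1}"


if __name__ == "__main__":
    for dur in (2, 4, 9, 12, 20, 40):
        print(dur, silence_label(dur))
```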

But note that you don't necessarily have to obtain the same phone transcriptions as the provided ones. For this stage, any text preprocessing tool can be used; you only have to ensure that the durations match the frame shift of the vq-wav2vec features (which is 10 ms).
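One way to sanity-check this requirement is to compare the summed phone durations of an utterance against the number of vq-wav2vec frames for the same utterance. The sketch below assumes durations are already expressed in 10 ms frames; the function name and the `tolerance` parameter are hypothetical and not part of the repository.

```python
from typing import Sequence


def durations_match_frames(
    durations: Sequence[int], num_frames: int, tolerance: int = 0
) -> bool:
    """Return True if the phone durations (in frames) add up to the feature length.

    `tolerance` is a hypothetical allowance for an off-by-a-few-frames mismatch
    at the utterance boundary; 0 demands an exact match.
    """
    return abs(sum(durations) - num_frames) <= tolerance


if __name__ == "__main__":
    # Three phones lasting 12, 7 and 31 frames against a 50-frame feature matrix.
    print(durations_match_frames([12, 7, 31], 50))
```

If the aligner uses a different frame shift, the durations would first need to be converted to 10 ms frames before a check like this is meaningful.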

bestasoff commented 5 months ago

@cantabile-kwok Yep, I could do it with the pretrained MFA ARPA model. It's not very accurate (from what I read, that's because of the ARPA phone format), but it works.

Now I've hit another issue. I want to train the model on bigger datasets, so I need all the data preparation scripts to work. They all seem to work fine except make_ppe.sh. Whenever I run it with the variables provided in the script, I get this error:

assert all([specifier.startswith("scp:") for specifier in args.rspecifier]), \
AssertionError: Currently we only support passing rspecifier in scp format.This is because using kaldiio.load_scp, we can ensure the lazy-loading strategy instead of storing all the feats in memoryAlthough this may sacrifice some speed but in this way arbitrarily large feats can be supported

It's because of the pitch_feats and energy_feats variables.

Can you please help me resolve it? Thank you!

cantabile-kwok commented 5 months ago

I haven't encountered this before, so could you show me more information about the error, such as the whole log file? From what you've provided, I guess the program requires the input string to start with "scp:", like "scp:/path/to/wav.scp". But that also seems strange to me, because make_ppe.sh invokes only pure C++ and Kaldi programs and doesn't involve Python or kaldiio. So I'd also like to know what environment you are running make_ppe.sh in.

bestasoff commented 5 months ago

@cantabile-kwok Yes, sure. Here is the whole log I get after running that command: bash local/make_ppe.sh data/dev_all test-log feats/normed_ppe/test

# utils/paste-feats.py --length-tolerance=2 "ark:compute-kaldi-pitch-feats --verbose=2 --config=conf/pitch.conf scp,p:test-log/wav.1.scp ark:- | process-kaldi-pitch-feats --add-normalized-log-pitch=false --add-delta-pitch=false --add-raw-log-pitch=true ark:- ark:- |" "ark:compute-mfcc-feats --config=conf/mfcc.conf --use-energy=true scp,p:test-log/wav.1.scp ark:- | select-feats 0 ark:- ark:- |" ark,scp:feats/normed_ppe/test/feats.1.ark,feats/normed_ppe/test/feats.1.scp 
# Started at Sun Feb 25 03:09:59 PM UTC 2024
#
Namespace(verbose=0, length_tolerance=2, compress=False, compression_method=2, rspecifier=['ark:compute-kaldi-pitch-feats --verbose=2 --config=conf/pitch.conf scp,p:test-log/wav.1.scp ark:- | process-kaldi-pitch-feats --add-normalized-log-pitch=false --add-delta-pitch=false --add-raw-log-pitch=true ark:- ark:- |', 'ark:compute-mfcc-feats --config=conf/mfcc.conf --use-energy=true scp,p:test-log/wav.1.scp ark:- | select-feats 0 ark:- ark:- |'], wspecifier='ark,scp:feats/normed_ppe/test/feats.1.ark,feats/normed_ppe/test/feats.1.scp')
Traceback (most recent call last):
  File "/.../UniCATS-CTX-vec2wav/utils/paste-feats.py", line 88, in <module>
    main()
  File "/.../UniCATS-CTX-vec2wav/utils/paste-feats.py", line 56, in main
    assert all([specifier.startswith("scp:") for specifier in args.rspecifier]), \
AssertionError: Currently we only support passing rspecifier in scp format.This is because using kaldiio.load_scp, we can ensure the lazy-loading strategy instead of storing all the feats in memoryAlthough this may sacrifice some speed but in this way arbitrarily large feats can be supported
# Accounting: time=0 threads=1
# Ended (code 1) at Sun Feb 25 03:09:59 PM UTC 2024, elapsed time 0 seconds

cantabile-kwok commented 5 months ago

@bestasoff That is a bit weird, since make_ppe.sh is only supposed to invoke the Kaldi command paste-feats, not the Python script utils/paste-feats.py in the repository. Although the two serve the same purpose, the Python version does not support the syntax used in make_ppe.sh. Could you check whether your make_ppe.sh is calling the Kaldi command paste-feats or the Python script paste-feats.py?
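To make the mismatch concrete, the check from the traceback can be reproduced in isolation. The sketch below reuses the `startswith("scp:")` test shown in the log and applies it to one of the piped `ark:` specifiers from the same log; the helper name `unsupported_rspecifiers` is hypothetical. It also uses `shutil.which` to report whether a `paste-feats` binary is on the `PATH`, on the assumption that the Kaldi tools would normally be found that way.

```python
import shutil
from typing import List


def unsupported_rspecifiers(rspecifiers: List[str]) -> List[str]:
    """Return the specifiers that fail the Python script's 'scp:' prefix check."""
    return [spec for spec in rspecifiers if not spec.startswith("scp:")]


# A piped specifier copied from the log, and a plain scp specifier for contrast.
PIPED = (
    "ark:compute-mfcc-feats --config=conf/mfcc.conf --use-energy=true "
    "scp,p:test-log/wav.1.scp ark:- | select-feats 0 ark:- ark:- |"
)
PLAIN = "scp:/path/to/wav.scp"

if __name__ == "__main__":
    # Only the piped 'ark:' specifier is rejected by the prefix check.
    print(unsupported_rspecifiers([PIPED, PLAIN]))
    # None here means no paste-feats executable was found on PATH.
    print(shutil.which("paste-feats"))
```

If `shutil.which` prints `None`, the Kaldi binaries are probably not on the `PATH` in that shell, which may be worth ruling out alongside checking which command make_ppe.sh actually calls.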