Open roopeshn28 opened 1 year ago
@BeauGeogeo That's great that you were able to successfully extract phonemes. I also tried pocketsphinx but got poor results. I am in the process of trying allosaurus. Could you share the code and ipa-to-cmu map you are using to extract phonemes accurately?
Hello @andrewkuo. Unfortunately I cannot share the code, sorry, as it is not allowed by the company I am currently working for.
But I can give you some pointers. For the mapping, you can look at the Papagayo repository; they have a reasonably good mapping. You can also find other GitHub issues addressing this problem, and in some cases I checked the IPA phonemes myself and tried to find the closest CMU equivalent.
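Just to give an idea of what I mean, the mapping can simply be a dict from IPA symbols to CMU phonemes. This is only an illustrative sketch, not the actual Papagayo table, and it has to be extended to cover every symbol allosaurus can emit:

```python
# Illustrative only: a tiny IPA -> CMU (ARPAbet) mapping in the spirit of Papagayo's.
IPA_TO_CMU = {
    "i": "IY", "ɪ": "IH", "e": "EY", "ɛ": "EH", "æ": "AE",
    "ɑ": "AA", "ʌ": "AH", "u": "UW", "ʊ": "UH", "oʊ": "OW",
    "p": "P", "b": "B", "t": "T", "d": "D", "k": "K", "g": "G",
    "s": "S", "z": "Z", "ʃ": "SH", "tʃ": "CH", "dʒ": "JH",
    "m": "M", "n": "N", "ŋ": "NG", "l": "L", "ɹ": "R",
}

def ipa_to_cmu(ipa_symbol, default="SIL"):
    """Map one IPA symbol to a CMU phoneme, falling back to silence if unknown."""
    return IPA_TO_CMU.get(ipa_symbol, default)
```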
Then you use librosa to segment the audio with its split function. For the timestamps corresponding to voice activity, you run allosaurus on them. Allosaurus is pretty good; just set the emission parameter a bit higher than 1, I think. You will get some "noisy" phonemes, but you also catch some that are missed with the default parameter. For the periods of silence (between the timestamp ending a detected voice activity and the timestamp starting the next one), you have to add silence (and add the empty symbol, which is one way to represent silence). Finally, you perform the mapping and then use the phoneme-parsing function of AVCT (mentioned by StyleTalk) to get your phoneme sequence. But be careful: you have some values to change in the AVCT function because the sampling rate is not the same. You will have to change 25 to 30 and < 4 to < 3 or 3.3, something like that; you can run some experiments and find the best values. A rough sketch of the segmentation and phoneme-detection part is below.
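Here is a minimal sketch of the librosa + allosaurus part. The paths, the top_db threshold and the emit value are placeholders to adapt to your audio, and I am assuming the Python recognize() accepts the emit argument like the CLI's -e/--emit flag; check your allosaurus version:

```python
import librosa
import soundfile as sf
from allosaurus.app import read_recognizer  # pip install allosaurus

AUDIO_PATH = "speech.wav"  # placeholder path

# Load audio and find voice-activity intervals (sample indices) with librosa.
y, sr = librosa.load(AUDIO_PATH, sr=16000)
intervals = librosa.effects.split(y, top_db=30)  # lower top_db => more sensitive detection

model = read_recognizer()
segments = []  # list of (start_sec, end_sec, [IPA phonemes])

for start, end in intervals:
    # Write the voiced chunk to a temporary file so allosaurus can read it.
    sf.write("chunk.wav", y[start:end], sr)
    # emit > 1 makes allosaurus emit more (possibly noisy) phonemes;
    # assumed to mirror the CLI --emit option.
    ipa = model.recognize("chunk.wav", "ipa", emit=1.2).split()
    if ipa:  # keep only segments where phonemes were actually detected
        segments.append((start / sr, end / sr, ipa))
```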
Last but not least, we got decent results, but clearly not as accurate as StyleTalk. BUT I think it is only a matter of retraining the model. Indeed, the phonemes of allosaurus are not always the same as CMU, and in particular the durations are not the same. It means that no matter how good your mapping, phoneme detection, and audio segmentation are, the weights of the model are not suited to the sequences produced because it saw different patterns during training, hence you will have to retrain the model.
For the training, you can keep the IPA-CMU mapping; it is the easiest way. I think you can get better results by keeping the IPA symbols and using a different word embedding than StyleTalk, i.e. an IPA-based one. I have not had the time to check this step yet, but I will probably do it in the coming weeks.
I hope it helps, let me know if you have additional questions or if you get interesting results.
Best regards
@BeauGeogeo thank you very much for posting your process. I will implement this and see if I can get good results. I'll let you know if I find any other good techniques.
@BeauGeogeo I have gotten some good results with your direction. Thank you so much!
Have you encountered jittering and zoom in/zoom out effects from 3DMM pose coefficients from custom videos? I followed PIRenderer DatasetHelper to extract 3DMM coeff for videos. For the most part it is pretty good, but every once in a while during the output video the face jitters/zooms in/out. Have you encountered this issue?
I'm wondering if it is because I have not preprocessed the initial video correctly. I currently use FOMM's crop-video.py helper to crop initial video.
Hi @andrewkuo, glad to hear it, you're welcome!
First of all, I want to let you know that I have modified my solution a bit. I had too many cases of silence periods, and therefore empty phoneme lists, and my code was becoming messy handling them, so I wrote a simpler yet more effective version. I make librosa quite sensitive to voice detection, and I only store in a list the periods of voice activity where allosaurus detects phonemes. Then I just loop over that list: I check whether I have to add silence at the beginning and at the end, and for the rest I add silences between the periods of activity. I set the duration of each silence from the beginning (bg) of the current voice-activity period and the end (ed) of the previous one (and from the start and end of the audio for the special cases of the beginning and end of it, of course). A small sketch of that loop is below.
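Something along these lines, assuming segments is the list of (bg, ed, phonemes) tuples from the earlier snippet; the silence symbol and the exact output format are placeholders, not the real AVCT format:

```python
SIL = "SIL"  # placeholder silence symbol; use whatever AVCT expects

def add_silences(segments, audio_duration):
    """Insert silence entries before, between, and after voice-activity segments.

    segments: list of (bg, ed, phonemes) tuples in seconds, sorted by time.
    Returns a list of (bg, ed, phonemes_or_silence) covering the whole audio.
    """
    out = []
    prev_end = 0.0
    for bg, ed, phonemes in segments:
        if bg > prev_end:
            out.append((prev_end, bg, [SIL]))   # silence before this segment
        out.append((bg, ed, phonemes))
        prev_end = ed
    if prev_end < audio_duration:
        out.append((prev_end, audio_duration, [SIL]))  # trailing silence
    return out
```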
Moreover, allosaurus working well plus a good mapping is able to produce good results even without retraining, contrary to what I first thought, and apparently you have witnessed the same thing, as I guess you did not retrain the model?
Now that you mention it, I do have one moment of jittering in a video among the three I produced. It is a video of Yann LeCun speaking French, and he makes a lot of pauses and 'huuuh' sounds, so this time maybe it really is a case of an odd sequence the model has never seen before, so it might fail a little, but I'm not sure. It happens only in this video and for a short duration, but indeed it happens. In my case I do not use any cropping; I just produce the video with the StyleTalk inference file.
It could be interesting to share some of the videos and phonemes files we produced by mail, let me know if you are interested.
Best regards
Sorry @andrewkuo, I realize I missed the point in my previous answer (not a good idea to answer when you've just woken up ^^). I don't remember witnessing the effects you are talking about. I sometimes get a lot of artifacts for challenging identities, or when the head pose is extreme and/or the chosen pose is really different from that of the identity, but no jittering or zoom in/zoom out. I used PIRenderer's preprocessing and code to extract 3DMM coefficients from custom driving videos, but to generate a video with StyleTalk I just give it the inputs without any particular preprocessing; I sometimes resize the image to 256x256 and convert it to PNG.
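For the resize/convert step I just do something like this (file names are placeholders):

```python
from PIL import Image

# Resize the identity image to 256x256 and save it as PNG for StyleTalk.
img = Image.open("identity.jpg").convert("RGB")
img = img.resize((256, 256), Image.LANCZOS)
img.save("identity.png")
```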
@BeauGeogeo No worries. Thanks for responding. I actually figured out why my jittering and zoom in/zoom out were happening. My source video had some weird mouth movements which caused some oddities in the StyleTalk output. I have tested with other videos and am not getting that issue anymore.
Oh OK great, thank you for this information and glad you solved your problem !
@roopeshn28 Could you elaborate a bit, please?
To produce your own style coefficients, you need to get the 3DMM coefficients of the driving style video. You can follow PIRenderer to learn how to do it. Sometimes it is a bit tricky to install all the libraries, but I managed to do it, so it can work.
For the phonemes, you have to use a phoneme detector and put the result in the same format as in AVCT. I tried a lot of things with pocketsphinx but was unable to reproduce results as good as StyleTalk's, so I moved to a custom solution: using librosa to segment speech and add silences when needed, then using allosaurus to detect the phonemes, and finally mapping them to the CMU Sphinx dictionary (but you could also replace the word embedding with an IPA-based one and retrain the model that way).