Closed xinyixuu closed 3 months ago
Okay, just tried the CPU-based implementation, and replicated the segmentation fault with the GPU.
I noticed it required a lot of memory, so I tried running on tiny_sherlock_audio_01_part000.mp3 on CPU, and it seems to have worked.
Will retry with GPU, but I think this may be related to the size of the audio sample.
Okay, always amazed how much faster GPU is vs CPU, it's taking just 1-2 seconds per word now vs 1-2 minutes per word with CPU.
Segmentation fault appears to be avoided after targeting the 5.0MB file : )
Any luck getting past the segmentation fault?
If not I can try to run on my machine just to support with the preprocessing step.
Also curious what you would like to add
I just added two input files for testing.
The program lives under ~/data/snac.
The command I used to run the program is:
python3 sample_whisper_snac.py tiny_sherlock_audio_01_part000.mp3 tiny_sherlock_audio_01.json
How are you setting up whisper?
I'd like to recreate the tiny_sherlock_audio_01.json file
I just added the bash script that creates the output for this JSON file. Note that this script is unfinished. Specifically, the command: ./main -m ./models/ggml-base.en.bin -f "${input_audio}" -ml 1 -oj -of "${out_path}"
gives the result
Okay, I just created a pull request patch: https://github.com/xinyixuu/nanoGPT/pull/1
Please merge it and also remove the audio files, then I'll merge this commit in. Good work.
The next task is to target the directory of 5MB files, preferably processing them one by one and appending to the JSON output.
You can either process each file serially (first run whisper to get the JSON timestamps, then generate the snac tokens for them with sample_whisper_snac.py), or create a directory for all of the whisper timestamp files (perhaps with the same names as the 5MB files, just with a .json extension) and then make one last pass that processes each audio file with its corresponding JSON file.
Let me know if you have any questions.
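A rough sketch of the second option (one pass for whisper timestamps, one pass for snac tokens) might look like the following. It only builds the command lines rather than running them. The function name `plan_two_pass`, the directory arguments, and the assumption that whisper's `-of` takes an output path without the `.json` extension are illustrative guesses, not taken from the actual bash script.

```python
from pathlib import Path

# Binary and model paths copied from the whisper command quoted in this thread.
WHISPER_BIN = "./main"
WHISPER_MODEL = "./models/ggml-base.en.bin"


def plan_two_pass(audio_dir, json_dir):
    """Build (but do not run) the commands for a two-pass run.

    Hypothetical helper: pass 1 writes one whisper JSON per .mp3 into
    json_dir, pass 2 feeds each audio/JSON pair to sample_whisper_snac.py.
    """
    audio_dir, json_dir = Path(audio_dir), Path(json_dir)
    whisper_cmds, snac_cmds = [], []
    for audio in sorted(audio_dir.glob("*.mp3")):
        # Assumption: -of is given without extension and -oj appends ".json".
        out_stem = json_dir / audio.stem
        json_path = json_dir / (audio.stem + ".json")
        whisper_cmds.append([
            WHISPER_BIN, "-m", WHISPER_MODEL,
            "-f", str(audio), "-ml", "1", "-oj", "-of", str(out_stem),
        ])
        snac_cmds.append([
            "python3", "sample_whisper_snac.py", str(audio), str(json_path),
        ])
    return whisper_cmds, snac_cmds
```

Each returned command is a plain argument list, so it could be handed to `subprocess.run(cmd, check=True)` one file at a time, which keeps a crash on one audio file from losing the results for the others.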
There are several .mp3 files, .json files, and Python programs.
tiny_sherlock_audio_01.json is the result of running whisper on tiny_sherlock_audio_01.mp3.
faster_whisper_snac.py uses faster_whisper to do the word-to-snac mapping. Run the whole process with: bash snac_dataset.sh
sample_whisper_snac.py takes one audio file and one whisper JSON file and does the word-to-snac mapping. Run it with: python3 sample_whisper_snac.py tiny_sherlock_audio_01.mp3 tiny_sherlock_audio_01.json
example.py is the simple debug file I use; its code is exactly what I sent through the chat. Run it with: python3 example.py tiny_sherlock_audio_01_part000.mp3
(The split audio file is used here because the original sherlock_audio file is too big for this program to run.)
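To make the word-to-snac mapping step a little more concrete, here is a minimal sketch of how per-word timestamps could be turned into audio sample ranges before encoding. This is not the code in sample_whisper_snac.py. It assumes the whisper JSON has a top-level `transcription` list whose entries carry `text` and millisecond `offsets` (`from`/`to`), and the 24000 Hz sample rate and the function name `word_sample_ranges` are placeholders.

```python
def word_sample_ranges(whisper_json, sample_rate=24000):
    """Map each word in a whisper JSON dict to a (word, start, end) sample range.

    Assumptions (not confirmed by the repo): entries live under
    "transcription", and "offsets" holds start/end times in milliseconds.
    """
    ranges = []
    for entry in whisper_json.get("transcription", []):
        word = entry.get("text", "").strip()
        if not word:
            continue  # skip empty or whitespace-only tokens
        start_ms = entry["offsets"]["from"]
        end_ms = entry["offsets"]["to"]
        # Convert milliseconds to sample indices with integer arithmetic.
        start = start_ms * sample_rate // 1000
        end = end_ms * sample_rate // 1000
        ranges.append((word, start, end))
    return ranges
```

Each range could then be used to slice the decoded waveform so that only that word's samples are passed to the snac encoder, which also keeps the per-call memory footprint small.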