ReaLLMASIC / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

MIT License

Added partial code for snac tokens #212

Closed xinyixuu closed 3 months ago

xinyixuu commented 3 months ago

This adds several .mp3 files, .json files, and Python code.

tiny_sherlock_audio_01.json is the whisper output for the tiny_sherlock_audio_01.mp3 file.

faster_whisper_snac.py is the program that uses faster_whisper to do the mapping from word to snac. You can run the whole process with: bash snac_dataset.sh

sample_whisper_snac.py takes one audio file and one whisper json file and does the word to snac mapping. Run it with: python3 sample_whisper_snac.py tiny_sherlock_audio_01.mp3 tiny_sherlock_audio_01.json
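
For readers following along, here is a minimal sketch of the timestamp side of that word to snac mapping: turning each timestamped word in the whisper json into a sample range that could then be sliced out of the audio and handed to a SNAC encoder. It is not the code in sample_whisper_snac.py. It assumes the json has a top-level "transcription" list whose entries carry "offsets" ("from"/"to", in milliseconds) and "text"; the function name word_sample_ranges and the 24000 Hz sample rate are hypothetical choices for illustration.

```python
def word_sample_ranges(whisper_json, sample_rate=24000):
    """Return (word, start_sample, end_sample) for each timestamped word.

    Assumed layout (not confirmed by this thread):
      {"transcription": [{"offsets": {"from": ms, "to": ms}, "text": " word"}, ...]}
    """
    ranges = []
    for entry in whisper_json.get("transcription", []):
        word = entry.get("text", "").strip()
        if not word:
            continue  # skip empty or whitespace-only entries
        start_ms = entry["offsets"]["from"]
        end_ms = entry["offsets"]["to"]
        # Convert milliseconds to sample indices with integer arithmetic.
        start = start_ms * sample_rate // 1000
        end = end_ms * sample_rate // 1000
        if end > start:  # drop zero-length words
            ranges.append((word, start, end))
    return ranges


# Hand-written example data, not real whisper output.
example = {
    "transcription": [
        {"offsets": {"from": 0, "to": 500}, "text": " Sherlock"},
        {"offsets": {"from": 500, "to": 900}, "text": " Holmes"},
    ]
}

for word, start, end in word_sample_ranges(example):
    print(word, start, end)
```

Each (start, end) pair indexes into the decoded waveform, so the slice for one word would be waveform[start:end] before encoding.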

example.py is the simple debug file I use; the code in it is exactly what I sent through the chat. To run it: python3 example.py tiny_sherlock_audio_01_part000.mp3 (this uses the split audio file because the original sherlock_audio file is too big for this program to run).

gkielian commented 3 months ago

Okay, just tried the cpu-based implementation, and I can replicate the segmentation fault with gpu.

Noticed it required a lot of memory. Tried running on tiny_sherlock_audio_01_part000.mp3 on cpu, and it seems to have worked.

Will retry with gpu, but I think this may be related to the size of the audio sample.

gkielian commented 3 months ago

Okay, always amazed how much faster GPU is vs CPU, it's taking just 1-2 seconds per word now vs 1-2 minutes per word with CPU.

Segmentation fault appears to be avoided after targeting the 5.0MB file : )

gkielian commented 3 months ago

Any luck getting past the segmentation fault?

If not I can try to run on my machine just to support with the preprocessing step.

Also curious what you would like to add

xinyixuu commented 3 months ago

Just added two input files for testing. The directory of the program is under ~/data/snac. The command I used to run the program is: python3 sample_whisper_snac.py tiny_sherlock_audio_01_part000.mp3 tiny_sherlock_audio_01.json

gkielian commented 3 months ago

How are you setting up whisper?

I'd like to recreate the tiny_sherlock_audio_01.json file

xinyixuu commented 3 months ago

I just added the bash script that creates the output for this json file. Note that this script is unfinished. Specifically, this command gives the result: ./main -m ./models/ggml-base.en.bin -f "${input_audio}" -ml 1 -oj -of "${out_path}"
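
If the whisper step is ever driven from Python instead of bash, the same command could be assembled as below. This is only a sketch: the flags are copied from the command above, while the function names, the dry_run switch, and the example output path are hypothetical.

```python
import subprocess


def build_whisper_cmd(input_audio, out_path,
                      model="./models/ggml-base.en.bin", binary="./main"):
    """Build the argument list for the whisper command quoted in this thread."""
    return [binary, "-m", model, "-f", input_audio,
            "-ml", "1", "-oj", "-of", out_path]


def run_whisper(input_audio, out_path, dry_run=True):
    """Return the command; execute it only when dry_run is False."""
    cmd = build_whisper_cmd(input_audio, out_path)
    if not dry_run:
        # check=True raises CalledProcessError if whisper exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: print the command without executing anything.
print(" ".join(run_whisper("tiny_sherlock_audio_01.mp3", "tiny_sherlock_audio_01")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.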

gkielian commented 3 months ago

Okay, I just created a pull request patch: https://github.com/xinyixuu/nanoGPT/pull/1

Please merge and also remove the audio files, then I'll merge this commit in. Good work.

Next task is to target the directory of 5MB files, preferably processing them one by one and appending to the json output.

You can either try to do each one serially (first process with whisper for json timestamps, then process the snac tokens for these with sample_whisper_snac.py) or create a directory for all of the whisper timestamp files (maybe with the same names as the 5MB files, just with a .json extension), then do one last pass processing each audio file with its corresponding json file.
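
A rough sketch of the second option, under some assumptions: the 5MB chunks are .mp3 files in one directory, the whisper timestamp files share their base names in another directory, and results are appended one JSON object per line (JSON Lines, swapped in here because it makes appending trivial; a single json document would need read-modify-write). All function names are hypothetical, and process_pair stands in for whatever sample_whisper_snac.py does for one file.

```python
import json
from pathlib import Path


def pair_audio_with_json(audio_dir, json_dir):
    """Match each .mp3 chunk with the .json file that shares its base name."""
    pairs = []
    for audio in sorted(Path(audio_dir).glob("*.mp3")):
        timestamps = Path(json_dir) / (audio.stem + ".json")
        if timestamps.exists():  # skip chunks whisper has not processed yet
            pairs.append((audio, timestamps))
    return pairs


def append_result(output_path, record):
    """Append one record as a single JSON line."""
    with open(output_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def process_all(audio_dir, json_dir, output_path, process_pair):
    """Process the chunks one by one, appending each result as it finishes."""
    count = 0
    for audio, timestamps in pair_audio_with_json(audio_dir, json_dir):
        record = process_pair(audio, timestamps)  # hypothetical per-file step
        append_result(output_path, record)
        count += 1
    return count
```

Because each result is written as soon as its chunk finishes, a crash partway through (such as the segmentation fault discussed earlier) leaves the already processed chunks on disk.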

Let me know if you have any questions.