Closed: dyhan316 closed this issue 5 months ago
Hi, the tokenization code for Llama is already present in the tokenization_helpers.py file. The method is somewhat involved and has a number of edge cases, but essentially we only give timepoints to tokens that appear at the end of words.
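For anyone reading later, here is a minimal sketch of that end-of-word convention. This is not the exact logic in tokenization_helpers.py; the checkpoint name and the leading-space handling below are assumptions for illustration only.

```python
# Sketch: give each word's timepoint to the LAST token of that word;
# earlier sub-word tokens get no timepoint (None).
from transformers import AutoTokenizer

def align_token_times(words, word_times, tokenizer):
    tokens, token_times = [], []
    for word, t in zip(words, word_times):
        # Tokenizing word-by-word; real code must handle the tokenizer's
        # leading-space behaviour and other edge cases.
        word_tokens = tokenizer.tokenize(" " + word)
        tokens.extend(word_tokens)
        token_times.extend([None] * (len(word_tokens) - 1))  # non-final pieces
        token_times.append(t)                                # word-final token
    return tokens, token_times

# Hypothetical usage; the checkpoint name is a placeholder, not the paper's.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
toks, times = align_token_times(["the", "speaker", "paused"], [0.0, 0.4, 1.1], tokenizer)
```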
Oh I see! (I didn't think to look inside the ridge utils folder because I thought it wouldn't contain Llama stuff.) Thank you!!
A few more questions, if you don't mind:
Thank you again in advance :)
Also, a heads up: it seems that the lines from 355 onwards should be outside the for loop (otherwise only the first story's downsampled featureseqs are returned!) — see the pattern sketched below.
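For other readers, this is the usual indentation slip where the return sits inside the story loop; the function and variable names below are hypothetical, just to illustrate the pattern I mean.

```python
def lanczos_downsample(seq):
    # Stand-in for the real Lanczos downsampling (hypothetical).
    return seq[::2]

def downsample_all(stories, featureseqs):
    downsampled = {}
    for story in stories:
        downsampled[story] = lanczos_downsample(featureseqs[story])
        # return downsampled    # inside the loop: only the first story is processed
    return downsampled          # outside the loop: all stories are returned
```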
Do you have a place where you uploaded all the semantic features? It seems that the one in the Box folder you provided is only for a specific layer of a specific model size, and already downsampled. (I could run the code again, but I don't want to hurt the environment with redundant GPU computations.)
We are happy to provide the other (worse-performing) layers used in the paper in the Box; however, we have only saved the downsampled versions of those layers, owing to the file size of the non-downsampled versions. If you would like the non-downsampled versions, you will unfortunately have to run the code yourself. I will arrange to upload the other features we have stored to the Box.
How did you parallelize the large Llama model? Also, is there code you used to run and save the full set of LLM features? (The one in the Jupyter notebook seems to be limited to a very small OPT model, for cases where the model fits on a single GPU.)
We used a node on a supercomputer that had multiple A100s. The accelerate library makes it pretty easy to spread model weights across GPUs, if that is your question.
During feature extraction, did you use half precision or quantization, or did you use full fp32?
We generally used fp16 to save on vRAM.
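In case it helps others, here is a rough sketch of what that loading setup can look like with transformers + accelerate (sharded across GPUs, fp16 weights). The checkpoint name is a placeholder and this is only my guess at the setup, not the exact script used for the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-30b"  # placeholder checkpoint, not necessarily the one used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # fp16 weights to save vRAM
    device_map="auto",          # accelerate shards the layers across available GPUs
)

# Extract per-layer hidden states for a chunk of transcript text
inputs = tokenizer("some transcript text", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
hidden_states = outputs.hidden_states  # tuple: (embeddings, layer 1, ..., layer N)
```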
It seems that the transcripts do not have periods indicating the end of each sentence. Wouldn't this affect how the LLM understands the text, or is the effect minimal?
The transcripts are of spoken audio, which is inherently agrammatical. Others have had some success adding punctuation, but I have found that it doesn't really matter too much.
Thank you for the detailed response :)
Hello, I was just wondering,
How were the word times decided for each token? It appears that when you used GPT1 (in the Nature Neuroscience paper) each word was a token, so the word times could be directly mapped to the timing of the tokens (later to be resampled through Lanczos resampling).
However, this paper uses models with subword tokenization (i.e., one word != one token). So I was wondering: how did you give each token a timepoint?
Thank you in advance for your answer :)
Actually, could you please provide the code you used for LLAMA?