facebookresearch / spiritlm

Inference code for the paper "Spirit-LM Interleaved Spoken and Written Language Model".

I can't generate audio #6

Open shingo-vokov opened 1 month ago

shingo-vokov commented 1 month ago

I'm trying to use it:

# GenerationConfig comes from transformers; spirit_lm is the loaded
# Spirit LM model (setup not shown in this snippet).
from transformers import GenerationConfig

outputs = spirit_lm.generate(
    interleaved_inputs=[('text', "I am so deeply saddened, it feels as if my heart is shattering into a million pieces and I can't hold back the tears that are streaming down my face.")],
    output_modality='speech',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
    speaker_id=1,
)
display_outputs(outputs)

but I see this warning:

/home/.conda/envs/spiritlm/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:579: UserWarning: `pad_token_id` should be positive but got -1. This will cause errors when batch generating, if there is padding. Please set `pad_token_id` explicitly as `model.generation_config.pad_token_id=PAD_TOKEN_ID` to avoid errors in generation
  warnings.warn(

How do I get PAD_TOKEN_ID?
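
The warning suggests setting it explicitly. Is something like the following the right way to do it? (Sketch only: I am guessing that the wrapper exposes the underlying transformers model as spirit_lm.model, and that falling back to eos_token_id is acceptable.)

# Guessed workaround, not verified for Spirit LM: reuse eos_token_id as the
# pad token id, a common convention for decoder-only models.
model = spirit_lm.model  # assumption: wrapper exposes the underlying HF model
model.generation_config.pad_token_id = model.generation_config.eos_token_id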

hitchhicker commented 1 month ago

Could you share your transformers version and how you set up the conda environment? Thanks! I don't have this error on my side.

gallilmaimon commented 1 month ago

Hey @hitchhicker, I get the same warning when running the generation example.

I created the environment as explained for Conda:

conda env create -f env.yml
pip install -e '.[eval]'

And my transformers version is transformers==4.46.0.
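
For reference, the installed version can be checked with:

python -c "import transformers; print(transformers.__version__)"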

hitchhicker commented 1 month ago

Hey @gallilmaimon, thanks for providing the setup information! I wonder whether this error happens only on Python 3.10; I am using 3.9, by the way. What Python version are you using?

gallilmaimon commented 1 month ago

I am using Python 3.9.20, and indeed the env.yml specifies python==3.9

hitchhicker commented 1 month ago

My Python version is also 3.9.20, down to the patch version.

The following is the output of my pip freeze:

antlr4-python3-runtime==4.8
audioread==3.0.1
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.3.2
decorator==5.1.1
einops==0.8.0
encodec==0.1.1
exceptiongroup==1.2.2
fairscale==0.4.13
filelock @ file:///croot/filelock_1700591183607/work
fsspec==2024.9.0
gmpy2 @ file:///tmp/build/80754af9/gmpy2_1645438755360/work
huggingface-hub==0.25.1
idna==3.10
iniconfig==2.0.0
Jinja2 @ file:///croot/jinja2_1716993405101/work
joblib==1.4.2
lazy_loader==0.4
librosa==0.10.2.post1
llvmlite==0.43.0
local-attention==1.9.15
MarkupSafe @ file:///croot/markupsafe_1704205993651/work
mkl-service==2.4.0
mkl_fft @ file:///croot/mkl_fft_1725370245198/work
mkl_random @ file:///croot/mkl_random_1725370241878/work
mpmath @ file:///croot/mpmath_1690848262763/work
msgpack==1.1.0
networkx @ file:///croot/networkx_1717597493534/work
numba==0.60.0
numpy @ file:///croot/numpy_and_numpy_base_1725470312869/work/dist/numpy-2.0.1-cp39-cp39-linux_x86_64.whl#sha256=d86a49760b169e0c4fb8c00d248077a0474f640071b7ef584afd5ad4f03b9428
omegaconf==2.2.0
packaging==24.1
pandas==2.2.3
platformdirs==4.3.6
pluggy==1.5.0
pooch==1.8.2
pyarrow==17.0.0
pycparser==2.22
pytest==8.3.3
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML @ file:///croot/pyyaml_1698096049011/work
regex==2024.9.11
requests==2.32.3
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.13.1
sentencepiece==0.2.0
six==1.16.0
soundfile==0.12.1
soxr==0.5.0.post1
sympy @ file:///croot/sympy_1724938189289/work
threadpoolctl==3.5.0
tokenizers==0.20.0
tomli==2.0.2
torch==2.4.1
torchaudio==2.4.1
torchfcpe==0.0.4
tqdm==4.66.5
transformers==4.45.1
triton==3.0.0
typing_extensions @ file:///croot/typing_extensions_1715268824938/work
tzdata==2024.2
urllib3==2.2.3

I notice that my transformers version is transformers==4.45.1, which is different from yours.

gallilmaimon commented 1 month ago

I will try downgrading transformers to see if that makes any difference. I will also try creating the environment with pip instead of Anaconda, and let you know if there is any difference.

The pip installation instructions say:

pip install -e requirements.txt
pip install -e '.[eval]'

but the first line gives an error. I think it should be pip install -r requirements.txt (or simply pip install -e .). Is this correct?

hitchhicker commented 1 month ago

Thanks!

You are right: pip install -e requirements.txt has a typo, the "-e" should be "-r". And in fact we don't need that line at all. I will update the README.
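
So after the fix, the pip section of the README will just be:

pip install -e '.[eval]'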

gallilmaimon commented 1 month ago

I tried with transformers==4.45.1 like you (and also tried installing with pip instead of Conda), but I still got the same warning:

python3.9/site-packages/transformers/generation/configuration_utils.py:568: UserWarning: `pad_token_id` should be positive but got -1. This will cause errors when batch generating, if there is padding. Please set `pad_token_id` explicitly as `model.generation_config.pad_token_id=PAD_TOKEN_ID` to avoid errors in generation

It is worth mentioning that (unlike the issue title) I am managing to generate audio:

[GenerationOuput(content=array([-0.00186113, -0.00068325, -0.0015525 , ..., -0.00764565,
       -0.0104003 , -0.01227323], dtype=float32), content_type=<ContentType.SPEECH: 'SPEECH'>)]
[GenerationOuput(content=' you think you can make it i shall be very glad if you can', content_type=<ContentType.TEXT: 'TEXT'>), GenerationOuput(content=array([ 0.04572767,  0.03901244,  0.03441606, ..., -0.18233861,
       -0.2093165 , -0.22660626], dtype=float32), content_type=<ContentType.SPEECH: 'SPEECH'>), GenerationOuput(content=' human being to see and suffer wrong without offering the help of his hand and his life now a new passion was in him and a new dignity as of one who stood upon the brink of a mighty change', content_type=<ContentType.TEXT: 'TEXT'>), GenerationOuput(content=array([-0.00298501, -0.00290124, -0.00230354, ..., -0.20166263,
       -0.20007564, -0.19413921], dtype=float32), content_type=<ContentType.SPEECH: 'SPEECH'>), GenerationOuput(content=' ragged clay stains were gone and his eyes', content_type=<ContentType.TEXT: 'TEXT'>)]

but the warning above indicates that I might get wrong results when working with batches, which is something I would like to do...

hitchhicker commented 1 month ago

Awesome to see that you are able to generate outputs!

In fact, we don't really support batch prediction, since the implementation of the speech tokenizer does not support it (one prediction can contain multiple texts, multiple audio segments, or a mix of them, but they are still a single batch). I see that your output contains two lists; for each call of generate, we expect to see only one list. Would you mind sharing the input you used for interleaved_inputs? Thanks!
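
For reference, the supported pattern is one generate call per sample, along these lines (sketch only, reusing the config from your snippet; samples is a placeholder list of prompts):

from transformers import GenerationConfig

# One generate call per input sample; no batching inside a single call.
all_outputs = []
for text in samples:
    all_outputs.append(spirit_lm.generate(
        interleaved_inputs=[('text', text)],
        output_modality='speech',
        generation_config=GenerationConfig(
            temperature=0.8,
            top_p=0.95,
            max_new_tokens=200,
            do_sample=True,
        ),
        speaker_id=1,
    ))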

gallilmaimon commented 1 month ago

The outputs are fine and make sense (there are two generate calls) :)

I wanted to calculate probabilities of speech-only (non-interleaved) samples in batches, to compute the sWUGGY metric (as in the paper) or other modelling metrics like SALMon (https://arxiv.org/abs/2409.07437), and doing so without batching can be slow. However, as this is already a bit out of scope for this issue, I will try it, and if I hit the same warning or unexplained behaviour there I will open a new issue.
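
For context, this is roughly the batched scoring I have in mind; a generic transformers sketch, not Spirit LM specific (the gpt2 checkpoint and the texts are placeholders), and note that it sets pad_token_id explicitly, which is exactly what the warning asks for:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                          # define a pad token
model.generation_config.pad_token_id = tok.pad_token_id

texts = ["a real word", "a pseudo word"]               # sWUGGY-style pair
batch = tok(texts, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**batch).logits

# Predict token t from tokens < t, then mask out padding positions.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
labels = batch["input_ids"][:, 1:]
mask = batch["attention_mask"][:, 1:]
token_ll = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
seq_ll = (token_ll * mask).sum(dim=-1)  # per-sequence log-likelihood
print(seq_ll)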

Thank you!