
Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.

https://lightning.ai

Apache License 2.0


CodeGemma-7b-it #1272

Closed Andrei-Aksionov closed 1 month ago

Andrei-Aksionov commented 1 month ago

Hi there 👋

Google released three variants of the Gemma model for code generation: 2b, 7b, and 7b-it.

The 2b and 7b variants are useful only for code completion: they require a special prompt, and the output is not the best (see examples here).

7b-it is a much better model: it is more versatile and generates plausible outputs. Thus, this PR adds only this model.

If there is a strong intent to also include the 2b and 7b variants, I recommend first adding support for a custom prompt (provided as an argument).

rasbt commented 1 month ago

Wow, you already added it 🤯🫶

Andrei-Aksionov commented 1 month ago

There is an issue with hf-transfer and Python 3.9.19 (https://github.com/huggingface/hf_transfer/issues/33), so we need to merge the branch manually.

rasbt commented 1 month ago

I see. Can you help with that, @carmocca?

Andrei-Aksionov commented 1 month ago

I might be wrong, but it looks like there is a missing pre-built wheel for Python 3.9. So whenever you try to install this package with this Python version, pip tries to build a wheel from source. Since the package is written in Rust, the build expects a corresponding toolchain. I don't know why there were no issues before; maybe someone from the HF team accidentally deleted the wheel? 🤷

It looks like hf-transfer==0.1.4 (Nov 6, 2023) can be installed in Python 3.9.19 without problems.
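When a pre-built wheel seems to be missing in one environment but not another, it can help to print what each interpreter reports about itself, since wheels are matched on the Python version and platform. The snippet below is only a minimal sketch for comparing environments; the function name `interpreter_summary` is hypothetical, and the `cp` prefix assumes CPython.

```python
import sys
import sysconfig


def interpreter_summary():
    """Collect the interpreter details that wheel selection depends on."""
    major, minor = sys.version_info.major, sys.version_info.minor
    return {
        # Assumption: CPython, whose wheel tags look like "cp39".
        "python_tag": f"cp{major}{minor}",
        # Platform string, e.g. one value on macOS and another on Linux.
        "platform": sysconfig.get_platform(),
        # Full patch version, e.g. "3.9.19".
        "version": sys.version.split()[0],
    }


if __name__ == "__main__":
    for key, value in interpreter_summary().items():
        print(f"{key}: {value}")
```

Running this in both the local virtual environment and the Studio would show whether the two setups actually differ in Python version or platform.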

Andrei-Aksionov commented 1 month ago

I ran tests in different virtual environments on my local machine (macOS), but then rechecked in a Studio, and it looks like the missing wheel for Python 3.9 is a problem only on macOS, not on Linux. That means the cause is somewhere else.

Andrei-Aksionov commented 1 month ago

When I ran the installation in verbose mode, I noticed that uv cannot find a suitable version of boto3: it starts with the newest version, checks compatibility, fails, picks an older version, checks compatibility, fails, ...

In a Studio, when I changed the environment to Python 3.9.19 and used the same command (except for the --system flag), it ran without issues. Bleeding edge ...

carmocca commented 1 month ago

Can I merge this, Andrei? I imagine that these CI issues can be resolved separately.

Andrei-Aksionov commented 1 month ago

Yep, go for it. It's only a CI issue. Users should not have any problems.

carmocca commented 1 month ago

Something broke with the tokenizer test since the merge: https://github.com/Lightning-AI/litgpt/actions/runs/8654742013/job/23751422665#step:7:1110

Andrei-Aksionov commented 1 month ago

Yep, I see. Weird. Now the question is how to debug it. I don't have a Windows machine and don't want to use a virtual machine. I thought that GitHub Codespaces might offer different operating systems to choose from, but nope.

Andrei-Aksionov commented 1 month ago

I have found the issue: the encoding wasn't specified in the open function, which caused the error on a Windows machine. I don't know why none of the CI checks for this PR failed. @carmocca Maybe the PR with a fix should also add with open(..., encoding='utf-8') wherever it's not specified? I'm just not hugely confident with encodings. As I understand it, it should be safe to specify this encoding for all non-binary files.

carmocca commented 1 month ago

I agree. It should always be added for Windows, where the default encoding is different.
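The encoding point above can be illustrated with a small, self-contained sketch. It is not the actual litgpt code or the failing test: the sample string, the file name, and the helper `roundtrip` are hypothetical, and cp1252 is used here only as an example of a legacy Windows default. The sample contains U+2581, whose UTF-8 bytes include 0x81, a byte that cp1252 cannot decode.

```python
import tempfile
from pathlib import Path

# Hypothetical sample: U+2581 is stored in UTF-8 as the bytes E2 96 81.
SAMPLE = '{"token": "\u2581hello"}'


def roundtrip(read_encoding):
    """Write SAMPLE as UTF-8, then read it back with the given encoding."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.json"
        with open(path, "w", encoding="utf-8") as f:
            f.write(SAMPLE)
        # Without encoding=..., open() falls back to a locale-dependent
        # default, which is why the same code can behave differently per OS.
        with open(path, encoding=read_encoding) as f:
            return f.read()


if __name__ == "__main__":
    print(roundtrip("utf-8") == SAMPLE)  # explicit UTF-8 reads the text back intact
    try:
        roundtrip("cp1252")  # simulates a legacy Windows default
    except UnicodeDecodeError as err:
        print(f"cp1252 failed: {err}")
```

Passing encoding="utf-8" on both the write and the read makes the result independent of the platform default.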