Lissanro opened this issue 3 years ago
I have also tested the original GPT-J-6B locally on my CPU, and the quality of the output is as high as the online version. Unfortunately, the quality of the output from the model in this repository is always awful. It seems worse to me than running a smaller transformer which could fit in 8GB VRAM without tricks. If somebody is getting good results with this repository, please post some examples as a reference so others would know if this is really possible. I think either I'm doing something wrong, or this repository's code does not really work.
Hi, thanks for bringing this up
> Am I doing something wrong, or is the severe reduction of quality a consequence of the RAM/VRAM memory savings?

The VRAM savings don't cause a loss in quality, since the computations are the same. Only the weights are transferred around.
The most likely cause is that the weights are not accurate at fp16, or that I have bad weights on the gdrive.
If you have the original 32-bit weights, or 16-bit weights from a more reliable source, could you try replacing the model loading with whatever you have and see if that runs correctly?
I'm using the original BF16 "slim" weights locally and they work great, so they are sufficiently accurate for inference. But I'm not sure how to use them with your code. After unpacking the original weights, I have a directory step_383500 with shard_0..7 folders, each containing .npz files.
I know that the README mentions https://github.com/arrmansa/saving-and-loading-large-models-pytorch/blob/main/Pickle.ipynb which was used to convert the weights, but I have no idea what to assign to the model variable.
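Since the exact tensor layout the repository expects is unclear to me, here is a minimal sketch (my own, not from either repo) that only walks the step_383500/shard_0..7 directory layout described above and collects the raw arrays from the .npz files, as a starting point for inspecting what is actually in the slim weights. Mapping the arrays onto the repository's parameter names would be a separate, model-specific step.

```python
import glob
import os

import numpy as np


def load_slim_shards(root):
    """Collect every array from <root>/shard_*/ *.npz files into one dict.

    Keys are "<shard_dir>/<array_name>". This only gathers raw arrays;
    it does NOT rename them to whatever the target code expects.
    """
    tensors = {}
    for shard_dir in sorted(glob.glob(os.path.join(root, "shard_*"))):
        shard = os.path.basename(shard_dir)
        for npz_path in sorted(glob.glob(os.path.join(shard_dir, "*.npz"))):
            with np.load(npz_path) as archive:
                for key in archive.files:
                    tensors[f"{shard}/{key}"] = archive[key]
    return tensors
```

Calling `load_slim_shards("step_383500")` and printing the keys and shapes should at least reveal which parameter goes where.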
To get the original GPT-J-6B working on CPU, I have used https://github.com/kingoflolz/mesh-transformer-jax/blob/master/resharding_example.py - I just ran the code as is, got the warning "No GPU/TPU found, falling back to CPU", and after a while I got the output. I also tested the infer() function a few more times with different input text to make sure I consistently get good results.
RAM usage was up to 66GB I think, so the system needs to have 128GB of RAM (perhaps 64GB with 64GB of swap on an SSD would also work). The original code is very inefficient on CPU: it uses at most two cores out of 16 (increasing cores_per_replica to 16 breaks it, so I am not sure how to improve this). This makes testing hard and slow; if somebody knows how to use all CPU cores with the original code, I would appreciate a suggestion. It would help to run more tests and compare results with the code in this repository.
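One untested thing worth trying for the CPU-core problem: the environment variables below control thread counts in common numerical backends (OpenMP, OpenBLAS, MKL). Whether the JAX/XLA CPU path in resharding_example.py actually honors them for this workload is an open question, so treat this purely as an experiment.

```shell
# Assumption: the CPU backend respects standard threading env vars.
export OMP_NUM_THREADS=16
export OPENBLAS_NUM_THREADS=16
export MKL_NUM_THREADS=16
# then run the script as before:
# python resharding_example.py
```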
If you cannot convert the original weights on your system, perhaps you could use the code from resharding_example.py as an example to update your Pickle.ipynb with complete code, so that it would load and save the weights in a format which works with your repository? I will then test it.
I've found some alternate pytorch checkpoints from https://rentry.org/jaxflawlessvictory
Easy Setup:

Local KoboldAI-ready Monolithic Pytorch checkpoint file:

- Checkpoint Converted by Author (.7z): https://mega.nz/file/z8QARTYI#rpjb54-rQh-76hHVEapfLfvNohj-R-_YZp21X4g5QHI
- Checkpoint Converted by KoboldAI Dev (.tar): https://drive.google.com/file/d/1-3OM_3lpY_HZY4yeOqEBflpA5jK2Jnz_/view?usp=sharing https://mega.nz/file/vbwwzBCR#G6rVf7WT43MHKNjfFL5iidomsPlc3xz7Ogjv3K2P8ZA
- Mirrored on Odyssee, Seeded by Henk717 (KoboldAI Contributor) (.7z)
These are loaded as:

```python
model = torch.load(PATH)
```
I can't guarantee that these will work, since I cannot try them myself: I have only 16GB of RAM, and loading from a checkpoint like this takes >30GB of RAM.
I'd like to point out that your timing (1.6 s/token) matches the timing I'm getting on a CPU-only server with the Hugging Face Transformers library (which uses the 24GB version of the model). What's the role of the VRAM in your project here? It seems there's no difference in performance compared to my RAM/CPU-only use case.
Hi, what CPU are you using and how much context are you giving it? This code was mainly made so that it could handle 2000-token context prompts in reasonable time (1.6 s/token).
My measurements are on a Xeon Gold 6126 at 2.6GHz, and my context is roughly similar.
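As an aside, the "24GB version of the model" mentioned above is consistent with simple arithmetic on the weight storage; the parameter count below is my own approximation of GPT-J-6B, not a figure from this thread.

```python
# Back-of-envelope weight-storage sizes (parameter count is an assumption).
params = 6.05e9                 # approx. GPT-J-6B parameter count
fp32_gb = params * 4 / 1e9      # 4 bytes per fp32 weight, decimal GB
fp16_gb = params * 2 / 1e9      # 2 bytes per fp16 weight
print(f"fp32: {fp32_gb:.1f} GB, fp16: {fp16_gb:.1f} GB")
# prints: fp32: 24.2 GB, fp16: 12.1 GB
```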
Not sure if relevant, but the code seems to use half-range floats; should it be half-precision (float16)? Just a thought.
Also having the same issue: the model is unusable with these weights because the results are highly inferior. Did anybody find a solution?
did anyone find a solution to this?
Not exactly a solution, but there's a dev version of KoboldAI that allows splitting the ML workload between GPU and CPU: https://github.com/henk717/KoboldAI
This version works with hfj-models that are found here: https://storage.henk.tech/KoboldAI/ I'm using it with the gpt-hfj-6b-adventure model on 6GB VRAM and it works correctly.
Even though the memory savings are great, I hoped that the quality would be the same, but it is not. For example, on https://6b.eleuther.ai/ I try the following prompt (highlighted in bold) and get a decent result:
But with this repository, the results are consistently bad (in both cases top-p=0.9 and temperature=1, but I also tried the default repository parameters, and it generates nonsense too). I generated 30 tokens at a time:
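For readers comparing the two setups: the top-p=0.9 / temperature=1 sampling used above can be sketched as follows (a minimal reference implementation of nucleus sampling, not the code of either repository).

```python
import numpy as np


def sample_top_p(logits, top_p=0.9, temperature=1.0, rng=None):
    """Nucleus (top-p) sampling with temperature over raw logits."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # softmax with max-subtraction for numerical stability
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # sort tokens by probability, descending
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    # keep the smallest prefix whose cumulative probability reaches top_p
    cutoff = int(np.searchsorted(cum, top_p)) + 1
    keep = order[:cutoff]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))
```

The point of the comparison in the thread is that with identical sampling settings, output quality differences must come from the weights or the forward pass, not from the sampler.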
The original GPT-J-6B does not lose the context, and the overall quality of each sentence is much higher. But GPT-J-6B from this repository, even in some cases when it does not lose the context right away, just generates nonsense, sometimes even of worse quality than what is shown above.
Am I doing something wrong, or is the severe reduction in quality a consequence of the RAM/VRAM memory savings? If the latter is the case, I suggest putting a warning about this in the README.
I used an RTX 2060 SUPER 8GB (with no connected displays, so it has all of its memory free); my CPU is a 5950X (16 cores) and I have 128GB of RAM. The biggest limit in my case is VRAM. I guess I could run the original GPT-J-6B CPU-only, but I hoped to use my GPU, so I tried this repository first.