arrmansa / Basic-UI-for-GPT-J-6B-with-low-vram

A repository to run GPT-J-6B on low-VRAM machines (4.2 GB minimum VRAM for a 2000-token context, 3.5 GB for a 1000-token context). Model loading requires 12 GB of free RAM.
Apache License 2.0

The results are much worse than with original GPT-J-6B #2

Open Lissanro opened 3 years ago

Lissanro commented 3 years ago

Even though the memory savings are great, I hoped that the quality would be the same, but it is not. For example, on https://6b.eleuther.ai/ I try the following prompt (highlighted in bold) and get a decent result:

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. As first reported in Andean News, “The unicorns live in the rural valley and are one of many animal species native to the mountains and can be found to this day, but they are a rare occurrence.” One of the scientists at the scene said, “They were so different from anything else that we had seen in our lifetime, so it was a surprise.”

The scientists were able to capture several of the unicorns and identified them as the first specimens ever found, and one unicorn was even carrying a pink umbrella. Additionally, it was found that a human female had been kidnapped by one of the unicorns and that the herd had a protector, a man who travels with them. One of the scientists said, “He had just given us the run of the valley because he didn’t want us to disturb the unicorns. We all know now that’s not going to happen.” It is hoped that the kidnapping is something of a sign that the humans and the unicorns can coexist, and there have been some initial concerns that the unicorns are not quite so friendly as they first seemed, for they refused to let anyone near the big udder.

But with this repository, the results are consistently bad (in both cases top-p=0.9 and temperature=1; I also tried the default repository parameters, and it generates nonsense too). I generated 30 tokens at a time:

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. So we found that these authors do not exist, that's why and how many years

"Yes, please?"

So much so fastidious house dust is created, you know what the author of "The Complete Book of Envato the Dream Chaser Wartime-Tropomorrah@candy_bunny-rraaaayyyyyy.... What the hell were those people who think the world around them have been taken aback when you told a lie. We are not supposed to speak the truth... but the facts do exist. The main problem with the human race is that they forget where they got their truth from. They say the whole universe is one big Lie the main source of all this universe is our perception; i.e., We are only creatures on our senses do not know what the hell they are. They think they are not supposed to know. It has no awareness of where it came from. What makes a rose flower look like an orchid, if it had some life

The original GPT-J-6B does not lose the context, and the overall quality of each sentence is much higher. But GPT-J-6B from this repository, even in the cases when it does not lose the context right away, just generates nonsense, sometimes even of worse quality than what is shown above.

Am I doing something wrong, or is the severe reduction in quality a consequence of the RAM/VRAM memory savings? If the latter is the case, I suggest putting a warning about this in the README.

I have used an RTX 2060 SUPER 8GB (with no connected displays, so all of its memory is free), my CPU is a 5950X (16 cores) and I have 128GB of RAM. The biggest limit in my case is VRAM; I guess I could run the original GPT-J-6B CPU-only, but I hoped to use my GPU, so I tried this repository first.

Lissanro commented 3 years ago

I have also tested the original GPT-J-6B locally on my CPU, and the quality of the output is as high as the online version. Unfortunately, the quality of the output from the model of this repository is always awful. It seems worse to me than running a smaller transformer which could fit in 8GB of VRAM without tricks. If somebody is getting good results with this repository, please post some examples as a reference so others would know whether this is really possible. I think either I'm doing something wrong, or this repository's code does not really work.

arrmansa commented 3 years ago

Hi, thanks for bringing this up

"Am I doing something wrong, or is the severe reduction in quality a consequence of the RAM/VRAM memory savings?"

The VRAM savings don't cause a loss in quality, since the computations are the same; only the weights are transferred around.
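For anyone curious, the general idea is roughly the following (a minimal sketch of the offloading pattern, not the repository's exact code; `blocks` stands in for the model's transformer blocks):

```python
import torch

def offloaded_forward(blocks, hidden_states, device="cuda"):
    # Weights stay in CPU RAM; each block is copied to the GPU only for the
    # duration of its own forward pass, so the arithmetic is identical to
    # running the whole model on one device -- it is just slower.
    hidden_states = hidden_states.to(device)
    for block in blocks:
        block.to(device)                      # copy this block's weights into VRAM
        hidden_states = block(hidden_states)  # same computation as usual
        block.to("cpu")                       # free VRAM for the next block
    return hidden_states
```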

The most likely cause is that the weights are not accurate at fp16, or that I have bad weights on the gdrive.
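As a toy illustration of what "not accurate at fp16" can mean in the worst case (not taken from this repository's code): float16 has a much smaller exponent range than bfloat16/float32, so a careless downcast can silently overflow.

```python
import torch

x = torch.tensor([70000.0, 3.0], dtype=torch.float32)
print(x.to(torch.bfloat16))  # no overflow: bfloat16 keeps the float32 exponent range (with less precision)
print(x.to(torch.float16))   # 70000 overflows to inf, since float16 maxes out around 65504
```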

If you have the original 32-bit weights, or 16-bit weights from a more reliable source, could you try replacing the model loading with whatever you have and running it?

Lissanro commented 3 years ago

I'm using the original BF16 "slim" weights locally and they work great, so they are sufficiently accurate for inference. But I'm not sure how to use them with your code. After unpacking the original weights, I have a directory step_383500 with shard_0..7 folders, each containing .npz files.

I know that the README mentions https://github.com/arrmansa/saving-and-loading-large-models-pytorch/blob/main/Pickle.ipynb, which was used to convert the weights, but I have no idea what to assign to the model variable.

To get the original GPT-J-6B working on CPU, I have used https://github.com/kingoflolz/mesh-transformer-jax/blob/master/resharding_example.py - I just ran the code as is, got the warning "No GPU/TPU found, falling back to CPU" and after a while I got the output. I also tested the infer() function a few more times with different input text to make sure I consistently get good results.

RAM usage was up to 66GB I think, so the system needs to have 128GB of RAM (perhaps 64GB with a 64GB swap on an SSD will also work). The original code is very inefficient on CPU: it uses at most two cores out of 16 (increasing cores_per_replica to 16 breaks it, so I'm not sure how to improve this). This makes testing hard and slow; if somebody knows how to use all CPU cores with the original code, I would appreciate a suggestion - it would help to run more tests and compare results with the code in this repository.

If you cannot convert the original weights on your system, perhaps you could use the code from resharding_example.py as an example to update your own Pickle.ipynb with complete code, so that it would load and save the weights in the format which works with your repository? I will then test it.

arrmansa commented 3 years ago

I've found some alternate PyTorch checkpoints at https://rentry.org/jaxflawlessvictory

Easy Setup:

Local KoboldAI-ready Monolithic Pytorch checkpoint file: Checkpoint Converted by Author (.7z)

https://mega.nz/file/z8QARTYI#rpjb54-rQh-76hHVEapfLfvNohj-R-_YZp21X4g5QHI

Checkpoint Converted by KoboldAI Dev (.tar)

https://drive.google.com/file/d/1-3OM_3lpY_HZY4yeOqEBflpA5jK2Jnz_/view?usp=sharing
https://mega.nz/file/vbwwzBCR#G6rVf7WT43MHKNjfFL5iidomsPlc3xz7Ogjv3K2P8ZA

Mirrored on Odysee, seeded by Henk717 (KoboldAI Contributor) (.7z)

https://odysee.com/@henk717:1/gpt-j-6b-hf:c

These are loaded with model = torch.load(PATH).
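Roughly, using such a monolithic checkpoint would look like this (a minimal sketch, assuming it really is a whole pickled module; the file name is just a placeholder):

```python
import torch

# Load the whole pickled module onto the CPU first (this is the step that
# needs a lot of free RAM), then move or offload it from there.
# Note: loading a whole-module pickle requires the original model class to be importable.
model = torch.load("gpt-j-6b.ckpt", map_location="cpu")
model.eval()
```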

I can't guarantee that these will work, since I cannot try them myself: I have only 16GB of RAM, and loading from a checkpoint like this takes >30GB of RAM.

chris-aeviator commented 3 years ago

I'd like to point out that your timings (1.6 s/token) match the timings I'm getting on a CPU-only server with the Hugging Face Transformers library (which uses the 24GB version of the model). What's the role of the VRAM in your project here? It seems there's no difference in performance compared to my RAM/CPU-only use case.

arrmansa commented 3 years ago

"I'd like to point out that your timings (1.6 s/token) match the timings I'm getting on a CPU-only server with the Hugging Face Transformers library (which uses the 24GB version of the model). What's the role of the VRAM in your project here? It seems there's no difference in performance compared to my RAM/CPU-only use case."

Hi, what CPU are you using and how much context are you giving it? This code was mainly made so that it could handle 2000-token context prompts in a reasonable time (1.6 s/token).

chris-aeviator commented 3 years ago

My measurements are on a Xeon Gold 6126 (2.6 GHz); my context is roughly similar.

Deltrego commented 3 years ago

Not sure if relevant, but the code seems to use half-range floats - should it be half-precision (float16)? Just a thought.
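If it helps to check, the dtypes the loaded weights actually use can be printed directly (a one-line sketch, assuming `model` is the module loaded by this repository's code):

```python
# Print the set of parameter dtypes present in the loaded model, e.g. {torch.float16}
print({p.dtype for p in model.parameters()})
```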

liuzzi commented 3 years ago

Also having the same issue; the model is unusable with these weights because the results are highly inferior. Did anybody find a solution?

atl333 commented 2 years ago

Did anyone find a solution to this?

z80maniac commented 2 years ago

Not exactly a solution, but there's a dev version of KoboldAI that allows splitting the ML workload between the GPU and CPU: https://github.com/henk717/KoboldAI

This version works with hfj-models that are found here: https://storage.henk.tech/KoboldAI/

I'm using it with the gpt-hfj-6b-adventure model on 6GB of VRAM and it works correctly.