NVIDIA / FasterTransformer

Transformer-related optimization, including BERT and GPT
Apache License 2.0

GPT-NeoX gives poor results using FP16 #602


eycheung commented 1 year ago

Branch/Tag/Commit

main

Docker Image Version

none

GPU name

T4

CUDA Driver

525.60.13

Reproduced Steps

## Steps
1. Download the public GPT-NeoX model https://huggingface.co/EleutherAI/pythia-70m
2. Convert the checkpoint using `huggingface_gptneox_convert.py` (steps 1–2 are sketched below the list)
3. Run the example script https://github.com/NVIDIA/FasterTransformer/blob/main/examples/pytorch/gptneox/gptneox_example.py with an input file containing: "What is the boiling point of water?"
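
For reference, a minimal sketch of steps 1–2. The converter script path and its flags (`-in_file`, `-saved_dir`, `-infer_gpu_num`, `-weight_data_type`) are assumptions based on FasterTransformer's other HuggingFace converter scripts; check the script's own argparse for the exact names.

```python
# Sketch of steps 1-2; the converter flag names are assumptions,
# verify them against huggingface_gptneox_convert.py's argparse.
import subprocess
from huggingface_hub import snapshot_download

# Step 1: fetch the HuggingFace checkpoint locally.
model_dir = snapshot_download(repo_id="EleutherAI/pythia-70m")

# Step 2: convert to FasterTransformer format for a single GPU.
subprocess.run(
    [
        "python",
        "examples/pytorch/gptneox/utils/huggingface_gptneox_convert.py",
        "-in_file", model_dir,
        "-saved_dir", "ft_pythia_70m",
        "-infer_gpu_num", "1",
        "-weight_data_type", "fp16",
    ],
    check=True,
)
```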

## Environment
* torch version: `2.0.1+cu117`
* transformers version: `4.29.0`

## Inference Settings
* `inference_data_type = fp16`
* beam_width = 1
* output_len = 60
* repetition_penalty = 1.1

The remaining parameters used the default values from HuggingFace's [GenerationConfig](https://huggingface.co/docs/transformers/v4.29.0/en/main_classes/text_generation#transformers.GenerationConfig).
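
For comparison, the same settings can be reproduced with plain HuggingFace `transformers` (an illustrative reference run, not the FasterTransformer path; `num_beams` and `max_new_tokens` correspond to `beam_width` and `output_len`):

```python
# Reference generation with vanilla transformers, using the same
# settings as the FasterTransformer run, for output comparison.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m", torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("What is the boiling point of water?", return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    num_beams=1,           # beam_width = 1
    max_new_tokens=60,     # output_len = 60
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```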

## Result
The output contains many nonsense tokens; the same setup works fine with fp32.
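
fp16's narrow range and coarse precision are the usual suspects for this kind of degradation; the toy snippet below illustrates both failure modes (illustrative only, unrelated to FasterTransformer's actual kernels):

```python
import torch

# fp16 overflows above 65504: activations or logits past this range
# become inf, after which softmax yields NaNs and garbage tokens.
x = torch.tensor([60000.0, 70000.0])
print(x.half())  # tensor([60000., inf], dtype=torch.float16)

# fp16 also drops small increments: above 2048 the spacing between
# representable values is 2, so adding 1 is a no-op.
y = torch.tensor(2048.0, dtype=torch.float16)
print(y + 1)     # tensor(2048., dtype=torch.float16)
```
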
byshiue commented 10 months ago

FasterTransformer development has transitioned to TensorRT-LLM.

Could you give it a try?
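
For reference, recent TensorRT-LLM releases expose a high-level `LLM` API along the lines of the quickstart sketch below; whether a GPT-NeoX/Pythia checkpoint loads directly this way is an assumption and should be checked against TensorRT-LLM's model support matrix.

```python
# Quickstart-style sketch with TensorRT-LLM's high-level LLM API.
# Direct loading of EleutherAI/pythia-70m is an assumption; check the
# support matrix for GPT-NeoX family coverage before relying on this.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="EleutherAI/pythia-70m")
params = SamplingParams(max_tokens=60, repetition_penalty=1.1)

for output in llm.generate(["What is the boiling point of water?"], params):
    print(output.outputs[0].text)
```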