NVIDIA / FasterTransformer

Transformer-related optimization, including BERT, GPT
Apache License 2.0

Can I use the sampling layer separately? #477

Open moonscar opened 1 year ago

moonscar commented 1 year ago

In the examples folder, FasterTransformer shows how to use a whole model. My model structure has undergone some complex modifications, so it is hard for me to use these models directly. Can I build a .so separately so that the sampling layer can be used on its own?

byshiue commented 1 year ago

You can refer to the example with this flag https://github.com/NVIDIA/FasterTransformer/blob/main/examples/pytorch/gpt/multi_gpu_gpt_example.py#L138, which encapsulates only the GPT transformer layers.

moonscar commented 1 year ago

I have seen this sample code. The use_gpt_decoder_ops flag makes the model use the ParallelGPT class instead of the Gpt class, but both paths use torch.classes.FasterTransformer.DynamicDecodeOp during inference, and the GPT layers cannot be separated out within this module. Am I missing some information?

byshiue commented 1 year ago

You can use custom sampling kernels, but you need to handle the cache update correctly yourself. For convenience, we use DynamicDecodeOp directly.
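
For concreteness, here is a rough sketch of what the sampling part of a custom decoding step could look like if you replace it with your own PyTorch code. This is an illustration, not FasterTransformer code; `top_k_top_p_sample` is a hypothetical helper, and DynamicDecodeOp additionally handles beam search, stop criteria, and cache bookkeeping that is not shown here.

```python
import torch


def top_k_top_p_sample(logits: torch.Tensor, top_k: int = 50,
                       top_p: float = 0.95, temperature: float = 1.0) -> torch.Tensor:
    """Sample one token id per batch row from [batch, vocab] logits."""
    logits = logits / temperature

    # Keep only the top_k highest logits.
    if top_k > 0:
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    # Nucleus (top_p) filtering on the remaining tokens.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum = probs.cumsum(dim=-1)
        # Drop tokens once the cumulative probability (excluding the current
        # token) exceeds top_p; the most probable token is always kept.
        mask = cum - probs > top_p
        sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # [batch, 1]


if __name__ == "__main__":
    logits = torch.randn(2, 32000)          # [batch, vocab]
    print(top_k_top_p_sample(logits).shape)  # torch.Size([2, 1])
```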

moonscar commented 1 year ago

We are using a non-standard GPT module, so we cannot use DynamicDecodeOp directly. Is there a way to use the sampling kernels on their own?

byshiue commented 1 year ago

As I described, you can use custom sampling kernels, but you need to handle the cache update yourself. That should not be too hard in the sampling case.
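
To illustrate what "handle the cache update" means once DynamicDecodeOp is out of the picture, here is a toy generation loop in plain PyTorch. The `ToyDecoder` below is a stand-in for the non-standard GPT module discussed above, not a FasterTransformer op; the point is only that when you drive decoding yourself, your loop owns the cache: run the prompt once to build it, then feed only the newest token each step and carry the updated cache forward.

```python
import torch


class ToyDecoder(torch.nn.Module):
    """Stand-in decoder with an incremental-decoding interface:
    given the newest token ids and the previous cache, it returns
    logits for the newest position and the updated cache."""

    def __init__(self, vocab_size: int = 1000, hidden: int = 64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.proj = torch.nn.Linear(hidden, vocab_size)

    def forward(self, new_ids, cache=None):
        h = self.embed(new_ids)                    # [batch, new_len, hidden]
        # A real model would attend over the cached keys/values; here we just
        # concatenate to show where the cache grows at every step.
        cache = h if cache is None else torch.cat([cache, h], dim=1)
        logits = self.proj(cache[:, -1:, :])       # logits for the last position
        return logits, cache


@torch.no_grad()
def generate(model, input_ids, max_new_tokens=8):
    # Prefill: run the whole prompt once and build the initial cache.
    logits, cache = model(input_ids)
    output_ids = input_ids
    for _ in range(max_new_tokens):
        # Custom sampling step (greedy here; swap in your own kernel).
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        output_ids = torch.cat([output_ids, next_id], dim=1)
        # Cache update: feed only the newest token and carry the cache forward.
        logits, cache = model(next_id, cache)
    return output_ids


if __name__ == "__main__":
    model = ToyDecoder()
    prompt = torch.randint(0, 1000, (2, 5))
    print(generate(model, prompt).shape)  # torch.Size([2, 13])
```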