moonscar opened this issue 1 year ago
You can refer to the example with this flag https://github.com/NVIDIA/FasterTransformer/blob/main/examples/pytorch/gpt/multi_gpu_gpt_example.py#L138, which only encapsulates the GPT transformer layers.
I have seen this sample code. The use_gpt_decoder_ops parameter causes the model to use the ParallelGPT class instead of the Gpt class, but both models use torch.classes.FasterTransformer.DynamicDecodeOp during inference, and the GPT layers cannot be separated out from this module. Am I missing something?
You can use custom sampling kernels, but you need to handle the cache update correctly. For convenience, we use DynamicDecodeOp directly.
We are using a non-standard GPT module, so we cannot use DynamicDecodeOp directly. Is there a way to use the sampling kernels alone?
As I described, you can use custom sampling kernels, but you need to handle the cache update yourself. That would not be too hard in the sampling case.
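To make the suggestion concrete, here is a minimal, framework-free sketch of what "custom sampling plus manual cache update" means in a decode loop. Everything here is illustrative: `forward_fn`, the per-layer cache layout, and `top_k_sample` are assumptions standing in for your modified GPT layers and FasterTransformer's sampling kernels, not actual FasterTransformer APIs.

```python
import math
import random

def top_k_sample(logits, k=2, temperature=1.0, rng=None):
    """Sample a token id from the top-k entries of a logits list.

    Plain-Python stand-in for a top-k sampling kernel, for illustration only.
    """
    rng = rng or random.Random(0)
    # Keep the indices of the k largest logits, mask out the rest.
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in topk]
    # Numerically stable softmax over the k survivors.
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Multinomial draw.
    r, acc = rng.random(), 0.0
    for idx, p in zip(topk, probs):
        acc += p
        if r <= acc:
            return idx
    return topk[-1]

def decode_step(forward_fn, token, kv_cache, k=2):
    """One decoding step with a hand-rolled sampler.

    `forward_fn(token, kv_cache)` is a hypothetical wrapper around your
    modified GPT layers; it returns (logits, per_layer_kv) for this step.
    The caller -- not DynamicDecodeOp -- must append the new key/value
    entries to the cache before the next step.
    """
    logits, new_kv = forward_fn(token, kv_cache)
    for layer_cache, (key, value) in zip(kv_cache, new_kv):
        layer_cache["k"].append(key)    # this is the cache update that
        layer_cache["v"].append(value)  # DynamicDecodeOp otherwise manages
    return top_k_sample(logits, k=k)
```

The point of the sketch is the two `append` lines: if you replace DynamicDecodeOp with your own sampling, advancing the key/value cache each step becomes your responsibility.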
In the examples folder, FasterTransformer shows how to use a whole model. My model structure has undergone some complex modifications, so it is hard for me to use these models directly. Can I build a .so separately so that the sampling layer can be used on its own?