-
Flash Attention can only be used with fp16 and bf16, not with fp32. Therefore, we should make flash attention optional in our codebase so that one can deactivate it during inference in exchange for hi…
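As a rough sketch of what such a switch could look like (the `attention` wrapper and `use_flash_attn` flag below are hypothetical names for illustration, not part of the existing codebase):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, use_flash_attn: bool = True):
    # Hypothetical dispatch: flash attention only supports fp16/bf16, so the
    # flag lets callers fall back to a plain implementation at inference time.
    if use_flash_attn and q.dtype in (torch.float16, torch.bfloat16):
        # PyTorch selects the flash kernel here when it is available.
        return F.scaled_dot_product_attention(q, k, v)
    # fp32-friendly fallback for higher-precision inference.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```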
-
I hope this message finds you well. First off, thank you for providing such an incredible project on large model inference. I've been utilizing it extensively and it's been instrumental for many of my…
-
Dear author, thank you for your excellent work. I would like to inquire when you plan to make all your code publicly available. I am looking forward to your reply. Thank you!
-
Attention mechanisms are widely used in deep learning models, particularly in large language models. A flexible attention kernel can help users build accelerated language models conveniently on…
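One way to picture that flexibility is a kernel that accepts a user-supplied score-modification callback, so masks and biases can be swapped without rewriting the kernel itself (a toy sketch; `flexible_attention` and `score_mod` are illustrative names, not any particular library's API):

```python
import math
import torch

def flexible_attention(q, k, v, score_mod=None):
    # Toy kernel: `score_mod` rewrites the raw attention scores, so the same
    # kernel can express causal masks, sliding windows, biases, etc.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    if score_mod is not None:
        scores = score_mod(scores)
    return torch.softmax(scores, dim=-1) @ v

# Example: plug in a causal mask without touching the kernel.
def causal(scores):
    n = scores.shape[-1]
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    return scores.masked_fill(mask, float("-inf"))

q = k = v = torch.randn(1, 4, 8, 16)  # (batch, heads, seq, head_dim)
out = flexible_attention(q, k, v, score_mod=causal)
```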
-
Your work is commendable. The attention to detail and the clarity of your findings are truly impressive. I was particularly intrigued by the utilization of "mask.mat" and "mask_3d_shif…
-
Dear Dr. Han and Dr. Ye,
I have been greatly impressed by your work on the Agent Attention model, as detailed in your recent publication and the associated GitHub repository. The method of integrat…
-
The break occurs when I train the rtdetr-l model for 300 epochs: training runs up to epoch 90, but when I use resume, the epochs start at 91 while the mAP, R, and P values become 0 and stay at 0. The training code is as fol…
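For reference, the usual resume call with the ultralytics API looks roughly like the sketch below; the checkpoint path is a placeholder, and this is not the truncated training code from the report above:

```python
from ultralytics import RTDETR

# Placeholder path: point this at the last.pt of the interrupted run.
model = RTDETR("runs/detect/train/weights/last.pt")
model.train(resume=True)  # should pick up at the saved epoch (91 here)
```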
-
Yes, the approach presented in the ConsistentID paper could potentially be re-architected to find better solutions. Here are a few ideas for improving the architecture and methodology:
**Inte…
-
ChatGPT is based on the GPT-3 architecture, which is a transformer-based language model that uses self-attention mechanisms to generate text. The model is trained on a large corpus of text data using …
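Since GPT-3 itself is not openly available, here is a minimal sketch of the same autoregressive, self-attention-based text generation using the openly available GPT-2 through the Hugging Face transformers API:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The model attends over the prompt with self-attention and emits one token
# at a time; generate() repeats this until max_new_tokens is reached.
ids = tok("Transformers generate text by", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=True, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```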
-
### Model description
Here is the model description:
> gte-Qwen1.5-7B-instruct is the latest addition to the gte embedding family. This model has been engineered starting from the [Qwen1.5-7B](https:…
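For context, embedding models in this family are typically queried as in the sketch below; the `Alibaba-NLP/gte-Qwen1.5-7B-instruct` repo id and the `trust_remote_code` flag are assumptions, so check the model card for the exact usage:

```python
from sentence_transformers import SentenceTransformer

# Repo id and trust_remote_code are assumptions; see the model card.
model = SentenceTransformer("Alibaba-NLP/gte-Qwen1.5-7B-instruct",
                            trust_remote_code=True)
docs = ["what is a flexible attention kernel?",
        "Flash Attention requires fp16 or bf16 inputs."]
emb = model.encode(docs, normalize_embeddings=True)
print(emb.shape)  # (2, embedding_dim)
```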