-
@stevebottos Thanks for the really great code! I am learning a lot. One quick issue: I am not able to get check_zero_shot_results.ipynb working (it may be based on an older version of your code). C…
-
Hi!
Thanks for the great work!
I encountered an issue during the pretraining stage.
I was fine-tuning the vision tower, the linear adapter, and the Large Language Model (LLM) in the pretraining sta…
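For reference, here is a minimal PyTorch sketch of the trainability setup described above. The module names (`vision_tower`, `adapter`, `llm`) are hypothetical placeholders, not the repo's actual classes; note that many LLaVA-style recipes freeze the vision tower and LLM during pretraining and train only the adapter, which differs from the fully unfrozen setup in this issue:

```python
import torch.nn as nn

# Toy stand-in for a vision-language model; the submodule names
# (vision_tower, adapter, llm) are hypothetical placeholders.
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(16, 16)
        self.adapter = nn.Linear(16, 16)
        self.llm = nn.Linear(16, 16)

model = ToyVLM()

# A common pretraining recipe: freeze the vision tower and the LLM,
# and train only the linear adapter.
for p in model.vision_tower.parameters():
    p.requires_grad = False
for p in model.llm.parameters():
    p.requires_grad = False
for p in model.adapter.parameters():
    p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

Comparing `trainable` between the two recipes is a quick way to confirm which parameters actually receive gradients during the pretraining stage.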
-
**Suggested steps:**
* [ ] Define unsupervised learning tasks, i.e., learning tasks that don't require truth-level labels but instead rely solely on the reconstruction-level data. This is the same…
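To make the distinction concrete, here is a minimal, hypothetical sketch of a label-free reconstruction objective (a linear autoencoder via PCA) using only NumPy; it stands in for whatever reconstruction-level quantities the project actually uses, and none of the names below come from the repo:

```python
import numpy as np

# Hypothetical example: a reconstruction objective needs no truth-level
# labels -- the input data itself is the training target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # stand-in for reconstruction-level data
Xc = X - X.mean(axis=0)            # center the features

# PCA as a simple linear "autoencoder": encode to k dims, decode back.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_rec = (Xc @ Vt[:k].T) @ Vt[:k]   # encode then decode

# The unsupervised loss: mean squared reconstruction error.
mse = float(np.mean((Xc - X_rec) ** 2))
```

The same pattern (input in, input as target, reconstruction error as loss) carries over to nonlinear autoencoders trained with gradient descent.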
-
While I think of a more appropriate medium for the report below, I will log everything that has been done in the research since July.
### Sprint 1
#24
Basically, I reviewed what …
-
- https://arxiv.org/abs/2102.03334
- 2021
Vision-and-Language Pre-training (VLP) improves performance on a variety of joint vision-and-language downstream tasks.
Current VLP approaches rely heavily on the image feature extraction process, most of which involves region supervision (e.g., object detection) and convolutional architectures (e.g., ResNet).
Ignored in the literature…
e4exp updated
3 years ago
-
-
Please add the following four papers which use transformer backbones:
1. Egocentric video-language pre-training, which addresses video-text retrieval, video classification, text-guided video grounding, t…
-
I am trying to use TrOCR for recognizing Urdu text from images. For the feature extractor, I am using DeiT, with bert-base-multilingual-cased as the decoder. I can't figure out what the requirements will be if I…
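Wiring an arbitrary vision encoder to a text decoder is what the `transformers` `VisionEncoderDecoderModel` class does. Below is a hedged, offline sketch using toy-sized configs so it runs without downloading checkpoints; the real setup would instead call `VisionEncoderDecoderModel.from_encoder_decoder_pretrained` with a DeiT checkpoint and `bert-base-multilingual-cased`:

```python
from transformers import (
    BertConfig,
    DeiTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Toy-sized configs so this runs offline; a real run would pass checkpoint
# names (a DeiT encoder and bert-base-multilingual-cased) to
# VisionEncoderDecoderModel.from_encoder_decoder_pretrained instead.
enc_cfg = DeiTConfig(
    image_size=32, patch_size=8, hidden_size=32,
    num_hidden_layers=2, num_attention_heads=2, intermediate_size=64,
)
dec_cfg = BertConfig(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=64,
    is_decoder=True,            # BERT must act as a causal decoder here
    add_cross_attention=True,   # so it can attend to the image features
)
cfg = VisionEncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
model = VisionEncoderDecoderModel(config=cfg)
```

The key requirements are on the decoder side: it must be configured with `is_decoder=True` and `add_cross_attention=True`, and you still need an Urdu-capable tokenizer plus paired image/text data to fine-tune.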
-
Curated selection of Weibo content
-
# 🐛 Bug
I'm trying to create a 1:1 config that can train a stable ViT-B with the MAE config (from appendix A.2).
Maybe I'm missing something (highly plausible), but when I use xformers instead …
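For comparison while debugging the 1:1 config, here are the ViT-B pre-training hyper-parameters as I read them from the MAE paper's recipe; this is a hedged summary for cross-checking, not the repo's actual config, so verify each value against the appendix you are reproducing:

```python
# MAE ViT-B pre-training hyper-parameters as I read them from the paper's
# recipe; double-check against the appendix before relying on them.
mae_pretrain = {
    "optimizer": "AdamW",
    "betas": (0.9, 0.95),
    "base_lr": 1.5e-4,          # scaled linearly by batch_size / 256
    "weight_decay": 0.05,
    "batch_size": 4096,
    "lr_schedule": "cosine",
    "warmup_epochs": 40,
    "epochs": 800,
    "mask_ratio": 0.75,
    "augmentation": "RandomResizedCrop + horizontal flip",
    "norm_pix_loss": True,      # normalized-pixel reconstruction target
}

# Effective learning rate under the linear scaling rule.
effective_lr = mae_pretrain["base_lr"] * mae_pretrain["batch_size"] / 256
```

If the xformers run diverges where the reference run is stable, comparing each of these values (especially the warmup length and the lr scaling) between the two configs is a cheap first check before suspecting the attention backend itself.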