SHI-Labs / Compact-Transformers

Escaping the Big Data Paradigm with Compact Transformers, 2021 (Train your Vision Transformers in 30 mins on CIFAR-10 with a single GPU!)
https://arxiv.org/abs/2104.05704
Apache License 2.0

Information about Text Classifier #73

Closed SethPoulsen closed 1 year ago

SethPoulsen commented 1 year ago

Your README in the nlp folder refers to a text classifier called "Transformer-Lite: Lightweight Transformer"

but I can't find info in the READMEs or in the paper about the text classification performance, or about running this model.

Everything I can find in the other READMEs and in the paper is about image-related tasks. Where can I find out more about the text classifier?

Thanks.

stevenwalton commented 1 year ago

Hi Seth, thanks for taking interest in our work.

Transformer-Lite: Lightweight Transformer

This just means that our baseline NLP model is a (single-directional) encoder-only transformer. The architecture should remind you of BERT (though it is not identical), or more accurately, a model composed only of the encoder (left side) of the network described in Section 3 of AIAYN.
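
A minimal sketch of that idea in PyTorch (illustrative only; this is not the repo's actual code, and the names and sizes are made up, with positional information omitted for brevity):

```python
import torch
import torch.nn as nn

class TinyEncoderClassifier(nn.Module):
    """Encoder-only transformer text classifier (illustrative sketch, not the repo's code)."""
    def __init__(self, vocab_size, embed_dim=300, num_heads=6, num_layers=2, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, padding_mask=None):
        x = self.embed(token_ids)                               # (batch, seq_len, embed_dim)
        x = self.encoder(x, src_key_padding_mask=padding_mask)  # self-attention over the sequence
        return self.head(x.mean(dim=1))                         # pool over the sequence, then classify

model = TinyEncoderClassifier(vocab_size=20000)
logits = model(torch.randint(0, 20000, (8, 64)))                # (8, num_classes)
```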

The NLP section is small and lives in Appendix G. Table 10 has results for AGNews, TREC, SST, IMDb, and DBpedia; it's a short list. We were not as thorough with this section since the focus is on vision. You can think of the "patch and embed" stage of the ViT as (loosely) analogous to tokenization and embedding in the standard NLP setting, and of CCT's vision embedding as similar to "subword" tokenization, but be careful with these (and any) analogies; they do not carry over to the NLP experiments.

As for the code, we probably could have organized it a bit better, but it is still under src. Here's the direct link. You should notice that there is almost nothing different (intentionally), except that we call the TextTokenizer and our Embedder.
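
Roughly, the text path swaps the image patch/conv tokenizer for a word-embedding front-end ahead of the same encoder. A hedged sketch of that structure (the class and argument names below are illustrative, not the repo's actual TextTokenizer):

```python
import torch
import torch.nn as nn

class ToyTextTokenizer(nn.Module):
    """Word embedding + 1D convolution front-end (a sketch of the idea, not the repo's class)."""
    def __init__(self, vocab_size, word_dim=300, embed_dim=128, kernel_size=1):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.conv = nn.Conv1d(word_dim, embed_dim, kernel_size=kernel_size)

    def forward(self, token_ids):                 # (batch, seq_len) integer ids
        x = self.word_embed(token_ids)            # (batch, seq_len, word_dim)
        x = self.conv(x.transpose(1, 2))          # convolve along the sequence axis
        return x.transpose(1, 2)                  # (batch, seq_len, embed_dim), fed to the encoder

tokens = torch.randint(1, 1000, (4, 32))
seq = ToyTextTokenizer(vocab_size=1000)(tokens)   # same shape contract as the vision tokenizer's output
print(seq.shape)                                  # torch.Size([4, 32, 128])
```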

I hope this clears things up. Let us know if you have any other questions.

SethPoulsen commented 1 year ago

Thanks so much for your detailed response, Steven; I'll have a look. I must have missed the appendix on my first read-through.

SethPoulsen commented 1 year ago

It looks like you also ran the vision-focused models on the text classification datasets. Were the tokenization/vectorization methods the same? Where are the details on that?

edit: oh, I see in the code now which models are being imported to use the CCT on text. Thanks!

stevenwalton commented 1 year ago

Great! If you don't mind, I'll close the issue then. If you have any further questions, don't hesitate to re-open it or start a new issue (if it's a new topic).

SethPoulsen commented 1 year ago

Have you released the hyperparameters/other training details that worked well for the text models? I see in the paper that you did a hyperparameter sweep and froze the embedding layer, but it looks like everything in /config and /examples is about the vision models. How can I find out which hyperparameters worked well and how the embedding layer was frozen?
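
For reference, freezing an embedding layer in PyTorch generally just means turning off its gradients; a hedged sketch (not necessarily how this repo does it, and the names/sizes are made up):

```python
import torch.nn as nn

# Hypothetical sketch: freeze an embedding layer so the optimizer never updates it.
embed = nn.Embedding(num_embeddings=20000, embedding_dim=300)
for p in embed.parameters():
    p.requires_grad = False

# Build the optimizer over only the parameters that remain trainable, e.g.:
# optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=5e-4)
```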

FYI, I'm trying to replicate the text classifier training so that I can extend it to my own domain: I am working on NLP for grading students' written mathematical proofs.

stevenwalton commented 1 year ago

Training

It is mostly the same as for the vision models. I'll quote the paper for the differences:

We treat text as single channel data and the embedding dimension as size 300. Additionally, the convolution kernels have size 1. Finally, we include masking in the typical manner.

It's been over a year since we ran those experiments, but I don't recall doing anything differently, and the training script should have all the details; the architecture names specify the architectures. I just checked the private repo and some old files and didn't see anything different at a quick glance. I also don't recall doing an extensive hyperparameter search (@alihassanijr, can you verify?), so I wouldn't be surprised if you could get improved performance.
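
On the masking point, a minimal sketch of the "typical manner" with a standard PyTorch encoder (illustrative only, not this repo's exact code): pad sequences to a common length and pass a boolean key-padding mask so attention ignores the pad positions.

```python
import torch
import torch.nn as nn

PAD_ID = 0
token_ids = torch.tensor([[5, 17, 42, 0, 0],          # batch of two padded sequences
                          [9,  3,  7, 8, 2]])
padding_mask = token_ids.eq(PAD_ID)                   # True where a position is padding

embed = nn.Embedding(100, 300, padding_idx=PAD_ID)    # 300-dim embedding, as in the paper's appendix
layer = nn.TransformerEncoderLayer(d_model=300, nhead=6, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

out = encoder(embed(token_ids), src_key_padding_mask=padding_mask)   # pad positions are ignored
print(out.shape)                                      # torch.Size([2, 5, 300])
```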

As for your application, we're definitely interested in the results! That's an interesting topic. But I do need to stress that the focus of CCT is on vision, and because of that I have a harder time giving advice on NLP. Specifically, we were researching whether we could change the tokenization and embedding stage of ViTs so as to reduce memory usage, increase throughput, and improve accuracy (or stay competitive), all at the same time. We always appreciate more citations, but I want to make sure you have the best tool for your research. The insights from our vision work may not be as useful for NLP tasks, where many of these problems don't exist (transformers are quite successful on small datasets without pre-training). If you're using images of the math equations as your input, I'd highly recommend CCT, but if you're inputting LaTeX then I have a harder time providing intuition.

SethPoulsen commented 1 year ago

The problem I am running into now is that GloVe outputs a tensor of floats, but the embedding layer that TextCCT starts with seems to expect a tensor of integers. Is there some configuration option I am missing?
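
A likely explanation, assuming the first layer is a standard `torch.nn.Embedding`: embedding layers take integer token indices and look up the float vectors themselves, so pretrained GloVe weights are usually loaded into the layer and the model is fed indices rather than vectors. A hedged sketch:

```python
import torch
import torch.nn as nn

# Sketch: a standard nn.Embedding expects integer token indices, not precomputed float vectors.
glove_weights = torch.randn(20000, 300)        # stand-in for a real (vocab_size, 300) GloVe matrix
embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)

token_ids = torch.randint(0, 20000, (8, 64))   # LongTensor of indices, shape (batch, seq_len)
vectors = embed(token_ids)                     # (8, 64, 300) float vectors looked up by the layer
```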

NLP tasks, where many of these problems don't exist (transformers are quite successful on small datasets without pre-training)

Could you point me to any specific models? I liked your models because they are transformers with a low parameter count that show good performance on small datasets without pre-training. Any other transformers I can find that perform well on small datasets have huge parameter counts and were pre-trained on some huge dataset beforehand, which I am trying to avoid if possible (though I am going to try both to compare anyway).

Thanks again for all your detailed responses; I really appreciate you taking the time to answer my queries.