BagOfWordCountsTokenEmbedder & BasicClassifier not compatible

allenai / allennlp

An open-source NLP research library, built on PyTorch.

http://www.allennlp.org

Apache License 2.0

11.77k stars 2.25k forks source link

BagOfWordCountsTokenEmbedder & BasicClassifier not compatible #3388

Open camelop opened 5 years ago

camelop commented 5 years ago

Bug Description BagOfWordCountsTokenEmbedder returns tensor with shape [batchsize, size of vocab (or projection dim)], while most of other embedders return [batchsize, num of tokens, size of hidden] This is at some level inconsistent with the definition of embedder, so seq2vec in BasicClassifier cannot be applied (the input is not even a sequence).

To Reproduce Use them together like

"model": {
        "type": "basic_classifier",
        "text_field_embedder": {
            "token_embedders": {
                "tokens": {
                    "type": "bag_of_word_counts"
                }
            }
        },
        "seq2vec_encoder": ... # any
    },

Expected behavior At least make the output of BagOfWordCountsTokenEmbedder looks like [bs, 1, vocab size] would help. (Although it will still be a bit confusing.)

System (please complete the following information):

OS: any (ubuntu 16.04)
Python version: 3.7
AllenNLP version: v0.9.0
PyTorch version: 1.3.0

brendan-ai2 commented 5 years ago

@kernelmachine , could you take a look at this? (As it looks like you're the author of the BagOfWordCountsTokenEmbedder.) @camelop's point seems like a good one, but I'm not sure what implications there are for existing code. Thanks!

matt-gardner commented 5 years ago

ping @kernelmachine

kernelmachine commented 5 years ago

Yes, I think unsqueezing the num_tokens dimension is a reasonable solution (it is a bit confusing, but we could make it clear in the documentation). The bag of words embedder is pretty isolated from the rest of the repo, so I don't think it'll be an issue. Contributions would be welcome!