allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Runtime error with convolution classifier #5384

Closed david-waterworth closed 3 years ago

david-waterworth commented 3 years ago

I've created a text classifier based on the cnn-highway encoder. It mostly works, but it sometimes crashes during training or evaluation because the input text isn't long enough for the CNN window size. It's mainly a problem during evaluation, where I've been evaluating single strings; during training, batching tends to add enough padding to mask the issue.

The error is:

RuntimeError: Calculated padded input size per channel: (3). Kernel size: (4). Kernel size can't be greater than actual input size

The model configuration is below. How do you pad each input sequence to a minimum length so the model doesn't occasionally fail?

{
    vocabulary: {
        type: "from_instances",
        non_padded_namespaces: [],
    },
    train_data_path: TRAIN_PATH,
    validation_data_path: DEV_PATH,
    dataset_reader: {
        type: "text_classification_json",
        tokenizer: {
          type: "character",
          lowercase_characters: false
        },
        token_indexers: {
            tokens: {
                type: "single_id",
            },
        }
    },
    model: {
        type: "basic_classifier",
        text_field_embedder: {
            token_embedders: {
                tokens: {
                    type: "embedding",
                    embedding_dim: EMBEDDING_DIM,
                },
            }
        },
        seq2vec_encoder: {
            # https://arxiv.org/pdf/1508.06615.pdf (Small)
            type: "cnn-highway",
            embedding_dim: EMBEDDING_DIM,
            filters: [[1,25],[2,50],[3,125],[4,150],[5,175],[6,200]],
            num_highway: 1,
            projection_dim: 200
        }
    },
    data_loader: {
        batch_size: BATCH_SIZE,
        shuffle: true
    },
    trainer: {
      cuda_device: CUDA_DEVICE,
      num_epochs: NUM_EPOCHS,
      optimizer: {
         lr: LEARNING_RATE,
         type: "adam"
      },
      patience: 1,
      validation_metric: "+accuracy"
   }
}
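
For reference, a 1-D convolution with no internal padding produces `(L - k) // stride + 1` output positions; with stride 1 that is `L - k + 1`, which is non-positive whenever the sequence is shorter than the kernel. A quick sanity check (plain Python, independent of AllenNLP) reproduces the numbers in the error above:

```python
def conv1d_output_length(seq_len: int, kernel_size: int, stride: int = 1) -> int:
    """Output length of a 1-D convolution with no padding ("valid" mode)."""
    return (seq_len - kernel_size) // stride + 1

# The error above: padded input size 3, kernel size 4 -> no valid window
# (PyTorch raises a RuntimeError rather than returning an empty output).
print(conv1d_output_length(3, 4))  # -> 0

# The widest filter in the config is width 6, so every input
# needs at least 6 tokens to pass through all convolutions.
filters = [[1, 25], [2, 50], [3, 125], [4, 150], [5, 175], [6, 200]]
min_tokens = max(width for width, _ in filters)
print(min_tokens)  # -> 6
```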
AkshitaB commented 3 years ago

@david-waterworth There is a field for max_sequence_length in the text_classification_json reader, but not a min_sequence_length. I think the simplest fix will be to add a field like that in the dataset reader (or create a new dataset reader), and deal with the padding accordingly.
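
That suggestion can be sketched with a small hypothetical helper (not part of the released `text_classification_json` reader): pad the token list up to a configurable `min_sequence_length` before indexing. The padding token here is hard-coded to `"@@PADDING@@"`, which is assumed to match AllenNLP's default; a real reader should obtain it from the vocabulary or tokenizer instead.

```python
from typing import List

def pad_to_min_length(tokens: List[str], min_length: int,
                      pad_token: str = "@@PADDING@@") -> List[str]:
    """Right-pad `tokens` with `pad_token` until it has at least
    `min_length` entries; sequences already long enough are unchanged."""
    if len(tokens) < min_length:
        tokens = tokens + [pad_token] * (min_length - len(tokens))
    return tokens

# A 3-character input padded up to the widest filter width (6):
print(pad_to_min_length(list("abc"), 6))
# -> ['a', 'b', 'c', '@@PADDING@@', '@@PADDING@@', '@@PADDING@@']
```

With a `min_sequence_length` of 6 (the widest filter in the config above), even a single short string indexes to enough tokens for every convolution to have a valid window.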

github-actions[bot] commented 3 years ago

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇

david-waterworth commented 3 years ago

@AkshitaB Yes, that should work fine, provided the tokeniser API allows you to obtain the padding token?
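
On the padding-token question: if I understand AllenNLP's conventions correctly (an assumption, worth verifying against the `Vocabulary` documentation), the default padding token is `@@PADDING@@` and it maps to index 0 in every padded namespace, so explicitly padded tokens index to the same zeros that batch padding produces. A toy stand-in for a padded namespace illustrates the idea:

```python
# Toy stand-in for a padded AllenNLP namespace: index 0 is assumed to be
# reserved for "@@PADDING@@" (verify against your Vocabulary instance).
vocab = {"@@PADDING@@": 0, "a": 2, "b": 3, "c": 4}

def index_tokens(tokens, vocab, oov_index=1):
    """Map tokens to ids, falling back to an assumed out-of-vocabulary index."""
    return [vocab.get(t, oov_index) for t in tokens]

padded = list("abc") + ["@@PADDING@@"] * 3
print(index_tokens(padded, vocab))  # -> [2, 3, 4, 0, 0, 0]
```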