emanjavacas / pie

A fully-fledged PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages.
MIT License

Fixed AttentionalDecoder truncating to 20 chars #85

Open PrinsINT opened 11 months ago

PrinsINT commented 11 months ago

When labels for a task are longer than 20 characters, they were truncated to 20 characters during decoding. With this fix, we instead first look at the length of the longest known token.

Before this fix, we would get extremely low test scores for datasets with, for example, long part-of-speech tags: after all, truncated labels don't match the original ones.
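
Schematically, the fix replaces the hard-coded cap with a value derived from the training data. A minimal sketch of the idea (illustrative names, not the exact patch):

    # Illustrative sketch, not the exact patch: derive the decoding cap from
    # the longest label seen in training instead of a hard-coded 20.
    def max_decode_len(known_labels, default=20, extra=2):
        # 'extra' leaves room for symbols like <bos>/<eos> around the label
        if not known_labels:
            return default
        return max(default, max(len(label) for label in known_labels) + extra)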

#73 / #74 almost completely rewrites AttentionalDecoder.predict_max() for a speed-up. However, it still contains the same bug mentioned here. Notably, our fix greatly slows down predict_max() for long labels: evaluating the same set and model went from 2m44s before this fix to 11m00s after it.
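
The slowdown is expected: predict_max() decodes greedily, one output character per step, so runtime grows with the decoding cap. A toy stand-in that shows the scaling (nothing here is pie's actual code):

    # Toy stand-in, not pie's code: one decoding step per output character,
    # so raising the cap from 20 to the length of the longest known label
    # multiplies the number of steps for every batch.
    import torch

    def toy_decode(batch=12576, vocab=50, max_seq_len=20):
        prev = torch.zeros(batch, dtype=torch.long)
        for _ in range(max_seq_len):            # 20 steps before the fix
            logits = torch.randn(batch, vocab)  # stands in for a real RNN step
            prev = logits.argmax(dim=-1)        # greedy choice per batch item
        return prev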

Therefore, I'm pinging @PonteIneptique as well, as I'm not sure how this fix would work in the improved predict_max().

Naively inserting the fix at line 344 and removing line 424 throws an error:

Traceback (most recent call last):
  File "/pie/pie/evaluate.py", line 84, in <module>
    run(model_path=args.model_path, test_path=args.test_path,
  File "/pie/pie/evaluate.py", line 56, in run
    for task in model.evaluate(testset, trainset,
  File "/pie/pie/pie/models/base_model.py", line 83, in evaluate
    preds = self.predict(inp, **kwargs)
  File "/pie/pie/pie/models/model.py", line 314, in predict
    hyps, prob = decoder.predict_max(
  File "/pie/pie/pie/models/decoder.py", line 362, in predict_max
    outs, _ = self.attn(outs, enc_outs, lengths)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/pie/pie/pie/models/attention.py", line 164, in forward
    mask = mask.unsqueeze(0).expand_as(weights)
RuntimeError: The expanded size of the tensor (20) must match the existing size (43) at non-singleton dimension 2.  Target sizes: [1, 12576, 20].  Tensor sizes: [1, 12576, 43]
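
For reference, the clash can be reproduced in isolation. Presumably one of the two tensors is still built with the old 20-step cap while the other already covers the full length (shapes copied from the traceback; this is not pie's code):

    # Standalone reproduction using the shapes from the traceback above:
    import torch

    weights = torch.zeros(1, 12576, 20)             # still capped at 20
    mask = torch.ones(12576, 43, dtype=torch.bool)  # built for length 43
    mask.unsqueeze(0).expand_as(weights)            # RuntimeError at dim 2
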
PonteIneptique commented 11 months ago

Hey :) I continued developing Pie on the PaPie repository. Given https://github.com/emanjavacas/pie/issues/84#issuecomment-1647598500, I don't know if @emanjavacas is going to keep supporting Pie (I'll let him reply :D). As for the fix: while the issue you note is important, I am a bit puzzled by this sentence in the description of the fix:

> Before this fix, we would get extremely low test scores for datasets with, for example, long part-of-speech tags

Normally, part-of-speech tags are categorical, not generated. Can you provide your configuration?

As for the lemmas, though, you are right that it might be an issue; I'll check that next week (I am going on vacation tonight).

emanjavacas commented 11 months ago

Hi both,

We could just increase the default max_seq_len to accommodate longer labels. However, this decoder is only meant for tasks like lemmatization, and it's hard to imagine a language with lemmas longer than 20 characters; still, we could enlarge the default. If you are using this decoder for POS tagging, I cannot see the use for that right now. POS labels are categorical anyway, so you could always map them to integers or use whatever coding scheme you like.
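
For example (purely illustrative):

    # Purely illustrative: treat POS tags as integer classes, so the length
    # of a tag never matters for decoding.
    tags = ["NOUN", "VERB", "ADJ|Degree=Cmp|some-very-long-composite-tag"]
    tag2idx = {tag: i for i, tag in enumerate(sorted(set(tags)))}
    idx2tag = {i: tag for tag, i in tag2idx.items()}
    # a classifier predicts an index; map back with idx2tag[predicted_index]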

I am not going to commit to major changes in PIE, but I think it's still quite robust and should meet the demands of many users. I tested it on PyTorch 2.0 and it passes all tests. But I don't want to change it in a way that makes it harder to maintain, which is probably the reason why Thibault (hi Thibault!) forked it.

PrinsINT commented 11 months ago

The config file used had the tasks defined as:

    "tasks": [
        {
            "name": "lemma",
            "target": true,
            "context": "sentence",
            "level": "char",
            "decoder": "attentional",
            "settings": {
                "bos": true,
                "eos": true,
                "lower": true,
                "target": "lemma"
            },
            "layer": -1
        },
        {
            "name": "pos",
            "target": false,
            "context": "sentence",
            "level": "char",
            "decoder": "attentional",
            "settings": {
                "bos": true,
                "eos": true,
                "lower": true,
                "target": "pos"
            },
            "layer": -1
        }
    ],

I'm guessing the 'pos' task should have been defined at the token level, not the char level? And I suppose you should then use a linear or crf decoder instead of the attentional one?

emanjavacas commented 11 months ago

Yup, don't use the "attentional" decoder for POS; use "linear", as shown in the example config for PoS: https://github.com/emanjavacas/pie/blob/master/configs/pos.json#L12
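
Concretely, the pos task block would then presumably look something like this (a sketch along the lines of that example config, not copied from it):

    {
        "name": "pos",
        "target": false,
        "context": "sentence",
        "level": "token",
        "decoder": "linear",
        "layer": -1
    }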
