CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

WIP: single embeddings matrix design #140

Closed Adamits closed 2 months ago

Adamits commented 12 months ago

Throwing this WIP up to store all vocabs in a single embeddings matrix shared between source, target, and features. This will fix the current pointer-generator issues when we have disjoint vocabularies. I will get back to implementing it later this week.
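
A rough sketch of the intended design (illustrative only; the class name and special symbols below are placeholders, not the final API):

class SharedIndex:
    """One symbol-to-index mapping covering source, target, and features,
    so a single embeddings matrix can serve all three."""

    def __init__(self, source, target, features,
                 specials=("<pad>", "<unk>", "<s>", "</s>")):
        symbols = list(specials)
        for symbol in list(source) + list(target) + list(features):
            if symbol not in symbols:
                symbols.append(symbol)
        self._symbol2idx = {symbol: i for i, symbol in enumerate(symbols)}
        self._idx2symbol = symbols

    def __call__(self, symbol):
        return self._symbol2idx[symbol]

    def symbol(self, index):
        return self._idx2symbol[index]

    def __len__(self):
        return len(self._idx2symbol)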

Closes #156.

TODO

kylebgorman commented 12 months ago

I created issue #142 to document the plan here.

Adamits commented 5 months ago

I just rebased a ton of commits and want to make sure I didn't break anything.

EDIT: Hmm, I did not realize that all rebased commits would appear as changes in this PR. I have not resolved complicated git merges in a long time and thought rebasing was the right move, but I don't like how this looks...

Should I scrap this and merge master into my branch instead?

EDIT 2: I have done so; I think merging is better here.

kylebgorman commented 5 months ago

I'm not a merge-vs.-rebase purist, and the way the PR looks when I review it is plenty clear as is.

Adamits commented 5 months ago

Yeah sounds good.

vocab_map would be better just called index and we should throw out the SymbolMap class.

Does this mean we would call it like index.index (consider that in dataset.py we currently have e.g. self.index.vocab_map)?

kylebgorman commented 5 months ago

Does this mean we would call it like index.index (consider that in dataset.py we currently have e.g. self.index.vocab_map)?

You could make it so self.index can be called that way, i.e., by percolating that ability upward...if I'm understanding correctly.

Adamits commented 5 months ago

Ok, I was working on replacing the indexing with one method as you suggested, but there is an issue here:

Our output space should be target_vocab_size. We could set it to the full vocab_size (source + target + features), but that seems theoretically odd: in many cases most of the output space would be invalid, yet we would still spill probability mass into it.

If we set it to a subset of the vocab (target_vocab_size), then we need to somehow map these output indices back to our vocabulary. Right now I am thinking about how this impacts turning ints back into symbols, but I think it also impacts the decoder modules, which need to look up the target in the shared embedding matrix.

Adamits commented 5 months ago

I think I can make this work with some tricky offsets to map target indices drawn from {0, ..., target_size} to the embedding matrix indices of {0, ..., source+target+features size}. However, this will require providing this info to the model so we can offset predictions when decoding. This seems a bit awkward to me.

kylebgorman commented 5 months ago

Why not just make it the case that the target vocab is the first or last N elements of the vocabulary and then keep track of what N is? Then, initialize the output space to N.

Alternatively you can keep track of the target vocab separately from the omnibus/shared vocab.
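
Concretely, something like this sketch (made-up sizes, not actual yoyodyne code):

import torch.nn as nn

# Lay out the shared vocabulary so that the target vocabulary is the first
# N elements. Then an output index in {0, ..., N - 1} is already a valid row
# of the shared embedding matrix, and no offset arithmetic is needed when
# decoding.
target_vocab_size = 40    # N
shared_vocab_size = 100   # source + target + features (+ special symbols)
embedding_size = 128
hidden_size = 512

embeddings = nn.Embedding(shared_vocab_size, embedding_size)
output_projection = nn.Linear(hidden_size, target_vocab_size)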

Adamits commented 5 months ago

Still probably some clunkiness here, and I need to fix formatting. But I wanted to put this up in case you want to take a look:

I have tested the attentive LSTM and pointer-generator LSTM, and it solves the indexing issue Michael has. When I come back to this tomorrow I also want to make --tie_embeddings the default (i.e., it defaults to true). Though I guess adding a flag like --no_tie_embeddings is a bit awkward?

kylebgorman commented 5 months ago

I have tested the attentive LSTM and pointer-generator LSTM, and it solves the indexing issue Michael has. When I come back to this tomorrow I also want to make --tie_embeddings the default (i.e., it defaults to true). Though I guess adding a flag like --no_tie_embeddings is a bit awkward?

Look what we did for --log_wandb and --no_log_wandb here.

kylebgorman commented 5 months ago

Look what we did for --log_wandb and --no_log_wandb here.

Never mind, that's not relevant. I was looking for a flag that defaults to True. Here's a better example.
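
For reference, one standard argparse pattern for a flag that defaults to True looks roughly like this (a sketch, not necessarily the linked example):

import argparse

parser = argparse.ArgumentParser()
# --tie_embeddings is on by default; --no_tie_embeddings switches it off.
parser.add_argument("--tie_embeddings", action="store_true", default=True)
parser.add_argument(
    "--no_tie_embeddings", action="store_false", dest="tie_embeddings"
)
args = parser.parse_args([])
assert args.tie_embeddings  # True unless --no_tie_embeddings is passed.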

Adamits commented 5 months ago

I have cleaned this up and tested it. Everything works except Transducer right now, which seems to be because of the target_vocab_size manipulation that happens in the model here https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/transducer.py#L43.

I think it's because target_vocab_size gets updated AFTER we initialize the single embedding matrix. This can cause index out of bounds errors. I can't work on this until tonight or tomorrow. @bonham79, in the meantime, if you have any intuition about a good fix for this please take a look!

Adamits commented 5 months ago

And, as usual, black is failing the test but passing locally. Sorry I can never remember what the culprit is :(

kylebgorman commented 5 months ago

And, as usual, black is failing the test but passing locally. Sorry I can never remember what the culprit is :(

pip install -r requirements.txt ought to set you to the exact version to use. (They have been making changes to the generated format...and I recently upgraded it because there was a security issue in a dependency.)

Adamits commented 5 months ago

pip install -r requirements.txt ought to set you to the exact version to use. (They have been making changes to the generated format...and I recently upgraded it because there was a security issue in a dependency.)

Ah thanks, looks like that worked.

Adamits commented 5 months ago

README should also be updated to reflect.

Ok, will do. I think just --tied_embeddings/--no_... right?

Would you say this simplifies the "symbol math" we have to go through, or is it worse?

Can you be more specific here? What is the symbol math?

It looks to me like you deleted the two embedding initialization static methods in the base model class, and then didn't replace them anywhere else.

I moved them from the base module to the base model to follow the design pattern we discussed. I think the two base.py modeling files made it look confusing :D.

kylebgorman commented 5 months ago

Ok, will do. I think just --tied_embeddings/--no_... right?

Sure. I might create a section before the section on reserved symbols, call it Tied embeddings, and then say that by default we share embeddings between the source and target, but one can disable this with --no_tied_embeddings. That seems quite pithy.

Then you should also maybe mention the new class of reserved symbols in the following section: {...}.

Would you say this simplifies the "symbol math" we have to go through, or is it worse? Can you be more specific here?

Is the way we build the index simpler (ignoring the additional complexity from tying the embeddings), more complicated, or about the same?

I moved them from the base module to the base model to follow the design pattern we discussed. I think the two base.py modeling files made it look confusing :D.

Roger that.

kylebgorman commented 5 months ago

It's not in your PR description but this will close #142 and I think this will close #156 too.

Adamits commented 5 months ago

I think I have cleaned this up per your comments.

Is the way we build the index simpler (ignoring the additional complexity from tying the embeddings), more complicated, or about the same?

I think we can argue that it is simpler since there is less to fix up after the index is created. However, this does introduce two new sources of complexity:

  1. Source tokens can now be wrapped in {...}---I cannot imagine a scenario where this impacts a user in a noticeable way beyond debugging, but it is worth noting that this type of implicit behavior could be confusing.
  2. We now need to track both the index subspace for the features vocabulary (which we did before) and the index subspace for the target vocabulary.

kylebgorman commented 5 months ago

It looks like we need to test if there are features, since I'm getting a breakage if there's not. I'll just modify to reflect...

Adamits commented 5 months ago

It looks like we need to test if there are features, since I'm getting a breakage if there's not. I'll just modify to reflect...

Oh shoot, thank you for testing.

kylebgorman commented 5 months ago

Okay, I made a change to indexes.py so it no longer crashes if there are no features. This does seem to work on my Polish tests, but validation accuracy is quite poor (though non-zero) on Maori, for which we have no features. The best checkpoint gives me only 11% accuracy. Can you take a look at this (after my commit 38311f530438b1154f75e83102ef3a13f134c613) and see if you can replicate? LMK if you need the command.

kylebgorman commented 5 months ago

Enabling this may also be resulting in slightly poorer performance on Polish with a pointer-generator LSTM, just from eyeballing...

Adamits commented 5 months ago

Enabling this may also be resulting in slightly poorer performance on Polish with a pointer-generator LSTM, just from eyeballing...

Ok, I will do more testing. I also just remembered a potential issue: this removes any mechanism to get a pointer-generator without shared embeddings, which was previously the default. That is because our code requires the same symbol on the source or target side to share the same index. But now, if we want this behavior, then they also must point to the same embedding. To support this, I'd need to change the implementation of how we combine the pointer and generator distributions.
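
(For context, the combination step is roughly the following sketch, with made-up shapes; the copy distribution is scattered by source symbol index, which is why the source and target have to share the same index, and now also the same embedding rows:)

import torch

batch_size, source_length, vocab_size = 2, 5, 30
p_gen = torch.rand(batch_size, 1)  # probability of generating vs. copying
gen_dist = torch.softmax(torch.randn(batch_size, vocab_size), dim=-1)
attention = torch.softmax(torch.randn(batch_size, source_length), dim=-1)
source_indices = torch.randint(0, vocab_size, (batch_size, source_length))

# Scatter attention mass onto the vocabulary positions of the source symbols;
# this only lines up if source ids live in the same index space as the output.
copy_dist = torch.zeros(batch_size, vocab_size).scatter_add(
    1, source_indices, attention
)
final_dist = p_gen * gen_dist + (1.0 - p_gen) * copy_dist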

kylebgorman commented 5 months ago

Ok, I will do more testing. I also just remembered a potential issue: this removes any mechanism to get a pointer-generator without shared embeddings, which was previously the default. That is because our code requires the same symbol on the source or target side to share the same index. But now, if we want this behavior, then they also must point to the same embedding. To support this, I'd need to change the implementation of how we combine the pointer and generator distributions.

Yes, to confirm: by default the embedding is shared, and that ought to just work with pointer-generator architectures (which require that the embedding be shared during training). I don't see the problem, since you did that already. That said, it may not be working as intended with pointer-generators. (I get basically the same results with or without tied embeddings in my LSTM experiments, though.)

Adamits commented 5 months ago

I trained on the Polish abstractness data with features (pointer_generator_lstm, lstm encoder, linear features encoder, hidden size 512, 1 layer, embedding size 128).

Here are the first 3 val accs with the master branch:

0.7530120611190796
0.8885542154312134
0.9036144614219666

And with this PR code:

0.8177710771560669
0.8885542154312134
0.9066265225410461

There will be variance since embedding initialization is a function of the embedding matrix size.

Adamits commented 5 months ago

Ok, I will do more testing. I also just remembered a potential issue: this removes any mechanism to get a pointer-generator without shared embeddings, which was previously the default. That is because our code requires the same symbol on the source or target side to share the same index. But now, if we want this behavior, then they also must point to the same embedding. To support this, I'd need to change the implementation of how we combine the pointer and generator distributions.

Yes, to confirm: by default the embedding is shared, and that ought to just work with pointer-generator architectures (which require that the embedding be shared during training). I don't see the problem, since you did that already. That said, it may not be working as intended with pointer-generators. (I get basically the same results with or without tied embeddings in my LSTM experiments, though.)

To be clear, we did not previously default to sharing embeddings; we just ensured characters had the same index, but those indices pointed to separate matrices. I think this is fine. I just mean that we no longer support disjoint embeddings for pointer-generators, since they require the src/tgt vocab index to be the same, but currently disjoint embeddings by necessity entail separate src/tgt vocab indices.

kylebgorman commented 5 months ago

I trained on the Polish abstractness data with features (pointer_generator_lstm, lstm encoder, linear features encoder, hidden size 512, 1 layer, embedding size 128).

Here are the first 3 val accs with the master branch:

0.7530120611190796
0.8885542154312134
0.9036144614219666

And with this PR code:

0.8177710771560669
0.8885542154312134
0.9066265225410461

There will be variance since embedding initialization is a function of the embedding matrix size.

I am getting similar results with that setup. Okay, false alarm on my part I guess. I may have installed from the wrong commit. Shall I merge?

kylebgorman commented 5 months ago

To be clear, we did not previously default to sharing embeddings; we just ensured characters had the same index, but those indices pointed to separate matrices. I think this is fine. I just mean that we no longer support disjoint embeddings for pointer-generators, since they require the src/tgt vocab index to be the same, but currently disjoint embeddings by necessity entail separate src/tgt vocab indices.

Okay, so we now require shared embeddings for pointer-generators. Do we have to? Could we allow disjoint embeddings if the indexes match up?

kylebgorman commented 5 months ago

Sorry to keep jumping around with this, but I let this PR's version keep training on Polish and it went nan around 20 epochs. See if you can replicate?

readonly ARCH=pointer_generator_lstm
# TODO: Set according to language.
readonly LANGUAGE=pol
readonly MODEL_DIR="../models/pointer_generator_bilstm"

readonly TRAIN="${DATA}/${LANGUAGE}/${LANGUAGE}_train.tsv"
cat "${DATA}/${LANGUAGE}/${LANGUAGE}_"[0-7]".tsv" > "${TRAIN}"
trap "rm -f ${TRAIN}" EXIT
readonly VAL="${DATA}/${LANGUAGE}/${LANGUAGE}_8.tsv"

yoyodyne-train \
    --experiment "${LANGUAGE}" \
    --train "${TRAIN}" \
    --val "${VAL}" \
    --model_dir "${MODEL_DIR}" \
    --features_col 3 \
    --arch "${ARCH}" \
    --source_encoder lstm \
    --features_encoder linear \
    --embedding_size 128 \
    --hidden_size 512 \
    --learning_rate .001 \
    --optimizer adam \
    --batch_size 32 \
    --max_epochs 60 \
    --dropout .3 \
    --seed 49 \
    --gradient_clip_val 3 \
    --precision 16 \
    --accelerator gpu

Adamits commented 5 months ago

Sorry to keep jumping around with this, but I let this PR's version keep training on Polish and it went nan around 20 epochs. See if you can replicate?

On my Intel Mac (CPU), the master branch ran for ~37 epochs and did not go nan. I am running this branch now, and so far after 25 epochs we have loss=0.015, val_loss=0.0968, val_accuracy=0.920. I am very curious what it could be. Could precision be handled differently on our devices?

Adamits commented 5 months ago

To be clear, we did not previously default to sharing embeddings; we just ensured characters had the same index, but those indices pointed to separate matrices. I think this is fine. I just mean that we no longer support disjoint embeddings for pointer-generators, since they require the src/tgt vocab index to be the same, but currently disjoint embeddings by necessity entail separate src/tgt vocab indices.

Okay, so we now require shared embeddings for pointer-generators. Do we have to? Could we allow disjoint embeddings if the indexes match up?

Yes, but simulating the indexes matching up will require tricks at runtime somehow, since in a single embedding matrix we need a unique index per symbol.

kylebgorman commented 5 months ago

Sorry to keep jumping around with this, but I let this PR's version keep training on Polish and it went nan around 20 epochs. See if you can replicate?

On my Intel Mac (CPU), the master branch ran for ~37 epochs and did not go nan. I am running this branch now, and so far after 25 epochs we have loss=0.015, val_loss=0.0968, val_accuracy=0.920. I am very curious what it could be. Could precision be handled differently on our devices?

That's my hypothesis, then. Okay. Let's not worry about that issue for now, given that it's not replicated on the second machine we tried...we can wait to see if it is repeatedly reported.

kylebgorman commented 5 months ago

Yes, but simulating the indexes matching up will require tricks at runtime somehow, since in a single embedding matrix we need a unique index per symbol.

So I guess we could permit non-tying for pointer-generator encoders using said tricks. Is that a feature you want to consider in the future or no?

kylebgorman commented 5 months ago

Also, should I merge this now?

Adamits commented 5 months ago

Yes, but simulating the indexes matching up will require tricks at runtime somehow, since in a single embedding matrix we need a unique index per symbol.

So I guess we could permit non-tying for pointer-generator encoders using said tricks. Is that a feature you want to consider in the future or no?

Given that these results look good, probably not, but it is worth remembering just because one may want to do that for reproducibility reasons.

Adamits commented 5 months ago

Also, should I merge this now?

Maybe I will take the time today to run a fuller test on a handful more architectures first. Sound ok?

kylebgorman commented 5 months ago

Maybe I will take the time today to run a fuller test on a handful more architectures first. Sound ok?

Yeah do it, then if you don't have any contrary indications at EOB we'll merge.

Adamits commented 5 months ago

Another thought as I start running local tests: should we log something about the value of tied_embeddings and what it means? Or is relying on normal arg logging + the README good enough?

kylebgorman commented 5 months ago

Another thought as I start running local tests: should we log something about the value of tied_embeddings and what it means? Or is relying on normal arg logging + the README good enough?

I don't think this is more important than anything else we log; just showing people what args are being passed (including the default ones) should be enough.

Adamits commented 5 months ago

Running the Transducer I get: IndexError: index out of range in self, which looks like it points to

File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/modules/base.py", line 85, in embed
    embedded = self.embeddings(symbols)
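
For reference, here is a standalone snippet (made-up sizes) that reproduces this class of error; the lookup fails as soon as any symbol id is >= the size of the embedding table:

import torch
import torch.nn as nn

embeddings = nn.Embedding(10, 4)      # table sized for ids 0..9
symbols = torch.tensor([[1, 2, 12]])  # 12 is out of range
print(symbols.max().item(), embeddings.num_embeddings)  # handy sanity check
embedded = embeddings(symbols)        # IndexError: index out of range in self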

kylebgorman commented 5 months ago

Running the Transducer I get: IndexError: index out of range in self, which looks like it points to

File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/modules/base.py", line 85, in embed
    embedded = self.embeddings(symbols)

I can replicate. I am not sure if it was a regression associated with this PR or an earlier regression.

Adamits commented 5 months ago

Running the Transducer I get: IndexError: index out of range in self, which looks like it points to

File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/modules/base.py", line 85, in embed
    embedded = self.embeddings(symbols)

I can replicate. I am not sure if it was a regression associated with this PR or an earlier regression.

Ok I will take a closer look

Adamits commented 5 months ago

I think it has to do with how the action vocabulary is created. I remember @bonham79 had to figure out some tricks here that I don't recall, so I probably need to update it to play nice with the new vocabulary.

EDIT: I might need some tips from @bonham79 -- I cannot make sense right now of how the action vocab is built.

bonham79 commented 4 months ago

@Adamits Sorry, notification for this fell into spam. Do you still need info for the action vocabulary?

Adamits commented 4 months ago

@Adamits Sorry, notification for this fell into spam. Do you still need info for the action vocabulary?

Hey, I forgot about this PR actually. Yeah, I could use some help. This PR changes the vocab s.t. we store source and target vocab in a single embedding matrix---if we want separate embeddings, then the source vocab items are just wrapped in {...}. I was getting an index error when running the transducer that I think is caused by whatever happens after the initial vocabulary is built, where the transducer expands the action vocab.

Right now I cannot remember the details of what I had found so far, so maybe if you recall what the transducer does there and could give a high-level overview, I could jump back into this?

bonham79 commented 4 months ago

@Adamits Sure, so technically the transducer never uses the target vocab. The expert preprocesses the target vocab from the training data and creates a bunch of edit actions (straight Levenshtein edits, basically) and passes those to the transducer as a prediction vocabulary. The target vocab only comes back into play once the actions are processed. So your fix would either:

a) let the transducer do its own thing with the target actions, or b) expand the source + target embedding space to include the action embeddings.

The change occurs here: https://github.com/CUNY-CL/yoyodyne/blob/c0640676adc8e25b3e41b5f7c574e7e115381a03/yoyodyne/models/transducer.py#L43. I just pass the expert's list of edit actions to the target vocab and then initialize based off that.
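
For concreteness, option (b) amounts to something like this (hypothetical sizes; the real action inventory comes from the expert):

import torch.nn as nn

shared_vocab_size = 100   # source + target + features (+ special symbols)
num_actions = 80          # roughly 2x the observed target vocabulary
embedding_size = 128

# Size the single embedding matrix up front so that the expert's action
# indices, appended after the shared vocabulary, are always in range.
embeddings = nn.Embedding(shared_vocab_size + num_actions, embedding_size)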

Mind sharing the error you were getting?

Adamits commented 4 months ago

@bonham79 Ok, thanks, that is helpful!! I assume I initialized the embedding matrix before this change to target_vocab_size. Anyway, here is the stack:

Traceback (most recent call last):
  File "/Users/adamwiemerslage/anaconda3/envs/py3-9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/adamwiemerslage/anaconda3/envs/py3-9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/train.py", line 399, in <module>
    main()
  File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/train.py", line 394, in main
    best_checkpoint = train(trainer, model, datamodule, args.train_from)
  File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/train.py", line 261, in train
    trainer.fit(model, datamodule, ckpt_path=train_from)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1204, in _run_train
    self._run_sanity_check()
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1276, in _run_sanity_check
    val_loop.run()
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
    output = self._evaluation_step(**kwargs)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/transducer.py", line 530, in validation_step
    predictions, loss = self(batch)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/transducer.py", line 119, in forward
    prediction, loss = self.decode(
  File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/transducer.py", line 196, in decode
    decoder_output = self.decoder(
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/modules/lstm.py", line 118, in forward
    embedded = self.embed(symbol)
  File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/modules/base.py", line 85, in embed
    embedded = self.embeddings(symbols)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

bonham79 commented 3 months ago

@Adamits Looking through your changes, it seems you never extend the transducer's vocab size with the additional target vocab size. The target vocab can't be acquired through the vocab map since it consists of edit actions from the expert. (Essentially, every potential target symbol gets a copy or insert action, so it's roughly 2x the observed vocabulary.)

Adamits commented 3 months ago

@Adamits Looking through your changes, it seems you never extend the transducer's vocab size with the additional target vocab size. The target vocab can't be acquired through the vocab map since it consists of edit actions from the expert. (Essentially, every potential target symbol gets a copy or insert action, so it's roughly 2x the observed vocabulary.)

Right. Is there a reason why the action vocabulary loops through the dataset instead of just getting a handle on the vocabulary? I think doing that would just fix this.
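
(Hypothetical sketch of what I mean, with made-up action names: derive the action inventory from the target vocabulary itself rather than re-scanning the dataset.)

def make_action_vocabulary(target_vocabulary):
    # One copy action and one insert action per target symbol, per the
    # "roughly 2x the observed vocabulary" description above.
    actions = []
    for symbol in sorted(target_vocabulary):
        actions.append(("copy", symbol))
        actions.append(("insert", symbol))
    return actions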