Closed Adamits closed 2 months ago
I created issue #142 to document the plan here.
I just rebased a ton of commits and want to make sure I didn't break anything.
EDIT: Hmmm I did not realize that all rebased commits would appear as changes in this PR> I have not resolved complicated git merges in a long time and thought rebasing was the right move, but I don't like how this looks...
Should I scrap this and merge master into my branch instead?
Edit Edit: I have done so, I think merging is better here.
I'm not a merge vs. rebase N*zi but the way the PR looks when I review is plenty clear as is.
Yeah sounds good.
vocab_map would be better just called index and we should throw out the SymbolMap class.
Does this mean we would call it like index.index
(consider that in dataset.py wew currently have e.g. self.index.vocab_map
)?
Does this mean we would call it like
index.index
(consider that in dataset.py wew currently have e.g.self.index.vocab_map
)?
You could make it so self.index
could so be called, i.e., by perlocating that ability upward...if I'm understanding correctly.
Ok I was working on replacing the indexing with one method as you suggested but there is an issue here:
our output space should be target_vocab_size
. We could set this to the vocab_size (source + target + features), but this seems theoretically odd, since in many cases most of the output space is invalid, but we will still spill probability mass into it.
If we set it to be the a subset of the vocab (target_vocab_size
), then we need to somehow map these output indices back to our vocabulary. Right now I was thinking about how this impacts turning ints back into symbols, but I think this also impacts decoder modules, which need to lookup the target in the shared embedding matrix.
I think I can make this work with some tricky offsets to map target indices drawn from {0, ..., target_size} to the embedding matrix indices of {0, ..., source+target+features size}. However, this will require providing this info to the model so we can offset predictions when decoding. This seems a bit awkward to me.
Why not just make it the case that the target vocab are the first or last N elements of the vocabulary and then keep track of what N is? Then, initialize the output space to N.
Alternatively you can keep track of the target vocab separately from the omnibus/shared vocab.
On Tue, Apr 9, 2024 at 6:22 PM Adam @.***> wrote:
I think I can make this work with some tricky offsets to map target indices drawn from {0, ..., target_size} to the embedding matrix indices of {0, ..., source+target+features size}. However, this will require providing this info to the model so we can offset predictions when decoding. This seems a bit awkward to me.
— Reply to this email directly, view it on GitHub https://github.com/CUNY-CL/yoyodyne/pull/140#issuecomment-2046139272, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABG4OKYBRL5XOPVCJ37B5LY4RS4DAVCNFSM6AAAAAA5P4WQOKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBWGEZTSMRXGI . You are receiving this because you commented.Message ID: @.***>
Still probably some clunkiness here, and I need to fix formatting. But I wanted to put this up in case you want to take a look:
I have tested for Attentive LSTM and pg LSTM, and it solves the indexing issue Michael has. When I come back to this tomorrow I also want to set --tie_embeddings
as the default (e.g. defaults to true). Though I guess adding a flag like --no_tie_embeddings
is a bit awkward?
I have tested for Attentive LSTM and pg LSTM, and it solves the indexing issue Michael has. When I come back to this tomorrow I also want to set
--tie_embeddings
as the default (e.g. defaults to true). Though I guess adding a flag like--no_tie_embeddings
is a bit awkward?
Look what we did for --log_wandb
and --no_log_wandb
here.
Look what we did for
--log_wandb
and--no_log_wandb
here.
Nevermind, that's not relevant. I was looking for a flag that defaults to True
. Here's a better example.
I have cleaned this up and tested it. Everything works except Transducer right now, which seems to be because of the target_vocab_size manipulation that happens in the model here https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/transducer.py#L43.
I think its because target_vocab_size
gets updated AFTER we initialize the single embedding matrix. This can cause idnex out of bounds errors. I can't work on this until tonight or tomorrow. @bonham79 in the meantime if you have any intuition about a good fix for this please take a look!
And, as usual, black is failing the test but passing locally. Sorry I can never remember what the culprit is :(
And, as usual, black is failing the test but passing locally. Sorry I can never remember what the culprit is :(
pip install -r requirements.txt
ought to set you to the exact version to use. (They have been making changes to the generated format...and I recently upgraded it because there was a security issue in a dependency.)
pip install -r requirements.txt ought to set you to the exact version to use. (They have been making changes to the generated format...and I recently upgraded it because there was a security issue in a dependency.)
Ah thanks, looks like that worked.
README should also be updated to reflect.
Ok will do. I think just --tied_embedings/--no_...
right?
Would you say this simplifies the "symbol math" we have to go through, or is it worse?
Can you be more specific here? What is the symbol math?
It looks to me like you deleted the two embedding initialization static methods in the base model class, and then didn't replace them anywhere else.
I move them from the base module to the base model to follow the design pattern we discussed. I think the two base.py
modeling files made it look confusing :D.
Ok will do. I think just
--tied_embedings/--no_...
right?
Sure. I might create a section before the section on reserved symbols, call it Tied embeddings
, and then say that by default we share embeddings between the source and target, but one can disable this with --no_tied_embeddings
. That seems quite pithy.
Then you should also maybe mention the new class of reserved symbols in the following section: {...}
.
Would you say this simplifies the "symbol math" we have to go through, or is it worse? Can you be more specific here?
Is the way we build the index simpler (ignoring the additional complexity from tying the embeddings) or more complicated, nor about the same?
I move them from the base module to the base model to follow the design pattern we discussed. I think the two
base.py
modeling files made it look confusing :D.
Roger that.
It's not in your PR description but this will close #142 and I think this will close #156 too.
I think I have cleaned this up per your comments.
Is the way we build the index simpler (ignoring the additional complexity from tying the embeddings) or more complicated, nor about the same?
I think we can argue that it is simpler since there is less to fix up after the index is created. However, this does introduce two new sources of complexity:
{...}
---I cannot imagine a scenario where this impacts a user in a noticeable way beyond debugging, but it is worth noting that this type of implicit behavior could be confusing.It looks like we need to test if there are features, since I'm getting a breakage if there's not. I'll just modify to reflect...
It looks like we need to test if there are features, since I'm getting a breakage if there's not. I'll just modify to reflect...
Oh shoot, thank you for testing.
Okay, I made a change to indexes.py
so it no longer crashes if there are no features. This does seem to work on my Polish tests, but validation accuracy performance is quite poor (though non-zero) on Maori, for which we have no features. The best checkpoint gives me only 11% accuracy. Can you take a look at this (after my commit 38311f530438b1154f75e83102ef3a13f134c613) and see if you can replicate? LMK if you need the command.
Enabling this may also be resulting in slightly poorer performance on Polish with a pointer-generator LSTM, just from eyeballing...
Enabling this may also be resulting in slightly poorer performance on Polish with a pointer-generator LSTM, just from eyeballing...
Ok I will do more testing. I also just remembered a potential issue: this removes any mechanism to get a pointer-generator without shared embeddings, which was previously the default. That is because our code requires the same symbol on the source or target side to share the same index. But now, if we want this behavior, then they also must point to the same embedding. To support this, id need to change the implementation of how we combine the pointer and generator distributions.
Ok I will do more testing. I also just remembered a potential issue: this removes any mechanism to get a pointer-generator without shared embeddings, which was previously the default. That is because our code requires the same symbol on the source or target side to share the same index. But now, if we want this behavior, then they also must point to the same embedding. To support this, id need to change the implementation of how we combine the pointer and generator distributions.
Yes, to confirm by default the embedding is shared and that ought to just work with pointer-generator architectures (which require that the embedding be shared, during training). I don't see the problem, since you did that already. That said, it may not be working as intended with pointer-generators. (I get basically the same results with or without tied embeddings on my LSTM experiments, though.)
I trained on the polish abstractness data with features (pointer_generator_lstm, lstm encoder, linear features encoder, hidden size 512, 1 layer, emb 128).
Here are the first 3 val accs with the master branch:
0.7530120611190796
0.8885542154312134
0.9036144614219666
And with this PR code:
0.8177710771560669
0.8885542154312134
0.9066265225410461
There will be variance since embedding initialization is a function of the embedding matrix size.
Ok I will do more testing. I also just remembered a potential issue: this removes any mechanism to get a pointer-generator without shared embeddings, which was previously the default. That is because our code requires the same symbol on the source or target side to share the same index. But now, if we want this behavior, then they also must point to the same embedding. To support this, id need to change the implementation of how we combine the pointer and generator distributions.
Yes, to confirm by default the embedding is shared and that ought to just work with pointer-generator architectures (which require that the embedding be shared, during training). I don't see the problem, since you did that already. That said, it may not be working as intended with pointer-generators. (I get basically the same results with or without tied embeddings on my LSTM experiments, though.)
To be clear, we did not previously default to sharing embeddings, we just ensured characters had the same index, but those indices pointed to separate matrices. I think this is fine, I just mean that we no longer support disjoint embeddings for pointer generators since they require the src/tgt vocab index to be the same, but currently disjoint embeddings by necessity entail seperate src/tgt vocab indices.
I trained on the polish abstractness data with features (pointer_generator_lstm, lstm encoder, linear features encoder, hidden size 512, 1 layer, emb 128).
Here are the first 3 val accs with the master branch:
0.7530120611190796 0.8885542154312134 0.9036144614219666
And with this PR code:
0.8177710771560669 0.8885542154312134 0.9066265225410461
There will be variance since embedding initialization is a function of the embedding matrix size.
I am getting similar results with that setup. Okay, false alarm on my part I guess. I may have installed from the wrong commit. Shall I merge?
To be clear, we did not previously default to sharing embeddings, we just ensured characters had the same index, but those indices pointed to separate matrices. I think this is fine, I just mean that we no longer support disjoint embeddings for pointer generators since they require the src/tgt vocab index to be the same, but currently disjoint embeddings by necessity entail seperate src/tgt vocab indices.
Okay, so we now require shared embeddings for pointer-generators. Do we have to? Could we allow disjoint embeddings if the indexes match up?
Sorry to keep jumping around with this, but I let this PR's version keep training on Polish and it went nan around 20 epochs. See if you can replicate?
readonly ARCH=pointer_generator_lstm
# TODO: Set according to language.
readonly LANGUAGE=pol
readonly MODEL_DIR="../models/pointer_generator_bilstm"
readonly TRAIN="${DATA}/${LANGUAGE}/${LANGUAGE}_train.tsv"
cat "${DATA}/${LANGUAGE}/${LANGUAGE}_"[0-7]".tsv" > "${TRAIN}"
trap "rm -f ${TRAIN}" EXIT
readonly VAL="${DATA}/${LANGUAGE}/${LANGUAGE}_8.tsv"
yoyodyne-train \
--experiment "${LANGUAGE}" \
--train "${TRAIN}" \
--val "${VAL}" \
--model_dir "${MODEL_DIR}" \
--features_col 3 \
--arch "${ARCH}" \
--source_encoder lstm \
--features_encoder linear \
--embedding_size 128 \
--hidden_size 512 \
--learning_rate .001 \
--optimizer adam \
--batch_size 32 \
--max_epochs 60 \
--dropout .3 \
--seed 49 \
--gradient_clip_val 3 \
--precision 16 \
--accelerator gpu
Sorry to keep jumping around with this, but I let this PR's version keep training on Polish and it went nan around 20 epochs. See if you can replicate?
On my mac intel CPU, master branch ran for ~37 epochs and did not go nan. I am running this branch now and so far after 25 epochs we have loss=0.015, val_loss=0.0968, val_accuracy=0.920
. I am very curious what it could be. Could precision be handled differently on our devices?
To be clear, we did not previously default to sharing embeddings, we just ensured characters had the same index, but those indices pointed to separate matrices. I think this is fine, I just mean that we no longer support disjoint embeddings for pointer generators since they require the src/tgt vocab index to be the same, but currently disjoint embeddings by necessity entail seperate src/tgt vocab indices.
Okay, so we now require shared embeddings for pointer-generators. Do we have to? Could we allow disjoint embeddings if the indexes match up?
Yes, but simulating the indexes matching up will require tricks at runtime somehow, since in a single embedding matrix we need a unique index per symbol.
Sorry to keep jumping around with this, but I let this PR's version keep training on Polish and it went nan around 20 epochs. See if you can replicate?
On my mac intel CPU, master branch ran for ~37 epochs and did not go nan. I am running this branch now and so far after 25 epochs we have
loss=0.015, val_loss=0.0968, val_accuracy=0.920
. I am very curious what it could be. Could precision be handled differently on our devices?
That's my hypothesis, then. Okay. Let's not worry on that issue for now if it's not replicated on the second machine we try...we can wait to see if it is repeatedly reported.
Yes, but simulating the indexes matching up will require tricks at runtime somehow, since in a single embedding matrix we need a unique index per symbol.
So I guess we could permit non-tieing for pointer-generator encoders using said tricks. Is that a feature you want to consider in the future or no?
also should I merge this now?
Yes, but simulating the indexes matching up will require tricks at runtime somehow, since in a single embedding matrix we need a unique index per symbol.
So I guess we could permit non-tieing for pointer-generator encoders using said tricks. Is that a feature you want to consider in the future or no?
Given that these results look good, probably not, but it is worth remembering just because one may want to do that for reproducibility reasons.
also should I merge this now?
Maybe I will take the time today to run a fuller test on a handful more architectures first. Sound ok?
Maybe I will take the time today to run a fuller test on a handful more architectures first. Sound ok?
Yeah do it, then if you don't have any contrary indications at EOB we'll merge.
Another thought as I start running local tests: should we log something about the val for tied_embeddings and what it means? Or is relying on normal arg logging + the README good enough?
Another thought as I start running local tests: should we log something about the val for tied_embeddings and what it means? Or is relying on normal arg logging + the README good enough?
I don't think this is more important than anything else we log by just showing people what args are being passed (including the default ones).
Running the Transducer I get: IndexError: index out of range in self
, which looks like it points to
File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/modules/base.py", line 85, in embed
embedded = self.embeddings(symbols)
Running the Transducer I get:
IndexError: index out of range in self
, which looks like it points toFile "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/modules/base.py", line 85, in embed embedded = self.embeddings(symbols)
I can replicate. I am not sure if it was a regression associated with this PR or an earlier regression.
Running the Transducer I get:
IndexError: index out of range in self
, which looks like it points toFile "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/modules/base.py", line 85, in embed embedded = self.embeddings(symbols)
I can replicate. I am not sure if it was a regression associated with this PR or an earlier regression.
Ok I will take a closer look
I think it has to do with how the action vocabulary is created. I remember @bonham79 had to figure out some tricks here that I don't recall, so I probably need to update it to play nice with the new vocabulary.
EDIT: I might need some tips from @bonham79 -- I cannot make sense right now of how the action vocab is built.
@Adamits Sorry, notification for this fell into spam. Do you still need info for the action vocabulary?
@Adamits Sorry, notification for this fell into spam. Do you still need info for the action vocabulary?
Hey, I forgot about this PR actually. Yeah I could use some help. This PR changes the vocab s.t. we store source and target vocab in a single embedding matrix---if we want separate embeddings, then the source vocab items are just wrapped in {...}
. I was getting an index error when running the transducer that I think is caused by w/e happens after the initial vocabulary is built, where the transducer expands the action vocab.
Right now I cannot remember the details of what I had found so far, so maybe if you recll what the transducer does there and could give a high level overview I could jump back into this?
@Adamits Sure, so technically the transducer never uses the target vocab. The expert preprocesses the target vocab from train and creates a bunch of edit actions
(like, straight Levenshtein edits) and passes that to the transducer as a prediction vocabulary. The target vocab only comes back into play once the actions are processed. So your fix would either
a). let the transducer do its own thing with the target actions or b). expand the source + target embedding space to include the action embeddings.
The change occurs here: https://github.com/CUNY-CL/yoyodyne/blob/c0640676adc8e25b3e41b5f7c574e7e115381a03/yoyodyne/models/transducer.py#L43 I just pass the expert's list of edit actions to the target vocab then initialize based off that.
Mind sharing the error you were getting?
@bonham79 Ok thanks that is helpful!! I assume I initialized the embedding matrix before this change to target_vocab size. Anyway, here is the stack:
Traceback (most recent call last):
File "/Users/adamwiemerslage/anaconda3/envs/py3-9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/adamwiemerslage/anaconda3/envs/py3-9/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/train.py", line 399, in <module>
main()
File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/train.py", line 394, in main
best_checkpoint = train(trainer, model, datamodule, args.train_from)
File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/train.py", line 261, in train
trainer.fit(model, datamodule, ckpt_path=train_from)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1204, in _run_train
self._run_sanity_check()
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1276, in _run_sanity_check
val_loop.run()
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
output = self._evaluation_step(**kwargs)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/transducer.py", line 530, in validation_step
predictions, loss = self(batch)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/transducer.py", line 119, in forward
prediction, loss = self.decode(
File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/transducer.py", line 196, in decode
decoder_output = self.decoder(
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/modules/lstm.py", line 118, in forward
embedded = self.embed(symbol)
File "/Users/adamwiemerslage/nlp-projects/morphology/yoyodyne/yoyodyne/models/modules/base.py", line 85, in embed
embedded = self.embeddings(symbols)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/Users/adamwiemerslage/python-envs/yoyodyne/lib/python3.9/site-packages/torch/nn/functional.py", line 2233, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
@Adamits Looking through your changes, it seems you never extend the transducer's vocab size with the additional target vocab size. The target vocab can't be acquired through the vocab map since it's edit actions from the expert. (It's like, every potential target vocab has a copy or insert action, so it's 2x the observed vocabulary.)
@Adamits Looking through your changes, it seems you never extend the transducer's vocab size with the additional target vocab size. The target vocab can't be acquired through the vocab map since it's edit actions from the expert. (It's like, every potential target vocab has a copy or insert action, so it's 2x the observed vocabulary.)
Right, is there a reason why the action vocabulary loops through the dataset instead of just getting a handle on the vocabulary? I think doing that would just fix this.
Throwing this WIP up to store all vocabs in a single embeddings matrix shared between source, target, and features. This will fix the current pointer-generator issues when we have disjoint vocabularies. I will get back to implementing it later this week.
vocab_map
target_vocabulary
andsource_vocabulary
for logging.Closes #156.
TODO