CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Should we be making the vocab index from all paths? #160

Closed: Adamits closed this 4 months ago

Adamits commented 4 months ago

See here: https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/data/datamodules.py#L73

Consider the case where we want to test how an OOV feature degrades the model. If that feature appears in the provided dev/test set, it will end up in the vocabulary. This does not seem like a big deal, but it is a bit unintuitive to me. On the other hand, a good assumption is that all vocab items/features in dev should already exist in train, in which case, why do we need to add dev to the index at all?
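For concreteness, here is a minimal sketch of the two policies; the helper names are illustrative, not yoyodyne's actual API:

```python
# Illustrative sketch only; not yoyodyne's actual index-building code.

UNK = "<UNK>"


def index_from_train_only(train_symbols):
    """Builds the index from training data alone; OOVs map to UNK later."""
    vocab = [UNK] + sorted(set(train_symbols))
    return {sym: i for i, sym in enumerate(vocab)}


def index_from_all_paths(train_symbols, dev_symbols, test_symbols):
    """Builds the index from every provided path (the behavior under
    discussion), so dev/test-only symbols receive their own IDs."""
    symbols = set(train_symbols) | set(dev_symbols) | set(test_symbols)
    vocab = [UNK] + sorted(symbols)
    return {sym: i for i, sym in enumerate(vocab)}
```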

Just wanted to open this to see if there is any justification for that.

kylebgorman commented 4 months ago

This means you can run the same index-building code whether you're training or predicting, without having to think about it, and it also means you can do training and prediction in the same routine without rebuilding the dataclasses.

Adamits commented 4 months ago

I am not sure I follow. In what case could you not have the same index-building code? (By default, I believe our code replaces OOVs with a special "UNK" token.) That is, wouldn't we always expect the train data alone to build the vocabulary used for prediction?
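The UNK fallback I mean amounts to a defaulting lookup at encoding time; a rough sketch, with made-up names rather than yoyodyne's:

```python
def encode(symbols, index, unk_id=0):
    # Any symbol absent from the (training-built) index collapses onto
    # the single UNK ID instead of receiving its own entry.
    return [index.get(sym, unk_id) for sym in symbols]


index = {"<UNK>": 0, "a": 1, "b": 2}
print(encode("abz", index))  # [1, 2, 0]: "z" is OOV and maps to UNK.
```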

kylebgorman commented 4 months ago

I am not following why you'd want to change the current behavior either.

Adamits commented 4 months ago

I think it is reasonable to expect the model to store only the vocabulary it has seen during training. Suppose I have 2 characters in the dev set that are not in the train set. I would typically expect my model to, e.g., replace both with the same UNK embedding, but in our case we will always initialize a unique vocabulary item for each of them.

This is not really important, since we do not train the UNK token by default; it is more that this does not seem like typical practice.
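Concretely, the index size sets the number of embedding rows, so the two policies treat dev-only characters differently; a sketch in PyTorch with made-up vocabulary sizes:

```python
from torch import nn

# Say train contributes 50 symbols plus UNK, and dev adds 2 unseen characters.
train_only_size, all_paths_size = 51, 53

# Train-only index: both dev-only characters share UNK's single row.
shared_unk = nn.Embedding(train_only_size, 128)

# All-paths index: each dev-only character gets its own randomly
# initialized (and, if UNK is never trained, never-updated) row.
per_symbol = nn.Embedding(all_paths_size, 128)
```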

kylebgorman commented 4 months ago

I can imagine a setting where we auto-unk symbols below a frequency threshold, but we don't have that yet. (Actually, I think F*rsq has that feature...)
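A rough sketch of what such thresholding could look like; the min_freq parameter and helper are hypothetical:

```python
from collections import Counter

UNK = "<UNK>"


def index_with_min_freq(train_symbols, min_freq=2):
    """Hypothetical auto-unking: symbols seen fewer than min_freq times
    in training are excluded and fall back to UNK at lookup time."""
    counts = Counter(train_symbols)
    vocab = [UNK] + sorted(s for s, c in counts.items() if c >= min_freq)
    return {sym: i for i, sym in enumerate(vocab)}
```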

Adamits commented 4 months ago

For example, I was trying to use yoyodyne to reproduce someone's result from a different codebase. I noticed that my vocab_size was different from theirs and was not initially sure why.

Of course, many things make reproduction hard (e.g., the GPU kernel, the order in which things are initialized, whether you query the dataloader before doing any other random operations, etc.).

Adamits commented 4 months ago

Gonna close this and open a low-priority feature request about auto-unking.

Adamits commented 4 months ago

This has been implemented in #163