CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Dataloader hot fix #163

Closed bonham79 closed 4 months ago

bonham79 commented 4 months ago

Found the reason for https://github.com/CUNY-CL/yoyodyne/issues/157; this is a necessary change for https://github.com/CUNY-CL/yoyodyne/issues/161.

Fixes for transducer.

  1. The transducer takes an argument in training that's unnecessary. Removing that.
  2. The transducer will throw an error if passed an unknown action. Because of the current datamodule setup, we encode vocabulary from both the dev and training data. As such, the module is given the impression that there are more potential edit actions than the expert is provided during SED training. This causes transducer failure. Switching the datamodule setup to only instantiate the index from the training set (for reasons argued profusely in the thread); see the sketch below.
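
A minimal sketch of the intended behavior (hypothetical helper names, not the actual yoyodyne index API): the symbol inventory is built from the training split only, and anything unseen later encodes as UNK.

    from collections import Counter

    SPECIALS = ["<PAD>", "<UNK>", "<BOS>", "<EOS>"]  # hypothetical special symbols

    def build_vocab(train_targets):
        """Collects target symbols from the training split only."""
        counts = Counter(sym for target in train_targets for sym in target)
        return {sym: i for i, sym in enumerate(SPECIALS + sorted(counts))}

    def encode(symbols, vocab):
        """Encodes a sequence; dev/test-only symbols fall back to UNK."""
        unk = vocab["<UNK>"]
        return [vocab.get(sym, unk) for sym in symbols]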

For redundancy, I also added a logger for when actions outside training set are passed to transducer expert. This really should cause a failure, but it's rather annoying to have your model die midway through an epoch when the only issue was a single OOV. This way just bothers you with hundreds of logs.

Note for change number 2: this allows forward progress on https://github.com/CUNY-CL/yoyodyne/issues/161, as the lack of UNK tokens in dev meant that training UNK tokens would not be evaluated until test time.

kylebgorman commented 4 months ago

I still don't understand why this is wrong, or how it could affect results during validation (the token embedding isn't trained either way). I did this the way it is because this also eliminates the need to load and reload the index if you're doing both training and prediction in a single process---and it's what the datamodule is supposed to be like too. Maybe you or @Adamits can explain it to me again?

I'm fine with the transducer fix of course.

bonham79 commented 4 months ago

Few reasons:

  1. Formally, you want your model to have little (as in none) input from the dev set. This is nitpicky here but given that our main use is academic and students are starting to use this library, we want to stick as close to the formal as possible.

  2. In certain cases, the intrusion of dev vocabulary is not simply a formal concern. The reason for the transducer bug is that the expert would explore the dev vocabulary while instructing the base model. For future additions to the repo, this leaves the vocabulary as an open liability. Any form of string transduction that allows exploration of potential character transduction paths - which is rather prominent in the literature - will encounter issues.

  3. It technically complicates the learning process, as now you are introducing characters into the classifier that are duds. While they won't be seen as targets in training, they will be given near-zero weights over the softmax function. It also expands the number of terms over which the softmax normalizes. For a few characters this is okayish, but if there are glaring disjunctions between dev and train, this is going to affect training. More nitpicky, but this is a library for low-resource tasks, so I don't think wasting training FLOPs is good design.

  4. It makes the target vocabulary inconsistent. Any model trying to use the target vocabulary has to remember that the target vocabulary is inclusive of all splits. So any desire to limit to just train vocabulary requires a few extra steps for every model. Meanwhile, if you can assume just train vocab in index, you will never have a case where you need to add extra steps (there should be no reason why your model has to add functionality for dev).

  5. It's redundant. This is why the UNK token exists.

  6. Makes error analysis a pain. Since we're including all split vocab, the UNK predictions of the dev, predict, or test scripts will be a pain, since all the characters outside training get their own, roughly equivalent probabilities. If we want to do https://github.com/CUNY-CL/yoyodyne/issues/161 and have reliable results, we need to restrict OOV. That's done by only using train vocab.

tl;dr: in most cases it won't cause issues, but when it does, they'll be painful. If we do it the other way we'll never have issues.

kylebgorman commented 4 months ago
  1. Formally, you want your model to have little (as in none) input from the dev set. This is nitpicky here but given that our main use is academic and students are starting to use this library, we want to stick as close to the formal as possible.

There is no relevant sense in which the model is therefore "seeing the dev set". I'm sorry but that's irrelevant.

  2. In certain cases, the intrusion of dev vocabulary is not simply a formal concern. The reason for the transducer bug is that the expert would explore the dev vocabulary while instructing the base model. For future additions to the repo, this leaves the vocabulary as an open liability. Any form of string transduction that allows exploration of potential character transduction paths - which is rather prominent in the literature - will encounter issues.

Okay, that could be relevant. Can you show us a toy case where that's happening? (I would like to see exactly what "move" or "rule" it's seeing.) My intuition is that that should be impossible because the EM procedure, which only sees train, should never be able to reinforce them.

  3. It technically complicates the learning process, as now you are introducing characters into the classifier that are duds. While they won't be seen as targets in training, they will be given near-zero weights over the softmax function. It also expands the number of terms over which the softmax normalizes. For a few characters this is okayish, but if there are glaring disjunctions between dev and train, this is going to affect training. More nitpicky, but this is a library for low-resource tasks, so I don't think wasting training FLOPs is good design.

That may be an argument here, but a data set which had this glaring disjunction would be incoherent anyways. I'd like to see a simple profiling result on a non-contrived example. Our reasoning about flops in a GPU setup like this is probably not great.

  4. It makes the target vocabulary inconsistent. Any model trying to use the target vocabulary has to remember that the target vocabulary is inclusive of all splits. So any desire to limit to just train vocabulary requires a few extra steps for every model. Meanwhile, if you can assume just train vocab in index, you will never have a case where you need to add extra steps (there should be no reason why your model has to add functionality for dev).

We used to have a lot of code to ensure the vocabulary was consistent on reload. I found that caused problems and could be done away with.

  5. It's redundant. This is why the UNK token exists.

Well, only if we train it, which we don't yet.

  6. Makes error analysis a pain. Since we're including all split vocab, the UNK predictions of the dev, predict, or test scripts will be a pain, since all the characters outside training get their own, roughly equivalent probabilities. If we want to do Training UNK tokens #161 and have reliable results, we need to restrict OOV. That's done by only using train vocab.

I would be fine with something that actually trained UNKs, of course.

To be clear, I'm not trying to be truculent or prevent this, I just don't find these compelling yet. (2) would be convincing if we know for a fact that it's happening (I'm not sure yet, should be easy to show), or (3) if we have a profiling result, or (5-6) once #161 is implemented (and then it's a question whether you do it always or just if the feature is enabled).

Quick note about the implementation: If you're going ahead with this, shouldn't you kill the paths variable in the data module? Is it used for anything else?

bonham79 commented 4 months ago

There is no relevant sense in which the model is therefore "seeing the dev set". I'm sorry but that's irrelevant.

The model is using a vocabulary garnered from the dev set. This expands the decoder's vocabulary. This is formally changing the graph in accordance with the dev set. Definitely relevant from a formal perspective.

Okay, that could be relevant. Can you show us a toy case where that's happening? (I would like to see exactly what "move" or "rule" it's seeing.) My intuition is that that should be impossible because the EM procedure, which only sees train, should never be able to reinforce them.

Here's how it's breaking the transducer:

  • Index encodes vocabulary. Includes dev vocabulary. So a train vocabulary of, say, 60 and a dev vocabulary with one extra character gives an index range of [0,60], where just the train would give [0,59].
  • When passed to the expert, only edit actions from train are encoded. Let's say index 5 is the character only seen in the dev set. This means that the expert will see indexes [0,5) and [6,60].
  • Transducer model is instantiated. We instantiate based off vocab size. Since our vocabulary is size 61, the classifier head has predictions [0,60].
  • Exploration occurs: the model thinks it has options [0,60] and tries out an edit action with index 5, which was never seen by the expert. The expert throws an error because this action doesn't exist in the train set; the model thought it did because its classifier head is [0,60].
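
A toy sketch of that failure mode (illustrative only; the sizes and names are made up and this is not the actual yoyodyne transducer code):

    import torch

    train_actions = {i for i in range(61) if i != 5}  # expert built from train only; 5 is dev-only
    logits = torch.zeros(61)                          # classifier head sized from the full index
    probs = torch.softmax(logits, dim=0)              # near-uniform early in training
    action = int(torch.multinomial(probs, 1))         # random exploration can draw index 5
    if action not in train_actions:
        raise KeyError(f"edit action {action} was never seen by the expert")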

What are our options? Well, I can change the expert behavior to just replace an invalid edit action with a random one (done here), but that shouldn't happen in the first place, so it still needs to be solved.

We can also shift all actions when they are incongruous with the vocab. This would require monitoring which characters are from train and which are from dev, and then adding extra logic to only allow actions from train. At that point, we should just make the target characters the train characters anyhow.

That may be an argument here, but a data set which had this glaring disjunction would be incoherent anyways. I'd like to see a simple profiling result on a non-contrived example. Our reasoning about flops in a GPU setup like this is probably not great.

It's my weaker argument. Doesn't really show up on graph.

We used to have a lot of code to ensure the vocabulary was consistent on reload. I found that caused problems and could be done away with.

I just mean consistent as: one should know exactly where their model's vocabulary comes from. Just coming from your train manifest is reliable and the default assumption across experiments.

  5. It's redundant. This is why the UNK token exists.

Well, only if we train it, which we don't yet.

No, even without training, the UNK token exists precisely to catch OOV mismatches between train and other datasets. We should be seeing UNK in all cases where dev, predict, and test do not overlap with train. When I was troubleshooting this, I had to rifle through indexes to find a simple grave character. If UNK was being used instead, I could reliably just search for the index in a debug session.

  6. Makes error analysis a pain. Since we're including all split vocab, the UNK predictions of the dev, predict, or test scripts will be a pain, since all the characters outside training get their own, roughly equivalent probabilities. If we want to do Training UNK tokens #161 and have reliable results, we need to restrict OOV. That's done by only using train vocab.

I would be fine with something that actually trained UNKs, of course.

That's fine, but this setup won't even let us evaluate UNK reliably, because you're adding other junk characters into the vocab.

To be clear, I'm not trying to be truculent or prevent this, I just don't find these compelling yet. (2) would be convincing if we know for a fact that it's happening (I'm not sure yet, should be easy to show), or (3) if we have a profiling result, or (5-6) once #161 is implemented (and then it's a question whether you do it always or just if the feature is enabled).

No, I get you; my point is that the current way of doing it has no justification, so we should do it this way because it's the null case. It's small stuff, but it introduces additional considerations that create later tech debt. Better to do it the vanilla way instead.

Quick note about the implementation: If you're going ahead with this, shouldn't you kill the paths variable in the data module? Is it used for anything else?

Think this is your contribution so that'd be your decision. I was more focused on getting rid of the bug. I'd prefer not having paths as vars though, because that allows portability without revealing people's directory structure.

kylebgorman commented 4 months ago

There is no relevant sense in which the model is therefore "seeing the dev set". I'm sorry but that's irrelevant.

The model is using a vocabulary garnered from the dev set. This expands the decoder's vocabulary. This is formally changing the graph in accordance with the dev set. Definitely relevant from a formal perspective.

Okay, that could be relevant. Can you show us a toy case where that's happening? (I would like to see exactly what "move" or "rule" it's seeing.) My intuition is that that should be impossible because the EM procedure, which only sees train, should never be able to reinforce them.

Here's how it's breaking the transducer:

  • Index encodes vocabulary. Includes dev vocabulary. So a train vocabulary of, say, 60 and a dev vocabulary with one extra character gives an index range of [0,60], where just the train would give [0,59].
  • When passed to the expert, only edit actions from train are encoded. Let's say index 5 is the character only seen in the dev set. This means that the expert will see indexes [0,5) and [6,60].
  • Transducer model is instantiated. We instantiate based off vocab size. Since our vocabulary is size 61, the classifier head has predictions [0,60].
  • Exploration occurs: the model thinks it has options [0,60] and tries out an edit action with index 5, which was never seen by the expert. The expert throws an error because this action doesn't exist in the train set; the model thought it did because its classifier head is [0,60].

But shouldn't EM drive that option out to zero anyways? The exploration should never sample this. A snippet would help if I'm wrong though. Sorry to be difficult.

No, even without training, the UNK token exists precisely to catch OOV mismatches between train and other datasets. We should be seeing UNK in all cases where dev, predict, and test do not overlap with train. When I was troubleshooting this, I had to rifle through indexes to find a simple grave character. If UNK was being used instead, I could reliably just search for the index in a debug session.

I don't believe this works yet nor that this fix will cause it to work. There's no mechanism in place to ever cause something to be encoded as UNK, as far as I know.

Quick note about the implementation: If you're going ahead with this, shouldn't you kill the paths variable in the data module? Is it used for anything else?

Think this is your contribution so that'd be your decision. I was more focused on getting rid of the bug. I'd prefer not having paths as vars though, because that allows portability without revealing people's directory structure.

I didn't see how to implement a data module nicely without storing some paths. But I think this fix should get rid of any technical debt in the data module related to the change, so if you have time, take a quick look at my design and do away with the extra complexity there.

bonham79 commented 4 months ago

But shouldn't EM drive that option out to zero anyways? The exploration should never sample this. A snippet would help if I'm wrong though. Sorry to be difficult.


                if self.expert.explore():
                    # Action is picked by random exploration.
                    next_action = [
                        self.sample(probs) if nc else self.actions.end_idx
                        for probs, nc in zip(log_probs, not_complete)
                    ]

During early training, the model is allowed to randomly sample actions according to the log-probs. (It's an RL thing; apparently you need the pupil to have some early randomness for optimality.) When it does this it ignores the EM.

I don't believe this works yet nor that this fix will cause it to work. There's no mechanism in place to ever cause something to be encoded as UNK, as far as I know.

We pass target-vocab-size to the classifiers. This includes the UNK character in the count. Since BOS and EOS are learned in sequence models, this makes the index for UNK the only option when predicting OOV characters. Also, our symbol maps can tokenize OOV characters as UNK when tokenizing. So it's a quick check to just run dev through the datamodule and see what gets converted to UNK.
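
A quick check along those lines might look like this (a sketch assuming a plain dict-style vocabulary, not the actual SymbolMap API):

    def find_unks(dev_targets, vocab):
        """Returns the dev symbols that a train-only vocabulary would encode as UNK."""
        return sorted({sym for target in dev_targets for sym in target if sym not in vocab})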

I didn't see how to implement a data module nicely without storing some paths. But I think this fix should get rid of any technical debt in the data module related to the change, so if you have time, take a quick look at my design and do away with the extra complexity there.

Kk, can do.

kylebgorman commented 4 months ago

                if self.expert.explore():
                    # Action is picked by random exploration.
                    next_action = [
                        self.sample(probs) if nc else self.actions.end_idx
                        for probs, nc in zip(log_probs, not_complete)
                    ]

During early training, the model is allowed to randomly sample actions according to the log-probs. (It's an RL thing; apparently you need the pupil to have some early randomness for optimality.) When it does this it ignores the EM.

You're saying it ignores probs?

We pass target-vocab-size to the classifiers. This includes the UNK character in the count. Since BOS and EOS are learned in sequence models, this makes the index for UNK the only option when predicting OOV characters. Also, our symbol maps can tokenize OOV characters as UNK when tokenizing. So it's a quick check to just run dev through the datamodule and see what gets converted to UNK.

I still don't think that's sufficient for an UNK to be generated even if it's formally possible. Can you create any scenario where an UNK is actually a prediction? I have never seen it in practice.

bonham79 commented 4 months ago
        log_probs = nn.functional.log_softmax(logits, dim=1) <-----From model softmax function. Over all vocabulary.
        if optim_actions is None:    <-------------  Validation step
            # Argmax decoding.
            next_action = [
                torch.argmax(probs, dim=0) if nc else self.actions.end_idx
                for probs, nc in zip(log_probs, not_complete)
            ]
        else:
            # Training with dynamic oracle; chooses from optimal actions.
            with torch.no_grad():
                if self.expert.explore():   <------------------- If true, the model is left to select on its own. No SED
                    # Action is picked by random exploration.
                    next_action = [
                        self.sample(probs) if nc else self.actions.end_idx
                        for probs, nc in zip(log_probs, not_complete)
                    ]
                else:  <-------------------------------------  If false, the model asks the expert. SED occurs here.
                    # Action is picked from optim_actions.
                    next_action = []
                    for action, probs, nc in zip(
                        optim_actions, log_probs, not_complete
                    ):
                        if nc:
                            optim_logs = probs[action]
                            idx = int(torch.argmax(optim_logs, 0))
                            next_action.append(action[idx])
                        else:  # Already complete, so skip.
                            next_action.append(self.actions.end_idx)
        return torch.tensor(next_action, device=self.device, dtype=torch.int)

Early in training there's a lot of exploration. This is also when the softmax is close to a uniform distribution (since the model is randomly initialized). So that's when OOV tokens can be called.

I still don't think that's sufficient for an UNK to be generated even if it's formally possible. Can you create any scenario where an UNK is actually a prediction? I have never seen it in practice.

Lemme setup some toy data.

bonham79 commented 4 months ago

I still don't think that's sufficient for an UNK to be generated even if it's formally possible. Can you create any scenario where an UNK is actually a prediction? I have never seen it in practice.

Sorry, had lunch and my brain clicked. I was conflating two different things. Ignore generation during training. The debug aspect is that validation dataloaders will encode UNK symbols on tokenization. This is helpful for debugging since there's no confusion between characters seen in training and validation.

Not a strong answer, but another reason I think we should stick with only train vocab.

kylebgorman commented 4 months ago

What I suggest then:

  1. Put the transducer change into a separate PR. (Actually I'd like to know why it's necessary and what's happening without it, but that's minor.)
  2. Combine what remains with any needed redesign of the datamodule and with #161.

Adamits commented 4 months ago

I was skimming this via email while on the way home from the mountains yesterday and haven't reread in any detail yet, but generally my feeling is still that it seems better to not provide the dev/potentially test vocabulary items to the model during training. However, I think Kyle makes pretty good arguments that it does not really matter in many cases. One thing that matters to me is:

If a dev and/or test set is provided as an arg during training (we require a dev set), then a different dev vocabulary changes the model initialization (as it is a function of the vocabulary size). Pretty minor, but a bit awkward. If we want to implement UNK training stuff, then I think we need tricks to handle the fact that actual UNKs are known in our vocab.
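
To illustrate that point (a toy sketch with made-up sizes; the real decoder is more involved): the output projection is shaped by the vocabulary size, so one extra dev-only symbol changes the module's parameters and its checkpoints.

    import torch.nn as nn

    hidden_size = 256
    head_train_only = nn.Linear(hidden_size, 60)  # vocabulary built from train only
    head_with_dev = nn.Linear(hidden_size, 61)    # one extra dev-only symbol
    # The two heads have different parameter shapes, so their checkpoints are not interchangeable.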

Also:

We used to have a lot of code to ensure the vocabulary was consistent on reload. I found that caused problems and could be done away with.

All we really need to do is make sure that we always load the vocabulary from the training data. This can be via writing an index and loading it when using the model for inference, or enforcing that the training data is provided during inference for rereading the vocabulary. I don't think it should be complicated.
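
The first option is roughly what the pickled index already does; as a generic sketch (plain pickle, not the exact yoyodyne Index API):

    import pickle

    def write_index(index, path):
        """Writes the training-time index to disk."""
        with open(path, "wb") as sink:
            pickle.dump(index, sink)

    def read_index(path):
        """Reloads the same index at inference time."""
        with open(path, "rb") as source:
            return pickle.load(source)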

bonham79 commented 4 months ago

@kylebgorman The transducer change requires the dataloader fix. It's pointed out in the code snippet. As long as there's random search over the vocab, this issue will pop up. The alternative is to carry information distinguishing train vocab from dev vocab, and I do not see any rationale for doing that instead of simply basing the vocabulary on the train dataset only. For me, fixing the transducer means making the vocab change.

I'm really not understanding the contention. It's best practice to keep your model biased toward train information only. We have no rationale for doing it the current way, so we should revert to the null case of just loading vocab off train.

bonham79 commented 4 months ago

@Adamits I thought we already saved the index? (That's what I presume the pickle file is for.)

kylebgorman commented 4 months ago

All we really need to do is make sure that we always load the vocabulary from the training data. This can be via writing an index and loading it when using the model for inference, or enforcing that the training data is provided during inference for rereading the vocabulary. I don't think it should be complicated.

Not to get too far afield, but I think it's incredibly complicated as is. You're welcome to make proposals that reduce complexity of course, but having the index to read and write adds another painful step at every point. Let me put it to you this way. Can you write a class called, say, Predictor, which holds a trainer, index, etc. and which has methods for making batched and single-string predictions for source strings? It is of course possible, but can you do it in under 100 lines?

kylebgorman commented 4 months ago

@kylebgorman The transducer change requires the dataloader fix. It's pointed out in the code snippet.

The transducer change is one line and it removes an unused argument to the loss function. Its rationale seems unrelated to the data module and so should go in a separate PR with its own rationale description there.

(screenshot of the one-line diff removing the unused loss-function argument)

bonham79 commented 4 months ago

Only committing that change still leaves breaking behavior in the transducer whenever there's vocabulary in dev that's not in train. The two go together for the transducer.

kylebgorman commented 4 months ago

Only committing that change still leaves breaking behavior in the transducer whenever there's vocabulary in dev that's not in train. The two go together for the transducer.

What does the loss function reduction string possibly have to do with the vocabulary? The connection isn't obvious...

bonham79 commented 4 months ago

The two go together as a full on bug fix. Only fixing one without the other still leaves the model with bugs. Both issues should be fixed together.

kylebgorman commented 4 months ago

The two go together as a full on bug fix. Only fixing one without the other still leaves the model with bugs. Both issues should be fixed together.

Could you please explain why? I haven't the foggiest how they could be connected, and one expects this kind of information in a PR, particularly when it looks to a casual observer like two unrelated changes.

bonham79 commented 4 months ago

They are two unrelated changes, but they both affect the transducer, and if either is missing, the transducer will fail. They do not play off each other, but they both need to be in place for the transducer to not crash.

Changing loss: you never pass the reduction to the transducer. So it'll just crash.

Changing vocab: the transducer is allowed to randomly select edit actions based off the vocab (which includes all of dev and train). Meanwhile, the expert is only exposed to train. So when the transducer chooses an action based off dev vocab, the expert will crash, i.e. anytime the dev vocabulary contains symbols not in the train vocabulary.

Both cause crashing issues related to the same model. They are together because the transducer requires both to run. I can piecemeal them out into two separate PRs, but they're both going to need to be included for the target codebase to be functional.

kylebgorman commented 4 months ago

Thanks for the clarification; I thought you were saying earlier that they were logically related beyond "needed to fix transducer".

I don't want to police "one PR vs. two" so this is fine. But I do think it probably should be combined with #161.

Adamits commented 4 months ago

Can you write a class called, say, Predictor, which holds a trainer, index, etc. and which has methods for making batched and single-string predictions for source strings?

I am sort of confused about this comment. We already have a predict method, which loads the pickled index and uses that. https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/predict.py#L53

It actually seems like loading the pickled index is pretty simple compared to everything else in the data loading process. Or do you mean to say that all of the code in indexes.py/supporting an Index class is only necessary because we want to support reusing the train index?

kylebgorman commented 4 months ago

Can you write a class called, say, Predictor, which holds a trainer, index, etc. and which has methods for making batched and single-string predictions for source strings?

I am sort of confused about this comment. We already have a predict method, which loads the pickled index and uses that. https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/predict.py#L53

Yes but all of that code is completely useless for anything other than putting together a command-line tool, and it's also over 300 lines, and the minimal prediction call requires reading 4 different files.

Adamits commented 4 months ago

Yes but all of that code is completely useless for anything other than putting together a command-line tool, and it's also over 300 lines, and the minimal prediction call requires reading 4 different files.

I wonder if the index can be stored in the torch model state dict. Would that simplify everything? Also, I was thinking we just need the model checkpoint, the file to make predictions on, and the index file. Am I forgetting something?

kylebgorman commented 4 months ago

I wonder if the index can be stored in the torch model state dict. Would that simplify everything?

Absolutely.

Also, I was thinking we just need the model checkpoint, the file to make predictions on, and the index file. Am I forgetting something?

Also the output file path.

Adamits commented 4 months ago

Ok, I think putting the index in the state_dict is something I intended to do with V1 of this way back when it was still in your dataset repository :D. I can look into it.
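
One way that could look (a hedged sketch using the standard Lightning checkpoint hooks; the index attribute and serialization details are assumptions, not the current yoyodyne design):

    import pytorch_lightning as pl

    class ModelWithIndex(pl.LightningModule):
        def __init__(self, index=None, **kwargs):
            super().__init__()
            self.index = index  # hypothetical vocabulary/index object

        def on_save_checkpoint(self, checkpoint):
            # Ride the vocabulary along inside the checkpoint file.
            checkpoint["index"] = self.index

        def on_load_checkpoint(self, checkpoint):
            self.index = checkpoint["index"]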

I want to note though that I still do not really see how this is in contention with building the vocabulary from just the train set. We need a vocabulary mapping from somewhere. It can come from an index that needs to be loaded (we can think of this like a huggingface tokenizer), or it can be reproduced by reading the data. In both cases we want to have exactly one vocabulary, which must include the train set.

kylebgorman commented 4 months ago

I want to note though that I still do not really see how this is in contention with building the vocabulary from just the train set.

It's not that closely related, I admit.

bonham79 commented 4 months ago

So summarizing a few things (for my own mental sanity):

  1. This PR combines two fixes needed for transducer training. First is a minor arg change. Second is moving vocab loading to being train data exclusive.
  2. Because of the latter, it's probably a good idea to also include the fix that allows training of UNK tokens. (We don't know if it's a great idea, but it should be available for experiments nonetheless.)
  3. Had a realization that the vocab index should probably move into the state dict anyway. Adam was working on that previously.

This cover everything?

kylebgorman commented 4 months ago

+1 Travis

bonham79 commented 4 months ago

@kylebgorman @Adamits Added some goodies for https://github.com/CUNY-CL/yoyodyne/issues/161

Rather than setting a percent threshold (which would just be a pain to fine-tune because of corpus statistics and whatnot), I just added a coverage value to SymbolMap. So instead of caring about percentage of occurrence, the map only encodes the top X% of tokens. Then we can let the rest of the _encode logic in the dataclass take over, as those tokens would just be encoded as UNK.

I can do it other ways if y'all really want the percentage-of-occurrence deal, but this seemed the cleanest (we can just replace the vocabularies in Index with Counter and there's barely any additional overhead). It will also have better haptics when experimenting. (It's more intuitive to me to think "let's try only 85% of the vocabulary" than "I want vocabulary that occurs more than 0.5% of the time".)
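
A sketch of that coverage idea (illustrative only; SymbolMap's actual implementation and the Counter swap are assumptions): keep the most frequent symbols until the requested share of the token mass is covered, and let everything else encode as UNK.

    from collections import Counter

    def coverage_vocabulary(tokens, coverage=1.0):
        """Keeps the most frequent symbols covering `coverage` of the token mass."""
        counts = Counter(tokens)
        total = sum(counts.values())
        kept, covered = [], 0
        for symbol, count in counts.most_common():
            if covered / total >= coverage:
                break
            kept.append(symbol)
            covered += count
        return kept  # with coverage=1.0 every symbol is kept, i.e. a no-op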

How to use

    yoyodyne-train ...
        --source_coverage=[0.0,1.0]   <- 1.0 is a no-op
        --target_coverage=[0.0,1.0]   <- ditto
        --features_coverage=[0.0,1.0] <- don't know why this would be used, but might as well

Args are passed to the datamodule class, then passed as optional arguments to _make_index, which just passes it down to SymbolMap instantiation.

kylebgorman commented 4 months ago

There are a ton of different ways one might specify an UNK threshold: positive or negative count, positive or negative % of corpus, positive or negative % of vocabularies.

It seems to me this is the least interpretable of the three, and is not what I had in mind for #161. Can I talk you down to doing it with counts instead? That's what's done in FairSeq and is the only one that I have seen prior code for in this domain; it certainly is the simplest. Exploring the options could be interesting but it seems low priority, and will necessarily result in a more complicated interface because those are either mutually exclusive (or have to be applied in some ordered fashion).
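
For comparison, a count-based cutoff in the FairSeq style would look something like this (a sketch, not any existing yoyodyne flag):

    from collections import Counter

    def count_threshold_vocabulary(tokens, min_count=2):
        """Keeps symbols seen at least `min_count` times; rarer ones encode as UNK."""
        counts = Counter(tokens)
        return [symbol for symbol, count in counts.most_common() if count >= min_count]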

kylebgorman commented 4 months ago

Another possibility I want to put out there: we put all three ways of filtering into a (proper) fork of the repo, and you can experiment with them, figure out how to deal with the mutual exclusivity and/or ordering issues, maybe also figure out best practices (heck, why not write about it? the universe absolutely needs a paper about UNKing strategies for NNs) and merge it back in when you're ready. (This might justify separating the data loader fix even if it's semi-no-op.)

bonham79 commented 4 months ago

If we're okay with forking, that's fine. I can also just do comparisons against random masking and just averaging the UNK embedding (the former, I think, is where the improvement will likely come from imo). I'll do a rollback, but can we at least merge these changes?

kylebgorman commented 4 months ago

Forks are great, so long as they get merged back in. ;)

Could you do the following?

I think then the contents of this PR should just be a change to the data loader (and the associated one-line fix in the transducer is also kosher) - is my assumption correct?

bonham79 commented 4 months ago

Yep!

kylebgorman commented 4 months ago

@bonham79 ping when ready for review

bonham79 commented 4 months ago

@kylebgorman all checks pass and my changes are done.

bonham79 commented 4 months ago

Switched logging to just raising errors. Should at least make life easier on a debug.

bonham79 commented 4 months ago

So, related to this PR: I'm realizing that this setup has a problem with evaluation. If the predict file is not passed at training, chars not present in train or dev but present in predict are going to be encoded as UNK by the dataloader. So predictions can be slightly off. (Shouldn't be a common occurrence with our target tasks, but something to keep in mind.)

kylebgorman commented 4 months ago

I'm not sure I follow your last point. While you can create a data module with prediction data, it isn't used to build the index. So it doesn't matter if it's passed or not, you get the same index. I thought that was the point.

What then do you mean by "predictions can be slightly off"?

bonham79 commented 4 months ago

Ignore that comment, I'm a bit overworked and I think my brain is a few stations late.

We can merge.

Adamits commented 4 months ago

Sorry I'm late to review. This LGTM, though I saw one tiny thing I will open a PR for.