EIFY opened 1 year ago
@suchenzang
Hm, none of the cleanup PRs should have touched state dict logic, much less layer norms. The last time state dicts were touched was in https://github.com/facebookresearch/metaseq/pull/229 I think.
@EIFY do you see this same error in the 125m model? 350m was the only one trained without model parallelism, which has caused some issues in the past with integration.
Hmm, but https://github.com/facebookresearch/metaseq/pull/229 was merged on Jul 16. I can try git bisect tomorrow, but I am certain that the 350m model worked for me in Sep.
I haven't been able to run non-model parallelism models due to another issue (https://github.com/facebookresearch/metaseq/issues/378) 🙃
I did a bisect; this is the commit that started causing the error: https://github.com/facebookresearch/metaseq/commit/493e6017c18f7c2d3cd697693e6f9e33592f3612
cc @lilisierrayu
After commenting out the suggested line, the second error is caused by this commit in particular: https://github.com/facebookresearch/metaseq/commit/c4b33ba6e2cd9b33539bbb5a35d831096bde3282
Ok, did a bit of digging with @suchenzang, here is the summary:

On the first problem:

```python
setattr(cfg["model"], "inference", True)
```

from https://github.com/facebookresearch/metaseq/commit/493e6017c18f7c2d3cd697693e6f9e33592f3612 is a bug; we are figuring out the best way to fix it and will put out a fix.

On the second problem: it turns out that the 350M model, since it was trained without model parallelism, unintentionally ended up without a final layer norm. After the changes from https://github.com/facebookresearch/metaseq/commit/c4b33ba6e2cd9b33539bbb5a35d831096bde3282, the issue can be fixed in metaseq/models/transformer_decoder.py either by reverting the changes there or, more explicitly, by setting `self.layer_norm = None`.

Suggested actions:
1. Put up a fix for the first problem.
2. Keep the code related to the second issue as is, and instead retrain the 350M model with layer norms.
3. Merge the code paths with and without model parallelism to avoid similar problems in the future.
I think the first issue can be fixed by a one-line change (see this OmegaConf documentation):

```python
with omegaconf.open_dict(cfg):
    setattr(cfg["model"], "inference", True)
```
```
Missing key(s) in state_dict: "decoder.layer_norm.weight", "decoder.layer_norm.bias".
```

Is there a solution?
@andchir we haven't retrained the 350M model yet, but if you locally set `self.layer_norm = None` in metaseq/models/transformer_decoder.py, it should work.
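A toy sketch of what that workaround does (these class and method names are made up for illustration and are not the actual metaseq code): when `self.layer_norm` is `None`, the decoder's final-layer-norm step becomes a no-op, matching how the 350M checkpoint was trained:

```python
class IdentityNormStub:
    """Stand-in for torch.nn.LayerNorm, for illustration only."""
    def __call__(self, x):
        return x

class DecoderSketch:
    def __init__(self, use_final_layer_norm: bool):
        # The 350M OPT checkpoint has no decoder.layer_norm.* weights,
        # so loading it requires self.layer_norm = None here.
        self.layer_norm = IdentityNormStub() if use_final_layer_norm else None

    def finalize(self, x):
        # Mirrors the usual "apply final layer norm if present" pattern.
        if self.layer_norm is not None:
            x = self.layer_norm(x)
        return x

print(DecoderSketch(False).finalize([1.0, 2.0]))  # [1.0, 2.0]
```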
@ruanslv Thanks for the answer. It helped; the error no longer occurs. But I am getting strange text generation results. Example:
The technology world is reeling after Facebook ($FB) announced today are have have have have are are have have have have have have are have have have are have are have have have have are have are are are are are have have have have are have have have are are have have are have have are are have have have are are have have have are have have have have are have have have have are have have have have have have have have have have have are have are are have have have are have have have have have have are are have have have have are have ...
I think I should use a different model. Can you help me set up the constants? I don't understand what I should specify in the parameter if the model only comes in parts.

```python
MODEL_FILE = os.path.join(CHECKPOINT_FOLDER, "reshard.pt")  # I don't have such a file, I only have "reshard-model_part-0.pt", "..._part-1.pt"
```
I am trying to use OPT-1.3B.
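For what it's worth, model-parallel checkpoints like OPT-1.3B ship as per-rank shard files rather than a single `reshard.pt`. The helper below is hypothetical (metaseq resolves shards internally); it only illustrates the file layout, using a temporary directory with fake shard files:

```python
import glob
import os
import tempfile

def find_model_parts(checkpoint_folder):
    """Hypothetical helper: list model-parallel shard files in rank order."""
    pattern = os.path.join(checkpoint_folder, "reshard-model_part-*.pt")
    return sorted(glob.glob(pattern))

# Demo with fake shard files in a temporary directory.
with tempfile.TemporaryDirectory() as d:
    for i in (1, 0):
        open(os.path.join(d, f"reshard-model_part-{i}.pt"), "w").close()
    names = [os.path.basename(p) for p in find_model_parts(d)]

print(names)  # ['reshard-model_part-0.pt', 'reshard-model_part-1.pt']
```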
Just curious: before the breaking change https://github.com/facebookresearch/metaseq/commit/c4b33ba6e2cd9b33539bbb5a35d831096bde3282, we had
https://github.com/facebookresearch/metaseq/blob/50dbe6077bbb977cdd2a7b02ce778ffcf29e829e/metaseq/model_parallel/models/transformer_lm.py#L111-L112
where I believe `args.decoder_normalize_before` does two things:

1. applies layer norm before each sublayer inside the decoder blocks (pre-norm), and
2. adds a final layer norm after the last decoder layer.

Was the stability issue fixed by 1 & 2 together, or by 1 alone? If 1 alone was sufficient, what is the rationale for the final layer norm? Evidently, the 350M model training was stable without it 😅
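For reference, a minimal sketch of the two behaviors in question, assuming the standard pre-LN transformer convention (the function names are illustrative, not metaseq's):

```python
def pre_norm_block(x, norm, sublayer):
    # (1) layer norm applied *before* each sublayer, with a residual add
    return x + sublayer(norm(x))

def decoder_stack(x, blocks, final_norm=None):
    for norm, sublayer in blocks:
        x = pre_norm_block(x, norm, sublayer)
    # (2) final layer norm after the last block; the 350M model lacks this
    if final_norm is not None:
        x = final_norm(x)
    return x

identity = lambda v: v  # stand-in for LayerNorm / attention / FFN
print(decoder_stack(1.0, [(identity, identity)]))  # 2.0
```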
I also noticed that, in comparison to RobertaLMHead, `self.dense`, `self.activation_fn`, and `self.bias` for the final projection back to the size of the vocabulary are eliminated. I don't know if there are histories / rationales / experiments behind these decisions.
🐛 Bug
No longer able to load provided OPT checkpoint after recent changes
To Reproduce
Edit `metaseq/service/constants.py` as before, in my case:
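The reporter's exact values were not preserved here, but a hypothetical example of the kind of edit involved (the path is a placeholder, not the reporter's actual value) would be:

```python
import os

# Hypothetical values in metaseq/service/constants.py; adjust to your setup.
CHECKPOINT_FOLDER = "/path/to/opt-350m"  # placeholder path
MODEL_FILE = os.path.join(CHECKPOINT_FOLDER, "reshard.pt")

print(MODEL_FILE)
```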
and then run `metaseq-api-local`, but it no longer works. Apparently this can be traced back to when

```python
setattr(cfg["model"], "inference", True)
```

was added (https://github.com/facebookresearch/metaseq/pull/356). However, another issue surfaced even with that line commented out, which seems to be due to recent cleanup PRs (https://github.com/facebookresearch/metaseq/pull/366, https://github.com/facebookresearch/metaseq/pull/380, https://github.com/facebookresearch/metaseq/pull/381).
Expected behavior
`metaseq-api-local` up & running

Environment
pip