Hi!
I hope this clarified some things. If anything above is wrong, feel free to correct me (I'm also not 100% sure about some of it).
Thank you tons for the detailed response! At this point my code runs fine if I drop part of the sequence, and model training is good. But I was getting ambitious and tried padding with attention masks; it works fine until it reaches the collator, where it fails on the assertions. I think I might just go ahead without padding and drop the fragments unless I figure it out.
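For reference, this is roughly the kind of padding I was attempting before falling back to dropping the fragments (a sketch only, not my exact code; `expanded_inputs_length` and `pad_token_id` stand in for my actual values from the script and tokenizer):

```python
from itertools import chain

def group_texts_with_padding(examples, expanded_inputs_length, pad_token_id):
    """Like the stock group_texts, but pads the trailing fragment instead of dropping it."""
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])

    result = {}
    for key, tokens in concatenated.items():
        chunks = [tokens[i : i + expanded_inputs_length]
                  for i in range(0, total_length, expanded_inputs_length)]
        if chunks and len(chunks[-1]) < expanded_inputs_length:
            # pad input_ids with the pad token, everything else (e.g. attention_mask) with 0
            fill = pad_token_id if key == "input_ids" else 0
            chunks[-1] = chunks[-1] + [fill] * (expanded_inputs_length - len(chunks[-1]))
        result[key] = chunks
    return result
```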
Thanks again for your help
Hi @mheinzinger!
Continuing from https://github.com/agemagician/ProtTrans/issues/113, I think I am almost there with the preprocessing by re-using https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py#L337 with slight modifications (the span mask set to 1.0 being one of them).
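For context, this is roughly how I understand the length bookkeeping, rewritten as a standalone sketch (I am reading "span mask set to 1.0" as `mean_noise_span_length = 1.0`; the real helper is `compute_input_and_target_lengths` in run_t5_mlm_flax.py and may differ in detail):

```python
def lengths_for_span_corruption(desired_input_length: int,
                                noise_density: float,
                                mean_noise_span_length: float):
    """Return (raw_tokens_per_chunk, targets_length) for T5-style span corruption."""

    def after_masking(raw_len: int):
        num_noise = int(round(raw_len * noise_density))
        num_spans = max(1, int(round(num_noise / mean_noise_span_length)))
        inputs_len = (raw_len - num_noise) + num_spans + 1   # kept tokens + sentinels + </s>
        targets_len = num_noise + num_spans + 1              # masked tokens + sentinels + </s>
        return inputs_len, targets_len

    raw_len = desired_input_length
    # Grow the raw chunk until masking it yields (roughly) the desired input length.
    while after_masking(raw_len + 1)[0] <= desired_input_length:
        raw_len += 1
    return raw_len, after_masking(raw_len)[1]


# e.g. a 512-token model input, 15% masking, single-token spans
expanded_inputs_length, targets_length = lengths_for_span_corruption(512, 0.15, 1.0)
print(expanded_inputs_length, targets_length)
```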
I just have a few remaining questions I want to clarify:
In the `group_texts` function (https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py#L688), the first key in `examples` is `input_text`, which has a different length (by `2`) from the `input_ids` key because of the special tokens. Is this expected? Do we want `total_length` to be based on the input text rather than the input ids?

Also, after `group_texts` is applied my chunks have length `512` except the last one, which has length `181`. That last fragment is there unless I specify a batch size of `3`, which seems hand wavy.
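To make the two questions above concrete, this is roughly the part of `group_texts` I mean (my paraphrase of the script, so details may be slightly off; `expanded_inputs_length` is the chunk size the script computes from `max_seq_length`):

```python
from itertools import chain

expanded_inputs_length = 512  # placeholder; the script derives this value

def group_texts(examples):
    # Concatenate every column (input_text, input_ids, attention_mask, ...).
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    # total_length is taken from whichever key comes first; if that is input_text
    # (no special tokens) it differs from len(concatenated["input_ids"]).
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole chunk.
    if total_length >= expanded_inputs_length:
        total_length = (total_length // expanded_inputs_length) * expanded_inputs_length
    # Split into chunks of expanded_inputs_length.
    return {
        k: [t[i : i + expanded_inputs_length] for i in range(0, total_length, expanded_inputs_length)]
        for k, t in concatenated.items()
    }
```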
There is a `</s>` at the end, so should the check not be `if batch["input_ids"].shape[-1] != self.input_length + 1:` rather than `if batch["input_ids"].shape[-1] != self.input_length:`? For example, my final inputs look like:
raw sequence:
inputs:
and labels:
Similarly for the assertion that the targets length must match.
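For reference, the length checks I am referring to live in `FlaxDataCollatorForT5MLM.__call__` in that script; paraphrased here as a standalone helper, with toy numbers that only illustrate the off-by-one I am seeing (not my real configuration):

```python
import numpy as np

def check_collator_shapes(batch: dict, input_length: int, target_length: int) -> None:
    """Paraphrase of the two shape checks in FlaxDataCollatorForT5MLM."""
    if batch["input_ids"].shape[-1] != input_length:
        raise ValueError(
            f"`input_ids` are incorrectly preprocessed: length is "
            f"{batch['input_ids'].shape[-1]}, but should be {input_length}."
        )
    if batch["labels"].shape[-1] != target_length:
        raise ValueError(
            f"`labels` are incorrectly preprocessed: length is "
            f"{batch['labels'].shape[-1]}, but should be {target_length}."
        )

# Toy batch: input_ids come out one token longer than input_length because of
# the trailing </s>, which is exactly when the first check fires.
toy_batch = {
    "input_ids": np.zeros((2, 513), dtype=np.int32),
    "labels": np.zeros((2, 114), dtype=np.int32),
}
check_collator_shapes(toy_batch, input_length=512, target_length=114)  # raises ValueError
```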
Thanks so much for your help! I am just hoping to start training soon!