Open hujunyi96 opened 1 year ago
@hujunyi96 Thanks for checking the paper! Please take a look at the baseline subsections in the Experiments section.
1. "MPCFormer w/o {d} also constructs the approximated model S' but trains S' on D with the task-specific objective, i.e., without distillation. We note that S' is initialized with weights in T, i.e., with different functions, whose effect has not been studied. We thus propose a second baseline, MPCFormer w/o {p,d}, which trains S' on D without distillation and with random weight initialization." (From the baseline subsections in the Experiments section.)
2. "'p' stands for using weights in T as initialization, 'd' stands for applying knowledge distillation with T as the teacher." (From Table 2.)
Hello @DachengLi1, I think the two statements are contradictory, aren't they? Simply put, could you please explain directly what procedures (1) MPCBert-B, (2) MPCBert-B w/o {d}, and (3) MPCBert-B w/o {p,d} have each gone through? Thanks a lot!
@hujunyi96 Definitely! Taking Bert-Base on CoLA as an example, T is a Bert-Base fine-tuned on CoLA. (1) MPCBert-B is our method: trained with the distillation objective with T as the teacher, starting from a Bert-Base. (2) MPCBert-B w/o {d}: trained with the task objective, starting from a Bert-Base. (3) MPCBert-B w/o {p,d}: trained with the task objective, starting from a randomly initialized Bert-Base architecture (not trained at all).
Note: all three of these models are S', which uses the approximations. Only T uses GeLU + Softmax, in case that is confusing.
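For readers skimming the thread, the three variants in the reply above can be written down as a small lookup table. This is only a transcription of that reply into Python, not code from the repository; the `Variant` class, its field names, and the helper functions are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    init: str       # where the weights of S' start from
    objective: str  # which loss S' is trained with


# Transcribed from the maintainer's reply; labels are illustrative, not repo flags.
VARIANTS = {
    "MPCBert-B": Variant(init="Bert-Base", objective="distillation from T"),
    "MPCBert-B w/o {d}": Variant(init="Bert-Base", objective="task objective"),
    "MPCBert-B w/o {p,d}": Variant(init="random", objective="task objective"),
}


def uses_p(name: str) -> bool:
    """True when the variant does not start from random weights."""
    return VARIANTS[name].init != "random"


def uses_d(name: str) -> bool:
    """True when the variant is trained by distillation with T as the teacher."""
    return VARIANTS[name].objective.startswith("distillation")


for name in VARIANTS:
    print(f"{name}: p={uses_p(name)}, d={uses_d(name)}")
```

In all three rows the trained model is S' (with the approximations); only the starting weights and the training objective differ.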
Hello @DachengLi1, when I tried to use the parameter "--hidden_act quad" to train baselines with approximations, which are the first major innovation in your paper (the second being distillation), an error occurred: KeyError: 'quad'. This means the installed transformers library does not support the new activation functions from your paper as a value for 'hidden_act' in the BertConfig class (the exact library file that raises the error is xxx/site-package/transformers/activations.py, line 208, in __getitem__). So I wonder how you implemented the quad function, since the current code in this repo raises the error above when run. Did you modify the library's Python source files?
I am looking forward to your reply, thanks!
@hujunyi96 We have a modified version of Transformers that does this: https://github.com/DachengLi1/MPCFormer/tree/main/transformers. In particular, see here: https://github.com/DachengLi1/MPCFormer/blob/38cb42cb194bfaa2d8deb1e7a9ce7e33543e7519/src/main/transformer/modeling.py#L139. Maybe you are using the one installed in your environment? It should be easy to fix by checking the file path.
Or, even simpler, you can just copy and paste these few new functions wherever you want them.
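To illustrate why an unmodified library raises KeyError and what "copying the new functions" amounts to, here is a minimal sketch. The stock transformers library resolves `hidden_act` by name in a mapping (`ACT2FN` in `transformers/activations.py`), so an unregistered name fails at lookup. The `ACT_REGISTRY` dict below is a hypothetical stand-in for that mapping, and the quadratic coefficients are an assumption based on the GeLU approximation described in the paper; check them against the linked `modeling.py` before relying on them. The real library operates on tensors, not Python floats.

```python
import math


def gelu(x: float) -> float:
    """Exact GeLU via the Gaussian CDF, included only for comparison."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def quad(x: float) -> float:
    """Quadratic replacement for GeLU (coefficients assumed, see note above)."""
    return 0.125 * x * x + 0.25 * x + 0.5


# Hypothetical stand-in for a name -> activation mapping such as ACT2FN.
ACT_REGISTRY = {"gelu": gelu}


def get_activation(name: str):
    """Look up an activation by name; unknown names raise KeyError."""
    return ACT_REGISTRY[name]


try:
    get_activation("quad")
except KeyError as err:
    print("before registration: KeyError", err)

# Registering the new function under the name passed on the command line.
ACT_REGISTRY["quad"] = quad

act = get_activation("quad")
for x in (-1.0, 0.0, 1.0):
    print(f"x={x:+.1f}  gelu={gelu(x):.4f}  quad={act(x):.4f}")
```

With the real library, the equivalent step would be adding the function to whichever copy of the activation mapping is actually imported at run time, which is why the file path matters.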
@DachengLi1 I see. I was following the main procedure in README.md in the /baselines folder. As you can see, the commands listed there actually execute run_glue.py, which doesn't seem to import [MPCFormer/src/main/transformer/modeling.py] as a module, so I didn't notice that the module is already in the project. Thanks for your help!
How does running "pip install -e ." in the path "/MPCFormer/transformers" install modules that live in a different path, "[/src/main/transformer/]"?
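One way to check which copy of a package Python actually resolves, independent of how it was installed, is to ask the import system for the module's file location. The sketch below uses only the standard library; the helper name `module_origin` is hypothetical. It does not answer how the repository's install is configured, it only shows where an import of a given name would load from in the current environment.

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # "transformers" prints None when the package is not installed at all.
    for name in ("transformers", "json"):
        print(name, "->", module_origin(name))
```

If the printed path for `transformers` points into site-packages rather than into the cloned repository, the environment's stock copy is the one being used.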
As written in your paper: "'p' stands for using weights in T as initialization, 'd' stands for applying knowledge distillation with T as the teacher." My question is: does "using weights in T as initialization" mean a fine-tuned model? E.g., does "p" stand for "fine-tuning", so that (1) MPCBert-B stands for the most basic pre-trained transformer, (2) MPCBert-B w/o {d} stands for applying KD on the most basic pre-trained transformer, and (3) MPCBert-B w/o {p,d} stands for applying KD on a fine-tuned transformer?