Hi, I've read your paper and have some questions about the MLM you mentioned: "... we mask out the value of coordinate or direction vectors in the input text prompt and promote the model to infill the missing characters."
As far as I know, LLaMA is a decoder-only transformer-based LLM, i.e., it is trained with next-token prediction. This differs from encoder-only LLMs like BERT, which can apply MLM because every position can attend to the full context; a decoder-only model can only access the tokens before the one being predicted. How did you apply MLM to LLaMA? Did you modify the causal mask?
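For reference, here is a minimal PyTorch sketch of the distinction I mean (function names are mine, not from the paper): under the causal mask each position attends only to itself and earlier positions, whereas BERT-style MLM assumes a fully bidirectional mask.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where attention is allowed: lower triangle, so each
    # position sees only itself and earlier positions (LLaMA-style).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def bidirectional_mask(seq_len: int) -> torch.Tensor:
    # BERT-style: every position attends to every position,
    # which is what classic MLM infilling relies on.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

m = causal_mask(4)
# m[0, 1] is False: position 0 cannot see the future token at position 1.
```

So a masked coordinate in the middle of the prompt would, under the standard causal mask, be invisible to the tokens before it, which is why I am asking whether the mask was changed.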