Hello, I notice in this table that positional encoding without the forgetting gate gives a significant performance boost to the model: APE alone reaches 80.0% (+1.6% over the forgetting-gate baseline at 78.4%); LePE alone 81.6% (+3.2%); CPE alone 81.7% (+3.3%); RoPE alone 80.0% (+1.6%). How did you choose among them? In your code you add CPE + RoPE + LePE, but not APE. Have you done any more specific ablation experiments?
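For context on what such a CPE + RoPE + LePE combination typically looks like, here is a minimal NumPy sketch of one attention block that applies all three. This is my own illustration, not the authors' implementation: the function names (`dwconv3x3`, `rope`, `attn_block`) and the choice of a 3x3 depthwise convolution for both CPE and LePE are assumptions based on how these encodings are commonly described (CPE added to the tokens before attention, RoPE rotating queries/keys, LePE added to the attention output).

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8          # toy token grid (H*W tokens) and channel dim
N = H * W

def dwconv3x3(t, H, W, w):
    """Depthwise 3x3 conv over the token grid; t: (H*W, C), w: (C, 3, 3)."""
    grid = t.reshape(H, W, -1)
    pad = np.pad(grid, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(grid)
    for i in range(3):
        for j in range(3):
            out += pad[i:i + H, j:j + W, :] * w[:, i, j]
    return out.reshape(H * W, -1)

def rope(t, pos, base=10000.0):
    """Rotary positional embedding; rotates channel pairs by position-dependent angles."""
    half = t.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = t[:, :half], t[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

# Random projection weights and per-channel depthwise kernels (illustrative only).
Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
w_cpe = rng.standard_normal((C, 3, 3)) * 0.1
w_lepe = rng.standard_normal((C, 3, 3)) * 0.1

def attn_block(x):
    x = x + dwconv3x3(x, H, W, w_cpe)            # CPE: conv-based encoding on tokens
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    pos = np.arange(N)
    q, k = rope(q, pos), rope(k, pos)            # RoPE: rotate queries and keys
    s = q @ k.T / np.sqrt(C)
    a = np.exp(s - s.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)            # softmax attention
    return a @ v + dwconv3x3(v, H, W, w_lepe)    # LePE: conv on V added to the output

out = attn_block(rng.standard_normal((N, C)))
print(out.shape)  # token count and channel dim are preserved
```

One property worth noting: RoPE is a pure rotation, so it preserves the norm of q and k and only injects relative-position information into the dot product, whereas CPE and LePE add learned local content, which may explain why the paper favors combining them rather than stacking APE on top.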