CStanKonrad / long_llama

LongLLaMA is a large language model capable of handling long contexts. It is based on OpenLLaMA and fine-tuned with the Focused Transformer (FoT) method.
Apache License 2.0
1.45k stars 85 forks

Comparison with other tuning methods #6

Closed FLLLIGHT closed 1 year ago

FLLLIGHT commented 1 year ago

Thanks for your interesting work! I have some questions about it:

  1. In my opinion, TruthfulQA is just an ordinary dataset, no different from other datasets (like MedicalQA). So is your work simply an interesting method for fitting a given dataset (by adjusting the distribution), or can it improve the model's general ability to generate more "truthful" answers?

  2. In Table 1, you compare your method with supervised fine-tuning and few-shot prompting. Is there any comparison between your method and other tuning methods such as LLaMA+LoRA? If possible, could you also compare it with LLaMA+LangChain? In practice, if we want LLaMA to generate more precise answers, we would consider LLaMA+LangChain first, though I find that approach inelegant and don't like the idea of pairing an LLM with a database.
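For context on what an "LLaMA+LoRA" comparison would tune: LoRA freezes the pretrained weight matrix W and learns only a low-rank update scaled by alpha/r. The NumPy sketch below illustrates that idea on a single linear layer; all dimensions and names here are illustrative toy values, not from this repo or any LoRA library.

```python
import numpy as np

# LoRA replaces a frozen weight W with W + (alpha / r) * B @ A, where
# A (r x d_in) and B (d_out x r) are the only trainable parameters.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 4, 8  # toy sizes; real models use d >> r

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init

def lora_forward(x):
    # Frozen base path plus the low-rank adapter path, scaled by alpha/r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Because B starts at zero, the adapted layer initially matches the
# frozen layer exactly; training then moves only A and B.
assert np.allclose(lora_forward(x), W @ x)
```

The zero initialization of B is what makes LoRA safe to attach to a pretrained model: at step zero the network's outputs are unchanged, and gradient updates flow only through the small A and B matrices.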

FLLLIGHT commented 1 year ago

Sorry, I intended to raise this issue in another repo.