HuskyInSalt / CRAG

Corrective Retrieval Augmented Generation

Fix: data preprocessing in inference produces passages list that doesn't match queries list #2

Open matthmeyer opened 7 months ago

matthmeyer commented 7 months ago

The function data_preprocess(file) in CRAG_Inference should produce a passages array of the same length as the queries array. However, while testing with the PopQA dataset, I found that the passages array is much longer than the queries array.

The cause is incorrect indentation. tmp_psgs is appended to passages after every line of the preprocessed file. However, tmp_psgs should only be appended when the query differs from the previous line's query, or once the loop over the lines has finished. Changing the indentation restores the intended behavior.
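To make the report concrete, here is a minimal sketch of the two behaviors. It is not the repository's code: the function names, and the assumption that each input line is a tab-separated "query<TAB>passage" pair, are hypothetical and chosen only for illustration.

```python
def preprocess_per_line(lines):
    """Hypothetical reconstruction of the reported behavior:
    tmp_psgs is appended to passages once per input line."""
    queries, passages, tmp_psgs = [], [], []
    for line in lines:
        # Assumed line format: "query<TAB>passage" (not confirmed by the repo).
        q, p = line.rstrip("\n").split("\t", 1)
        if not queries or q != queries[-1]:
            queries.append(q)
            tmp_psgs = []
        tmp_psgs.append(p)
        passages.append(tmp_psgs)  # runs for every line, so passages outgrows queries
    return queries, passages


def preprocess_per_query(lines):
    """Intended behavior: one passage group per distinct consecutive query."""
    queries, passages, tmp_psgs = [], [], []
    for line in lines:
        q, p = line.rstrip("\n").split("\t", 1)
        if not queries or q != queries[-1]:
            if queries:
                passages.append(tmp_psgs)  # flush the previous query's group
            queries.append(q)
            tmp_psgs = []
        tmp_psgs.append(p)
    if queries:
        passages.append(tmp_psgs)  # flush the last group after the loop ends
    return queries, passages
```

With three lines covering two queries, the first variant returns three passage entries for two queries, while the second returns exactly one group per query.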

HuskyInSalt commented 7 months ago

Both passages and queries in data_preprocess(file) receive a new item at the same time, namely when the input query differs from the previous line's query. Thus they should have the same length.

The role of tmp_psgs is to collect all passages retrieved for the same single query. It is appended to passages only when the current query changes (q != queries[-1]).
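One way to check this invariant independently is to group consecutive lines by query with itertools.groupby, which yields equal-length lists by construction. This is only a sketch for comparison, not what data_preprocess(file) does internally; the function name and the tab-separated "query<TAB>passage" line format are assumptions.

```python
from itertools import groupby


def group_by_query(lines):
    """Group consecutive lines that share a query (hypothetical helper)."""
    # Assumed line format: "query<TAB>passage" (not confirmed by the repo).
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    queries, passages = [], []
    # groupby only merges adjacent items, matching the q != queries[-1] check.
    for q, group in groupby(pairs, key=lambda qp: qp[0]):
        queries.append(q)
        passages.append([p for _, p in group])
    return queries, passages
```

Comparing len(queries) and len(passages) from data_preprocess(file) against this output on the same file would show whether the two lists stay aligned.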