cdqa-suite / cdQA

⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
https://cdqa-suite.github.io/cdQA-website/
Apache License 2.0

Default value of min_df in BM25Retriever is wrong. #339

Open andreshazard opened 4 years ago

andreshazard commented 4 years ago

In the file retriever_sklearn.py, the docstring of the class BM25Retriever states that the default value of the parameter min_df is 1:

    min_df : float in range [0.0, 1.0] or int
        When building the vocabulary ignore terms that have a document frequency
        strictly lower than the given threshold. This value is also called cut-off
        in the literature. If float, the parameter represents a proportion of
        documents, integer absolute counts. This parameter is ignored if vocabulary
        is not None. (default is 1)

However, the constructor sets it to 2:

    def __init__(
        self,
        lowercase=True,
        preprocessor=None,
        tokenizer=None,
        stop_words="english",
        token_pattern=r"(?u)\b\w\w+\b",
        ngram_range=(1, 2),
        max_df=0.85,
        min_df=2,
        vocabulary=None,
        top_n=20,
        verbose=False,
        k1=2.0,
        b=0.75,
        floor=None,
    ):
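A mismatch like this can be checked mechanically by comparing the signature default with the value written in the docstring. The following is only a minimal sketch: `BM25RetrieverStub` is a hypothetical stand-in that copies just the relevant parameters (it is not the cdQA class), and the helpers `documented_default` and `actual_default` are names made up for this illustration. It also assumes the docstring uses the "(default is X)" convention quoted above.

```python
import inspect
import re


class BM25RetrieverStub:
    """Hypothetical stand-in mirroring only the relevant part of the signature.

    min_df : float in range [0.0, 1.0] or int
        When building the vocabulary ignore terms that have a document frequency
        strictly lower than the given threshold. (default is 1)
    """

    def __init__(self, max_df=0.85, min_df=2, vocabulary=None):
        self.max_df = max_df
        self.min_df = min_df
        self.vocabulary = vocabulary


def documented_default(cls, param):
    """Return the value from the first '(default is X)' after `param`'s entry, or None."""
    doc = inspect.getdoc(cls) or ""
    pattern = rf"^{re.escape(param)} :.*?\(default is ([^)]+)\)"
    match = re.search(pattern, doc, re.S | re.M)
    return match.group(1) if match else None


def actual_default(cls, param):
    """Return the default value declared in the constructor signature."""
    return inspect.signature(cls.__init__).parameters[param].default


doc_value = documented_default(BM25RetrieverStub, "min_df")
sig_value = actual_default(BM25RetrieverStub, "min_df")

# Report whether the documented and the declared defaults agree.
if str(sig_value) != doc_value:
    print(f"min_df: docstring says {doc_value}, signature says {sig_value}")
```

Pointing the two helpers at the real class instead of the stub should work the same way, provided its docstring follows the same format.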

Shouldn't the value be 1, as the docstring states? Or should the docstring be updated instead? Personally, I get better results with the value set to 1.
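To illustrate why the default matters, here is a rough sketch of what the min_df cut-off does to the vocabulary, following the docstring's description (terms with a document frequency strictly lower than the threshold are ignored). This is a simplified pure-Python illustration, not the cdQA or scikit-learn implementation: the `vocabulary` function and the three sample documents are made up, and it ignores stop words, n-grams, max_df and the token pattern.

```python
from collections import Counter


def vocabulary(docs, min_df):
    """Keep terms whose document frequency is at least `min_df` (absolute count)."""
    df = Counter()
    for doc in docs:
        # Count each term once per document, not once per occurrence.
        df.update(set(doc.lower().split()))
    return sorted(term for term, count in df.items() if count >= min_df)


# Made-up sample corpus for illustration only.
docs = [
    "bm25 ranks documents",
    "bm25 uses term frequency",
    "rare terms identify documents",
]

vocab_1 = vocabulary(docs, min_df=1)  # every term is kept
vocab_2 = vocabulary(docs, min_df=2)  # terms seen in a single document are dropped

print(len(vocab_1), vocab_1)
print(len(vocab_2), vocab_2)
```

On a small corpus, min_df=2 removes every term that appears in only one document, and those terms are often the ones that distinguish that document from the rest. That may be part of why 1 gives me better results, although I have not verified this on the real retriever.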

I can create a pull request with either of the solutions.

Thanks team.