Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0

Adding filters to paper-qa Docs #707

Closed whitead closed 2 days ago

whitead commented 2 days ago

This adds a new filter mechanism to exclude papers from the Docs object via settings.

For example, to exclude a specific DOI:

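from paperqa import Settings

settings = Settings()
# A leading "!" inverts the filter: reject documents whose DOI matches.
settings.parsing.doc_filters = [{"!doi": "xxxx/xxxxxx"}]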

Or, to consider only documents from 2020 or 2018:

settings.parsing.doc_filters = [
    {"year": "2020"},
    {"year": "2018"},
]

Description:

Optional filters that allow only documents matching the filter. Each filter is a dictionary whose keys are fields from DocDetails or Docs and whose values are the values to match. To invert a filter, prefix the key with '!'. If the key is not found on a Doc, the Doc is rejected by default. To change this behavior, prefix the key with '?', which lets the Doc pass when the key is not found. For example, {'!title': 'bad title', '?year': '2022'} would allow only Docs whose title is not 'bad title' and whose year is either 2022 or missing.
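
To make the rules above concrete, here is a minimal standalone sketch of how one filter dictionary could be evaluated against a document. It is not paper-qa's implementation: the function names `matches_filter` and `passes_any` are hypothetical, documents are modeled as plain dicts, and values are compared as strings. It also assumes that a document passes when it matches any one dictionary in the list, which is what the 2020/2018 example suggests.

```python
from typing import Any


def matches_filter(doc: dict[str, Any], criteria: dict[str, Any]) -> bool:
    """Return True if `doc` satisfies every condition in one filter dict.

    Hypothetical helper for illustration; not part of paper-qa.
    """
    for raw_key, wanted in criteria.items():
        invert = raw_key.startswith("!")   # '!' inverts the comparison
        relaxed = raw_key.startswith("?")  # '?' tolerates a missing key
        key = raw_key.lstrip("!?")

        if doc.get(key) is None:
            if relaxed:
                continue      # missing key is allowed for '?' filters
            return False      # missing key rejects the document by default

        # Assumption: compare as strings so that 2022 matches "2022".
        is_equal = str(doc[key]) == str(wanted)
        if is_equal == invert:
            return False      # equal but inverted, or unequal and not inverted
    return True


def passes_any(doc: dict[str, Any], filters: list[dict[str, Any]]) -> bool:
    """Assumption: a document passes if it matches at least one filter dict."""
    return any(matches_filter(doc, f) for f in filters)


example = {"!title": "bad title", "?year": "2022"}
print(matches_filter({"title": "good title", "year": 2022}, example))  # True
print(matches_filter({"title": "good title"}, example))                # True
print(matches_filter({"title": "bad title", "year": 2022}, example))   # False
```

Under these assumptions, conditions inside one dictionary combine with AND, while separate dictionaries in the list combine with OR.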