deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Cannot get filtering through API to work #2249

Closed fvanlitsenburg closed 2 years ago

fvanlitsenburg commented 2 years ago

Question

Currently I am running Haystack on an EC2 instance, on top of Elasticsearch. It works just fine, except that the query filters are not applied. What am I doing wrong? I send the request below through the Swagger page at :8000/docs. I can share the EC2 instance, but would rather do so on a private channel.

{
  "query": "Wat is de definitie van persoonsgegevens?",
  "params": {"filters": {"areaLevel2": ["Belastingrecht","Overig civiel recht"]}}
}
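For reproducing this outside the Swagger page, the same request body can be built in a short script. This is a minimal sketch, not the project's client: the `/query` path and the `http://localhost:8000` base URL are assumptions (the thread only mentions `:8000/docs`), and `build_query_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_query_request(base_url, query, filters):
    """Build (but do not send) a POST request carrying a query and its filters."""
    payload = {"query": query, "params": {"filters": filters}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/query",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request(
    "http://localhost:8000",  # placeholder host and port
    "Wat is de definitie van persoonsgegevens?",
    {"areaLevel2": ["Belastingrecht", "Overig civiel recht"]},
)

# To actually send it (requires the REST API to be running):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Building the request separately from sending it makes it easy to print `request.data` and confirm that the body matches what was typed into the Swagger page.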

The output I get is below. I have bolded the relevant field, areaLevel2, which seems to show that the filters have not been applied.

{
  "query": "Wat is de definitie van persoonsgegevens?",
  "answers": [
    {
      "answer": "de Wbp",
      "type": "extractive",
      "score": 0.5827369391918182,
      "context": "p grond van artikel 35 van de Wet bescherming persoonsgegevens (hierna: de Wbp) aan verweerder verzocht een volledig overzicht te verstrekken van de d",
      "offsets_in_document": [
        {
          "start": 471,
          "end": 477
        }
      ],
      "offsets_in_context": [
        {
          "start": 72,
          "end": 78
        }
      ],
      "document_id": "ECLI:NL:RBARN:2011:BU9079",
      "meta": {
        "Unnamed: 0": 166911,
        "issued": "2013-04-05",
        "judicialAuthority": "Rechtbank Arnhem",
        "date": "2011-10-05T00:00:00",
        "caseNumber": "AWB 10/2952",
        "caseType": "Uitspraak",
        "procedure": "Eerste aanleg - enkelvoudig",
        "journalComments": "['Rechtspraak.nl']",
        "summary": "      Wet bescherming persoonsgegevens. Bestand. Verslagen verzuimgesprekken; b&w-besluitadviezen; worddocument in digitale map.    ",
        "title": "ECLI:NL:RBARN:2011:BU9079: zaak AWB 10/2952 van 2011-10-05 bij de Rechtbank Arnhem. Eerste aanleg - enkelvoudig",
        "fileSize": 13625,
        "judicialAuthorityLevel1": "Rechtbank Gelderland",
        "judicialAuthorityLevel2": null,
        "BES": "Niet-BES",
        "areaLevel1": "Bestuursrecht",
        "areaLevel2": null,
        "areaLevel3": null,
        "documentType": "Jurisprudentie",
        "added": "2021-09-30T00:00:00",
        "name": "25638"
      }
    },
    {
      "answer": "Wbp",
      "type": "extractive",
      "score": 0.18006868660449982,
      "context": "als bedoeld in artikel 8 van de Wet bescherming persoonsgegevens (hierna: Wbp). Er is geen sprake van onevenredige benadeling van eiser of van disprop",
      "offsets_in_document": [
        {
          "start": 2620,
          "end": 2623
        }
      ],
      "offsets_in_context": [
        {
          "start": 74,
          "end": 77
        }
      ],
      "document_id": "ECLI:NL:RBDHA:2017:15813",
      "meta": {
        "Unnamed: 0": 142882,
        "issued": "2018-01-17",
        "judicialAuthority": "Rechtbank Den Haag",
        "date": "2017-12-01T00:00:00",
        "caseNumber": "AWB - 17 _ 4140",
        "caseType": "Uitspraak",
        "procedure": "Eerste aanleg - enkelvoudig",
        "journalComments": "['Rechtspraak.nl']",
        "summary": "      Verzoek om de plaatsing op een ARO-lijst (Agressie Registratiesysteem Overheden) ongedaan te maken, afgewezen.    ",
        "title": "ECLI:NL:RBDHA:2017:15813: zaak AWB - 17 _ 4140 van 2017-12-01 bij de Rechtbank Den Haag. Eerste aanleg - enkelvoudig",
        "fileSize": 18868,
        "judicialAuthorityLevel1": "Rechtbank Den Haag",
        "judicialAuthorityLevel2": null,
        "BES": "Niet-BES",
        "areaLevel1": "Bestuursrecht",
        **"areaLevel2": null,**
        "areaLevel3": null,
        "documentType": "Jurisprudentie",
        "added": "2021-09-30T00:00:00",
        "name": "35883"
      }
    },
    {
      "answer": "arts",
      "type": "extractive",
      "score": 0.1736966371536255,
      "context": "dat het verzamelen en verwerken van persoonsgegevens van [eiseres] , die arts is, gelet op de doeleinden waarvoor zij worden verzameld en verwerkt, to",
      "offsets_in_document": [
        {
          "start": 15993,
          "end": 15997
        }
      ],
      "offsets_in_context": [
        {
          "start": 73,
          "end": 77
        }
      ],
      "document_id": "ECLI:NL:RBOVE:2019:3755",
      "meta": {
        "Unnamed: 0": 115315,
        "issued": "2019-10-18",
        "judicialAuthority": "Rechtbank Overijssel",
        "date": "2019-10-09T00:00:00",
        "caseNumber": "C/08/224031 / HA RK 18-146",
        "caseType": "Uitspraak",
        "procedure": "Beschikking",
        "journalComments": "['Rechtspraak.nl', 'GZR-Updates.nl 2019-0263', 'PS-Updates.nl 2019-1232']",
        "summary": "      Verzoekschriftprocedure als bedoeld in artikel 35 UAVG. Gevraagde verklaringen voor recht gaan het bestek van deze verzoekschriftprocedure te buiten en worden daarom afgewezen. Het verzoek ZorgkaartNederland te gebieden de persoonsgegevens van verzoekster te verwijderen en verwijderd te houden is ontvankelijk. Geen zwarte lijst waarvoor een vergunning van de Autoriteit Persoonsgegevens is vereist. ZorgkaartNederland is geen overheidsinstantie in de zin van artikel 6 lid 1 sub f AVG. Verwerking van de persoonsgegevens is noodzakelijk voor de behartiging van de gerechtvaardigde belangen van ZorgkaartNederland. Onvoldoende aanleiding het belang van verzoekster te laten prevaleren. Volgt afwijzing van het verzochte.    ",
        "title": "ECLI:NL:RBOVE:2019:3755: zaak C/08/224031 / HA RK 18-146 van 2019-10-09 bij de Rechtbank Overijssel. Beschikking",
        "fileSize": 32859,
        "judicialAuthorityLevel1": "Rechtbank Overijssel",
        "judicialAuthorityLevel2": null,
        "BES": "Niet-BES",
        "areaLevel1": "Civiel recht",
        **"areaLevel2": "Overig civiel recht",**
        "areaLevel3": "Civiel recht",
        "documentType": "Jurisprudentie",
        "added": "2021-09-30T00:00:00",
        "name": "32559"
      }
    },
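Rather than reading the metadata by eye, the returned answers can be checked against the requested filters programmatically. The sketch below assumes the response shape shown above (an `answers` list whose items carry `document_id` and `meta`); `filter_violations` is a hypothetical helper, and the sample response is abbreviated to the fields that matter.

```python
def filter_violations(response, filters):
    """Return the document_ids of answers whose metadata does not satisfy the filters.

    `filters` maps a metadata field to the list of accepted values.
    """
    violations = []
    for answer in response.get("answers", []):
        meta = answer.get("meta") or {}
        for field, allowed in filters.items():
            if meta.get(field) not in allowed:
                violations.append(answer.get("document_id"))
                break  # one failing field is enough to flag this answer
    return violations


# Abbreviated version of the three answers shown above.
sample_response = {
    "answers": [
        {"document_id": "ECLI:NL:RBARN:2011:BU9079", "meta": {"areaLevel2": None}},
        {"document_id": "ECLI:NL:RBDHA:2017:15813", "meta": {"areaLevel2": None}},
        {"document_id": "ECLI:NL:RBOVE:2019:3755", "meta": {"areaLevel2": "Overig civiel recht"}},
    ]
}
requested = {"areaLevel2": ["Belastingrecht", "Overig civiel recht"]}

print(filter_violations(sample_response, requested))
# ['ECLI:NL:RBARN:2011:BU9079', 'ECLI:NL:RBDHA:2017:15813']
```

A non-empty result means at least one answer came from a document that the filter should have excluded.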

Additional context

Steps I have already taken to address the issue:

YAML pipeline:

version: '0.9'

components:    # define all the building-blocks for Pipeline
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: localhost
  - name: Retriever-Rank
    type: ElasticsearchRetriever
    params:
      document_store: DocumentStore    # params can reference other components defined in the YAML
      top_k: 5
  - name: Retriever
    type: ElasticsearchRetriever
    params:
      document_store: DocumentStore    # params can reference other components defined in the YAML
      top_k: 5
  - name: Reader       # custom-name for the component; helpful for visualization & debugging
    type: FARMReader    # Haystack Class name for the component
    params:
      model_name_or_path: henryk/bert-base-multilingual-cased-finetuned-dutch-squad2
      use_gpu: True
  - name: Ranker       # custom-name for the component; helpful for visualization & debugging
    type: SentenceTransformersRanker    # Haystack Class name for the component
    params:
      model_name_or_path: amberoad/bert-multilingual-passage-reranking-msmarco
  - name: Classifier       # custom-name for the component; helpful for visualization & debugging
    type: TransformersQueryClassifier    # Haystack Class name for the component
    params:
      model_name_or_path: shahrukhx01/question-vs-statement-classifier
  - name: TextFileConverter
    type: TextConverter
  - name: PDFFileConverter
    type: PDFToTextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      split_by: word
      split_length: 1000
  - name: FileTypeClassifier
    type: FileTypeClassifier

pipelines:
  - name: query    # a sample extractive-qa Pipeline
    type: Query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    type: Indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFFileConverter


PS this is my first-ever GitHub request, do please point to anything that could have made the request better :)

fvanlitsenburg commented 2 years ago

Screenshot:

image

bogdankostic commented 2 years ago

Hey @fvanlitsenburg! I just tried to replicate your problem with our current master version, and filtering seems to work fine there. We recently refactored filtering in Haystack, so upgrading to the latest version (1.2) might solve your problem.

fvanlitsenburg commented 2 years ago

Thanks Bogdan. I was in fact already using version 1.2 (sorry, my bad), at least according to the version.txt file and pip. However, the Swagger page at :8000/docs reports version 0.1, and judging by application.py in the rest_api folder, I don't think that should be the case.

So I suspect something went wrong when migrating from <1.0 to 1.2, particularly since I do not encounter this problem when I set Haystack up from scratch.
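One way to narrow down such a mismatch is to print the version pip has installed next to the version the running API reports. This is only a sketch under assumptions: the PyPI package name `farm-haystack` and the availability of the OpenAPI document at `/openapi.json` are not stated in the thread, and both helper functions are hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def api_version(openapi_doc):
    """Extract the version string from an OpenAPI document (the JSON behind the Swagger page)."""
    return openapi_doc.get("info", {}).get("version")


# Assumed package name; adjust to whatever `pip list` shows.
print("installed:", installed_version("farm-haystack"))

# Stand-in for the document fetched from http://localhost:8000/openapi.json (assumed path).
print("reported by API:", api_version({"info": {"title": "Haystack-API", "version": "0.1"}}))
```

If the two values differ, the process serving the API may be importing code from a different location than the one pip reports, for example a stale checkout left over from an earlier installation.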

Some other things I tried:

I will try a few more things and then let you know if I find a solution

image

image

tstadel commented 2 years ago

@fvanlitsenburg what you could also try is comparing the index mappings of a working (from scratch) index and a non-working index. Simply open localhost:9200/INDEX_NAME in a browser on the machine where you're running Haystack. Maybe you will find some clues about the problematic property.
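The comparison can also be scripted once both mapping responses have been saved as JSON. The sketch below assumes the Elasticsearch 7-style response layout, in which `properties` sits directly under `mappings`; the index name `document` and both sample mappings are hypothetical illustrations, not data from this thread.

```python
def field_types(mapping_response, index_name):
    """Map each top-level property of an index to its Elasticsearch type."""
    properties = mapping_response[index_name]["mappings"].get("properties", {})
    # Properties without an explicit "type" are object fields.
    return {name: spec.get("type", "object") for name, spec in properties.items()}


def diff_field_types(working, broken):
    """Return {field: (type_in_working, type_in_broken)} for fields whose types differ."""
    fields = sorted(set(working) | set(broken))
    return {
        field: (working.get(field), broken.get(field))
        for field in fields
        if working.get(field) != broken.get(field)
    }


# Hypothetical responses from localhost:9200/INDEX_NAME on each installation.
working_response = {"document": {"mappings": {"properties": {
    "content": {"type": "text"},
    "areaLevel2": {"type": "keyword"},
}}}}
broken_response = {"document": {"mappings": {"properties": {
    "content": {"type": "text"},
    "areaLevel2": {"type": "text"},
}}}}

print(diff_field_types(
    field_types(working_response, "document"),
    field_types(broken_response, "document"),
))
# {'areaLevel2': ('keyword', 'text')}
```

Fields that appear in only one index show up with `None` on the other side, which also highlights properties that were never indexed.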

ZanSara commented 2 years ago

Hey @fvanlitsenburg, did you manage to get the filters to work eventually? If you can share some insights on how you debugged this, I believe it could help a lot of other people migrating from 1.0!

fvanlitsenburg commented 2 years ago

Hi Sara,

Thanks! To be honest, I opted for the path of least resistance and installed Haystack from scratch...

Best, Felix

On Mon, Mar 21, 2022 at 11:35 AM Sara Zan @.***> wrote:

Hey @fvanlitsenburg, did you manage to get the filters to work eventually? If you can share some insights on how you debugged this, I believe it could help a lot of other people migrating from 1.0!


SjSnowball commented 2 years ago

@fvanlitsenburg Just a suggestion: make sure the "areaLevel2" field is indexed as the "keyword" type in Elasticsearch. As @tstadel said, you should check your mapping at localhost:9200/INDEX_NAME/_mapping
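Given the JSON returned by the `_mapping` endpoint, the check for a single field could look like the sketch below. It assumes the Elasticsearch 7-style layout (`properties` directly under `mappings`); `keyword_path`, the index name `document`, and the sample mapping are hypothetical.

```python
def keyword_path(mapping_response, index_name, field):
    """Return the field path usable for exact-match filtering, or None.

    Returns `field` if it is mapped as keyword, `field.keyword` if it is a
    text field with a keyword sub-field, and None otherwise.
    """
    properties = mapping_response[index_name]["mappings"].get("properties", {})
    spec = properties.get(field)
    if spec is None:
        return None  # the field is not in the mapping at all
    if spec.get("type") == "keyword":
        return field
    sub_field = spec.get("fields", {}).get("keyword", {})
    if sub_field.get("type") == "keyword":
        return f"{field}.keyword"
    return None


# Hypothetical mapping: areaLevel2 was indexed as text with a keyword sub-field.
sample_mapping = {"document": {"mappings": {"properties": {
    "areaLevel1": {"type": "keyword"},
    "areaLevel2": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    "summary": {"type": "text"},
}}}}

print(keyword_path(sample_mapping, "document", "areaLevel1"))  # areaLevel1
print(keyword_path(sample_mapping, "document", "areaLevel2"))  # areaLevel2.keyword
print(keyword_path(sample_mapping, "document", "summary"))     # None
```

A result of `None` for a field used in a filter suggests that exact-value filtering on it will not behave as expected, because `text` fields are matched against analyzed tokens rather than the full stored value.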

tstadel commented 2 years ago

I'm closing this, as it seems to be solved for now.