Libr-AI / OpenFactVerification

Loki: Open-source solution designed to automate the process of verifying factuality
https://loki.librai.tech/
MIT License

Resource punkt not found with llama3 #14

Closed: shuther closed this 7 months ago

shuther commented 7 months ago

I tried the project with llama3 using:

poetry run python -m factcheck --modal string --input "MBZUAI is the first AI university in the world" --client local_openai --model llama3 --prompt factcheck/config/sample_prompt.yaml

but I end up with a library that isn't set up. Not sure if that's expected?

[2024-04-20 10:42:20 - httpx:1026 - INFO] HTTP Request: POST http://linuxmain.local:4000/chat/completions "HTTP/1.1 200 OK"
[2024-04-20 10:42:20 - openai._base_client:986 - DEBUG] HTTP Request: POST http://linuxmain.local:4000/chat/completions "200 OK"
[ERROR]2024-04-20 10:42:20,511 Decompose.py:60: Parse LLM response error eval() arg 1 must be a string, bytes or code object, response is: None
[2024-04-20 10:42:20 - FactCheck:60 - ERROR] Parse LLM response error eval() arg 1 must be a string, bytes or code object, response is: None
[ERROR]2024-04-20 10:42:20,511 Decompose.py:61: Parse LLM response error, prompt is: [[{'role': 'system', 'content': 'You are a helpful assistant designed to output JSON.'}, {'role': 'user', 'content': 'Your task is to decompose the text into atomic claims.\nThe answer should be a JSON with a single key "claims", with the value of a list of strings, where each string should be a context-independent claim, representing one fact.\nNote that:\n1. Each claim should be concise (less than 15 words) and self-contained.\n2. Avoid vague references like \'he\', \'she\', \'it\', \'this\', \'the company\', \'the man\' and using complete names.\n3. Generate at least one claim for each single sentence in the texts.\n\nFor example,\nText: Mary is a five-year old girl, she likes playing piano and she doesn\'t like cookies.\nOutput:\n{"claims": ["Mary is a five-year old girl.", "Mary likes playing piano.", "Mary doesn\'t like cookies."]}\n\nText: MBZUAI is the first AI university in the world\nOutput:'}]]
[2024-04-20 10:42:20 - FactCheck:61 - ERROR] Parse LLM response error, prompt is: [[{'role': 'system', 'content': 'You are a helpful assistant designed to output JSON.'}, {'role': 'user', 'content': 'Your task is to decompose the text into atomic claims.\nThe answer should be a JSON with a single key "claims", with the value of a list of strings, where each string should be a context-independent claim, representing one fact.\nNote that:\n1. Each claim should be concise (less than 15 words) and self-contained.\n2. Avoid vague references like \'he\', \'she\', \'it\', \'this\', \'the company\', \'the man\' and using complete names.\n3. Generate at least one claim for each single sentence in the texts.\n\nFor example,\nText: Mary is a five-year old girl, she likes playing piano and she doesn\'t like cookies.\nOutput:\n{"claims": ["Mary is a five-year old girl.", "Mary likes playing piano.", "Mary doesn\'t like cookies."]}\n\nText: MBZUAI is the first AI university in the world\nOutput:'}]]
[INFO]2024-04-20 10:42:20,511 Decompose.py:63: It does not output a list of sentences correctly, return self.doc2sent_tool split results.
[2024-04-20 10:42:20 - FactCheck:63 - INFO] It does not output a list of sentences correctly, return self.doc2sent_tool split results.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/shuther/Documents/Projects/OpenFactVerification/factcheck/__main__.py", line 45, in <module>
    check(args)
  File "/home/shuther/Documents/Projects/OpenFactVerification/factcheck/__main__.py", line 30, in check
    res = factcheck.check_response(content)
  File "/home/shuther/Documents/Projects/OpenFactVerification/factcheck/__init__.py", line 76, in check_response
    claims = self.decomposer.getclaims(doc=response)
  File "/home/shuther/Documents/Projects/OpenFactVerification/factcheck/core/Decompose.py", line 64, in getclaims
    claims = self.doc2sent(doc)
  File "/home/shuther/Documents/Projects/OpenFactVerification/factcheck/core/Decompose.py", line 29, in _nltk_doc2sent
    sentences = nltk.sent_tokenize(text)
  File "/home/shuther/.cache/pypoetry/virtualenvs/openfactverification-3iFqEQnw-py3.10/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
  File "/home/shuther/.cache/pypoetry/virtualenvs/openfactverification-3iFqEQnw-py3.10/lib/python3.10/site-packages/nltk/data.py", line 750, in load
    opened_resource = _open(resource_url)
  File "/home/shuther/.cache/pypoetry/virtualenvs/openfactverification-3iFqEQnw-py3.10/lib/python3.10/site-packages/nltk/data.py", line 876, in _open
    return find(path_, path + [""]).open()
  File "/home/shuther/.cache/pypoetry/virtualenvs/openfactverification-3iFqEQnw-py3.10/lib/python3.10/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/home/shuther/nltk_data'
    - '/home/shuther/.cache/pypoetry/virtualenvs/openfactverification-3iFqEQnw-py3.10/nltk_data'
    - '/home/shuther/.cache/pypoetry/virtualenvs/openfactverification-3iFqEQnw-py3.10/share/nltk_data'
    - '/home/shuther/.cache/pypoetry/virtualenvs/openfactverification-3iFqEQnw-py3.10/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
shuther commented 7 months ago

I followed the steps from https://www.nltk.org/data.html:

poetry run python

>>> import nltk
>>> nltk.download()

Then I picked the popular collection; not sure if that's what's needed?
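For what it's worth, the traceback only complains about the punkt tokenizer, so downloading just that resource (inside the poetry virtualenv) should be enough; the full popular collection isn't required. A minimal sketch:

```python
# Download only the punkt sentence tokenizer if it is missing.
# Run this inside the project's poetry environment (poetry run python ...).
import nltk

try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt")
```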

haonan-li commented 7 months ago

Hi, the reason nltk is triggered at this step is that your llm_client (llama3) does not return the string in the required format. So I believe that even if you solve this nltk issue, it will fail again at other steps.

Basically, at each step we require the LLM to return a JSON-loadable string. However, we have observed that most models currently do not have this ability, so you need to ensure it by writing a post-process function in your model definition. Please see the documentation here: https://github.com/Libr-AI/OpenFactVerification/blob/dev/docs/development_guide.md#new-llm-support
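To illustrate the idea (the exact hook name and signature to use are described in the development guide linked above, so `extract_json` here is a hypothetical helper, not the project's API): local models often wrap their JSON in markdown fences or surrounding prose, and a post-process step can strip that before parsing. A minimal sketch:

```python
import json
import re


def extract_json(response: str) -> dict:
    """Best-effort extraction of a JSON object from a raw model reply.

    Hypothetical helper: local models like llama3 often wrap JSON in
    ```json fences or add prose around it, which breaks a plain
    json.loads on the full reply.
    """
    if response is None:
        raise ValueError("empty model response")
    # Drop markdown code fences such as ```json ... ```
    text = re.sub(r"```(?:json)?", "", response).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost brace-delimited span.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start : end + 1])
```

A post-process function along these lines, registered as the development guide describes, would turn a reply like `Here you go: {"claims": [...]}` into a dict the decomposer can consume instead of `None`.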