Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0
6.44k stars, 618 forks

Fixing crash due to `None` author #650

Closed. jamesbraza closed this 3 weeks ago

jamesbraza commented 3 weeks ago

Seen last night when indexing 19k PDFs:

    | Traceback (most recent call last):
    |   File "/path/to/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 487, in process_file
    |     await tmp_docs.aadd(
    |   File "/path/to/.venv/lib/python3.12/site-packages/paperqa/docs.py", line 364, in aadd
    |     doc = await metadata_client.upgrade_doc_to_doc_details(
    |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/path/to/.venv/lib/python3.12/site-packages/paperqa/clients/__init__.py", line 207, in upgrade_doc_to_doc_details
    |     0 if not extra_fields else DocDetails(**extra_fields)
    |                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/path/to/.venv/lib/python3.12/site-packages/pydantic/main.py", line 212, in __init__
    |     validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
    |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/path/to/.venv/lib/python3.12/site-packages/paperqa/types.py", line 577, in validate_all_fields
    |     data = cls.remove_invalid_authors(data)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/path/to/.venv/lib/python3.12/site-packages/paperqa/types.py", line 458, in remove_invalid_authors
    |     a for a in authors if a.lower() not in cls.AUTHOR_NAMES_TO_REMOVE
    |                           ^^^^^^^
    | AttributeError: 'NoneType' object has no attribute 'lower'
    +------------------------------------
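
The last frame shows `a.lower()` being called on every entry of `authors`, so a single `None` in that list is enough to raise. One possible guard is sketched below, assuming the validator receives a plain `dict` with an optional `authors` list. This is a standalone illustration, not the actual paper-qa fix: the function here is a free function rather than the `DocDetails` classmethod, and the contents of `AUTHOR_NAMES_TO_REMOVE` are placeholders, since the traceback does not show them.

```python
from typing import Any

# Hypothetical stand-in for DocDetails.AUTHOR_NAMES_TO_REMOVE; the real
# values are not visible in the traceback above.
AUTHOR_NAMES_TO_REMOVE = {"et al", "et al."}


def remove_invalid_authors(data: dict[str, Any]) -> dict[str, Any]:
    """Drop placeholder and non-string authors without raising on None."""
    authors = data.get("authors")
    if authors:
        data["authors"] = [
            a
            for a in authors
            # Skip None (or any other non-string) before calling .lower()
            if isinstance(a, str) and a.lower() not in AUTHOR_NAMES_TO_REMOVE
        ]
    return data


cleaned = remove_invalid_authors({"authors": ["Jane Doe", None, "et al"]})
print(cleaned["authors"])
```

Filtering with `isinstance(a, str)` silently discards malformed entries. Whether that is preferable to logging or rejecting the record is a design choice this sketch does not settle.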