airbytehq / PyAirbyte

PyAirbyte brings the power of Airbyte to every Python developer.
https://docs.airbyte.com/pyairbyte

Error dealing with pdf files #436

Open soham-aiplanet opened 4 weeks ago

soham-aiplanet commented 4 weeks ago

This is my code:

import json

import airbyte as ab


def process_google_drive_documents(folder_url: str, service_account_cred: dict):
    source = ab.get_source(
        "source-google-drive",
        config={
            "folder_url": folder_url,
            "credentials": {
                "auth_type": "Service",
                "service_account_info": json.dumps(service_account_cred),
            },
            "streams": [
                {
                    "name": "pdf_loader_stream",
                    "globs": ["**"],
                    "format": {"filetype": "unstructured"},
                }
            ],
        },
    )

    source.check()
    source.select_all_streams()
    read_result = source.read()

And here's the error - [Document(page_content='', metadata={'_ab_source_file_last_modified': '2023-11-28T19:43:49.000000Z', '_ab_source_file_url': 'TermPaper.docx', 'document_key': 'TermPaper.docx', '_ab_source_file_parse_error': "Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. Contact Support if you need assistance.\nfilename=TermPaper.docx message=\n**\n Resource \x1b[93mpunkt_tab\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('punkt_tab')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mtokenizers/punkt_tab/english/\x1b[0m\n\n Searched in:\n - '/home/soham/nltk_data'\n - '/home/soham/work/apps/tryouts/gdrive_integration/.venv-source-google-drive/nltk_data'\n - '/home/soham/work/apps/tryouts/gdrive_integration/.venv-source-google-drive/share/nltk_data'\n - '/home/soham/work/apps/tryouts/gdrive_integration/.venv-source-google-drive/lib/nltk_data'\n - '/usr/share/nltk_data'\n - '/usr/local/share/nltk_data'\n - '/usr/lib/nltk_data'\n - '/usr/local/lib/nltk_data'\n**\n", '_airbyte_raw_id': '01JAQ6ZEB720CS3BNHYVMKFQEC', '_airbyte_extracted_at': datetime.datetime(2024, 10, 21, 9, 36, 50, 530000), '_airbyte_meta': {}, 'last_modified': '2024-10-21T15:06:52.694685'})]

Any idea how to resolve this?

pinaak-goel commented 3 weeks ago

Hi @soham-aiplanet! The error comes from a missing resource in the Natural Language Toolkit (NLTK) library. The parser needs the punkt_tab tokenizer data to process the text in the document, but that resource is not present in any of the NLTK data directories listed in the error message. To resolve this, download punkt_tab by running:

import nltk
nltk.download('punkt_tab')

After downloading it, run the code again and the error should be resolved. Please let me know if this works or if you run into any other issues.
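
If more than one file fails to parse, it may help to collect the missing resource names straight from the `_ab_source_file_parse_error` metadata before downloading anything. The sketch below is only an illustration: the helper names `missing_nltk_resources` and `download_nltk_resources` are hypothetical (they are not part of PyAirbyte or NLTK), and it assumes the error text keeps the "Resource ... not found" wording shown above. It also assumes the download lands in a directory the connector searches, such as the `~/nltk_data` path listed in the error.

```python
import re

# NLTK wraps the resource name in ANSI color codes (the \x1b[93m ... \x1b[0m above).
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
# Assumption: the message keeps the "Resource <name> not found" wording.
MISSING_RESOURCE = re.compile(r"Resource\s+(\S+)\s+not found")


def missing_nltk_resources(metadata_records):
    """Return the sorted NLTK resource names reported missing in parse errors."""
    names = set()
    for metadata in metadata_records:
        error = metadata.get("_ab_source_file_parse_error")
        if not error:
            continue  # this record parsed without an error
        cleaned = ANSI_ESCAPE.sub("", error)
        names.update(MISSING_RESOURCE.findall(cleaned))
    return sorted(names)


def download_nltk_resources(names):
    """Download each named resource; returns {name: True/False} from nltk.download."""
    import nltk  # imported lazily so the parsing helper works without NLTK installed

    return {name: nltk.download(name, quiet=True) for name in names}
```

With documents like the one in the error output, something like `download_nltk_resources(missing_nltk_resources(doc.metadata for doc in docs))` would fetch whatever the parser reported as missing, where `docs` stands for the list of `Document` objects you already have.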