ninalopatina commented 4 months ago

Describe the bug

Google Docs/Sheets/Slides not working in the V2 SDK Google Drive source connector

To Reproduce

Ingesting from Google Drive, partitioning via the Unstructured API, embedding via OpenAI, and writing to AstraDB

runner = GoogleDriveRunner(
    processor_config=ProcessorConfig(
        verbose=True,
        output_dir=os.environ['GOOGLE_DRIVE_OUTPUT'],
        num_processes=2,
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY")
    ),
    connector_config=SimpleGoogleDriveConfig(
        access_config=GoogleDriveAccessConfig(
            service_account_key=os.getenv("GOOGLE_DRIVE_ACCOUNT_KEY")
        ),
        recursive=True,
        drive_id=os.getenv("GOOGLE_DRIVE_FOLDER_ID"),
    ),
    chunking_config=ChunkingConfig(chunk_elements=True),
    embedding_config=EmbeddingConfig(
        provider="langchain-openai",
        api_key=os.getenv("OPENAI_API_KEY"),
    ),
    writer=get_writer(),
    writer_kwargs={},
)

Expected behavior

As in V1, I expect the file to be parsed.
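For context on what parsing these files involves: Google-native files (Docs, Sheets, Slides) have no binary content of their own, so the Drive API serves them through `files.export` with a target MIME type rather than through a direct download. A minimal sketch of picking that target, assuming Office formats are the desired export; `EXPORT_MIME_TYPES` and `export_mime_type` are hypothetical names, not part of the connector:

```python
from typing import Optional

# Google Workspace MIME types mapped to an Office export format.
# The choice of export targets is an assumption made for illustration.
EXPORT_MIME_TYPES = {
    "application/vnd.google-apps.document":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.google-apps.spreadsheet":
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.google-apps.presentation":
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def export_mime_type(mime_type: str) -> Optional[str]:
    """Return the export MIME type for a Google-native file, or None otherwise."""
    return EXPORT_MIME_TYPES.get(mime_type)
```

A `None` result here means the file can be downloaded as-is; anything else would need an export step first.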

Screenshots

KeyError                                  Traceback (most recent call last)
in ()
     33     stager_config=WeaviateUploadStagerConfig(),
     34     uploader_config=WeaviateUploaderConfig(),
---> 35 ).run()

7 frames
/usr/local/lib/python3.10/dist-packages/unstructured/ingest/v2/processes/connectors/google_drive.py in map_file_data(f)
    131     file_id = f["id"]
    132     filename = f.pop("name")
--> 133     url = f.pop("webContentLink")
    134     version = f.pop("version", None)
    135     permissions = f.pop("permissions", None)

KeyError: 'webContentLink'

Environment Info

This doesn't only happen in my env; it also happens for anyone else who tries this snippet.
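A likely explanation for the `KeyError`: the Drive API only populates `webContentLink` for files with binary content, so entries for Google Docs, Sheets, and Slides omit that key and an unconditional `f.pop("webContentLink")` raises. A minimal sketch of a tolerant mapping, assuming that falling back to `webViewLink` is acceptable; `map_file_data_tolerant` and the shape of the returned dict are hypothetical and are not the connector's actual code or its eventual fix:

```python
from typing import Any, Optional


def map_file_data_tolerant(f: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical variant of the mapping step shown in the traceback."""
    f = dict(f)  # work on a copy so the caller's record is not mutated
    file_id = f["id"]
    filename = f.pop("name")
    # Google-native files carry no webContentLink; fall back to webViewLink,
    # and to None when neither key is present.
    url: Optional[str] = f.pop("webContentLink", None) or f.get("webViewLink")
    version = f.pop("version", None)
    permissions = f.pop("permissions", None)
    return {
        "id": file_id,
        "filename": filename,
        "url": url,
        "version": version,
        "permissions": permissions,
    }
```

This only avoids the crash during indexing; a Google-native file would still need to be exported before it can be downloaded and partitioned.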

adrian-ciz-intive commented 3 months ago

Same thing happens to me when trying to parse a Word document on Google Drive, about 30 pages long, with tables, images, a TOC, header, footer, etc.

SantoshKumarRavi commented 3 days ago

Is anyone else getting this issue with Google Drive v2 ingestion?

2024-11-19 22:12:47,003 SpawnProcess-18 ERROR C:\Users\SANTHOSH\.cache\unstructured\ingest\pipeline\index\34b4026053f1.json: [download] 'GoogleDriveDownloader' object has no attribute 'meta'

micmarty-deepsense commented 3 days ago

Thanks for reporting that @SantoshKumarRavi! It's a bug (see this line). We need to prepare a fix for that.

Unstructured-IO / unstructured-ingest

Google Docs/Sheets/Slides not working in the V2 SDK Google Drive source connector #74
