danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

Export Google Presentation to pptx instead of using get_media #1665

Open sacovo opened 1 week ago

sacovo commented 1 week ago

Fixes #1664 by exporting the presentation as a pptx file.

onimsha commented 1 week ago

@sacovo I tested your PR and got this error:

googleapiclient.errors.HttpError: <HttpError 403 when requesting https://www.googleapis.com/drive/v3/files/my-file-name/export?mimeType=application%2Fvnd.openxmlformats-officedocument.presentationml.presentation returned "This file cannot be exported by the user.". Details: "[{'message': 'This file cannot be exported by the user.', 'domain': 'global', 'reason': 'cannotExportFile'}]">

Have you tested it on your end?
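
One thing that may be worth checking for the cannotExportFile case: Drive file metadata carries a capabilities.canDownload flag, and a file whose owner has disabled downloading for viewers would plausibly fail on export in this way. The sketch below is only an assumption about the cause, not something confirmed in this thread. The helper name can_export is hypothetical, and the metadata dict is assumed to come from files.get(fileId=..., fields="id, name, mimeType, capabilities/canDownload").

```python
from typing import Any, Mapping


def can_export(metadata: Mapping[str, Any]) -> bool:
    """Return True unless Drive reports that the caller may not download the file.

    Hypothetical helper: `metadata` is assumed to be the dict returned by
    files.get(...) with the capabilities/canDownload field requested.
    """
    capabilities = metadata.get("capabilities") or {}
    # A missing flag is treated as "allowed", so files fetched without
    # the field are not skipped by accident.
    return bool(capabilities.get("canDownload", True))


# Illustrative metadata dicts (not real API responses).
restricted = {"id": "...", "capabilities": {"canDownload": False}}
unrestricted = {"id": "...", "capabilities": {"canDownload": True}}
print(can_export(restricted), can_export(unrestricted))
```

If the flag turns out to be false for the failing files, the connector could skip them before calling export instead of raising.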

sacovo commented 1 week ago

I didn't run it directly in the backend, but I checked that the method works when supplying ids. This works for me:

from google.oauth2.service_account import Credentials
from googleapiclient import discovery  # type: ignore

def main():
    credentials = Credentials.from_service_account_file('...')

    service = discovery.build("drive", "v3", credentials=credentials)

    files = service.files()

    file_id = "..."

    print(files.get(fileId=file_id).execute())
    # {'kind': 'drive#file', 'id': '...', 'name': '...', 'mimeType': 'application/vnd.google-apps.presentation'}

    content = files.export(fileId=file_id, mimeType="application/vnd.openxmlformats-officedocument.presentationml.presentation").execute()

    print(content[:30]) # Some binary data

    try:
        files.get_media(fileId=file_id).execute()
    except Exception as ex:
        print(ex)  # HttpError 403: Only files with binary content can be downloaded. Use Export with Docs Editors files.

if __name__ == "__main__":
    main()

Do you get different output?

onimsha commented 1 week ago

I'm not so sure. After running with this new logic, I reindexed all the files in Google Drive, but all the presentations are being marked as ignore_for_qa, which indicates that Danswer can't extract the text from these files. I will need to set up a debug script similar to yours to see what my problem is.

onimsha commented 1 week ago

I tested on my side, and one of my problems is that it usually hits this error:

[{'message': 'This file is too large to be exported.', 'domain': 'global', 'reason': 'exportSizeLimitExceeded'}]

It turns out that when exporting a Google spreadsheet, the exported file is usually large. I found a relevant article on working around this limitation: https://stackoverflow.com/questions/40890534/google-drive-rest-api-files-export-limitation . I'll test it to see if it works.
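
A minimal sketch of one possible workaround, assuming it relies on the exportLinks field of the Drive file metadata rather than files.export: request the metadata with fields="exportLinks", pick the URL for the target MIME type, and fetch it with an authorized HTTP session. The helper names here are hypothetical, the session is assumed to be a requests-style object such as google.auth.transport.requests.AuthorizedSession(credentials), and whether this avoids exportSizeLimitExceeded for every file is untested.

```python
from typing import Any, Mapping, Optional

PPTX_MIME = "application/vnd.openxmlformats-officedocument.presentationml.presentation"


def pick_export_link(metadata: Mapping[str, Any], mime_type: str) -> Optional[str]:
    """Return the export URL for `mime_type`, or None if Drive offers none.

    `metadata` is assumed to be the dict returned by
    files.get(fileId=..., fields="exportLinks").
    """
    export_links = metadata.get("exportLinks") or {}
    return export_links.get(mime_type)


def download_via_export_link(
    session: Any, metadata: Mapping[str, Any], mime_type: str
) -> Optional[bytes]:
    """Download the exported file through its export link.

    `session` is assumed to be an authorized, requests-style session
    exposing get(url) and returning a response with raise_for_status()
    and a .content attribute.
    """
    url = pick_export_link(metadata, mime_type)
    if url is None:
        # No export link for this format, so the caller should fall back or skip.
        return None
    response = session.get(url)
    response.raise_for_status()
    return response.content
```

Keeping the link selection separate from the download makes it possible to check which formats a file offers without making a second request.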