Open sacovo opened 1 week ago
@sacovo I tested your PR and got this error:
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://www.googleapis.com/drive/v3/files/my-file-name/export?mimeType=application%2Fvnd.openxmlformats-officedocument.presentationml.presentation returned "This file cannot be exported by the user.". Details: "[{'message': 'This file cannot be exported by the user.', 'domain': 'global', 'reason': 'cannotExportFile'}]">
Have you tested it on your end?
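To tell this failure apart from other 403s, the `reason` field can be pulled out of the details list shown in the error. The following is a minimal sketch, not part of the PR: `drive_error_reason` is a hypothetical helper name, and it only assumes the details arrive as a list of dicts shaped like the one quoted above. In recent versions of google-api-python-client that list appears to be exposed as `error_details` on the `HttpError`, but treat that as an assumption to verify.

```python
from typing import Any, Optional


def drive_error_reason(details: Any) -> Optional[str]:
    """Return the first 'reason' in a Drive error details list, or None.

    Hypothetical helper: assumes `details` is a list of dicts like the
    one quoted in the error above.
    """
    if not isinstance(details, list):
        return None
    for item in details:
        if isinstance(item, dict) and isinstance(item.get("reason"), str):
            return item["reason"]
    return None


# The details payload quoted in the error above.
details = [{
    "message": "This file cannot be exported by the user.",
    "domain": "global",
    "reason": "cannotExportFile",
}]
print(drive_error_reason(details))  # cannotExportFile
```

A connector could use the returned reason to decide whether to skip the file or surface the error, rather than treating every 403 the same way.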
I didn't run it directly in the backend, but I checked that the method works when supplying ids. This works for me:
from google.oauth2.service_account import Credentials
from googleapiclient import discovery  # type: ignore


def main():
    credentials = Credentials.from_service_account_file('...')
    service = discovery.build("drive", "v3", credentials=credentials)
    files = service.files()
    file_id = "..."

    print(files.get(fileId=file_id).execute())
    # {'kind': 'drive#file', 'id': '...', 'name': '...', 'mimeType': 'application/vnd.google-apps.presentation'}

    content = files.export(fileId=file_id, mimeType="application/vnd.openxmlformats-officedocument.presentationml.presentation").execute()
    print(content[:30])  # Some binary data

    try:
        files.get_media(fileId=file_id).execute()
    except Exception as ex:
        print(ex)  # HttpError 403: Only files with binary content can be downloaded. Use Export with Docs Editors files.


if __name__ == "__main__":
    main()
Do you get different output?
I'm not so sure. After running with this new logic, I reindexed all the files in Google Drive, but all the presentations are being marked as ignore_for_qa, which indicates that Danswer can't extract the text from these files. I will need to set up a debug script similar to yours to see what my problem is.
I tested on my side, and one of my problems is that it usually hits this error:
[{'message': 'This file is too large to be exported.', 'domain': 'global', 'reason': 'exportSizeLimitExceeded'}]
It turns out that when exporting a Google spreadsheet, the exported file is usually large. I found a relevant article on working around this limitation: https://stackoverflow.com/questions/40890534/google-drive-rest-api-files-export-limitation . I'll test it to see if it works.
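One workaround that is often described for `exportSizeLimitExceeded` is to download through the `exportLinks` URLs in the file metadata instead of calling `files.export`. The sketch below illustrates that idea only. It is an assumption that this matches the linked answer and that it avoids the limit in this case; neither has been verified here. `pick_export_link` and `download_export` are hypothetical names, and `session` is assumed to be any requests-style object whose `get(url)` returns a response with `raise_for_status()` and `content`.

```python
from typing import Any, Mapping, Optional

PPTX_MIME = (
    "application/vnd.openxmlformats-officedocument.presentationml.presentation"
)


def pick_export_link(metadata: Mapping[str, Any], mime_type: str) -> Optional[str]:
    """Return the export URL for mime_type from Drive file metadata, if present."""
    links = metadata.get("exportLinks") or {}
    return links.get(mime_type)


def download_export(
    session: Any, metadata: Mapping[str, Any], mime_type: str
) -> Optional[bytes]:
    """Fetch the exported bytes via the export link, or None if no link exists.

    `session` is assumed to be an authorized, requests-style session.
    """
    url = pick_export_link(metadata, mime_type)
    if url is None:
        return None
    response = session.get(url)
    response.raise_for_status()  # surface HTTP errors instead of returning bad data
    return response.content
```

With the script earlier in the thread, the metadata could come from `files.get(fileId=file_id, fields="exportLinks").execute()`, and the session could be `google.auth.transport.requests.AuthorizedSession(credentials)`, provided the credentials carry a Drive scope.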
Fixes #1664 by exporting the presentation as a pptx file.