deepset-ai / deepset-cloud-sdk

A Python SDK to interact with deepset Cloud
Apache License 2.0
9 stars 2 forks source link

Fix: enable non .txt file format uploads when using direct upload #192

Closed rjanjua closed 5 months ago

rjanjua commented 5 months ago

Related Issues

Proposed Changes?

Add a new function in the sync and async clients - upload_bytes, that allows the upload of any supported formats. Updated some internal implementations of the sdk to enable this without replicating the same code for different formats.

How did you test it?

Notes for the reviewer

Screenshots (optional)

Checklist

sjrl commented 5 months ago

Thanks for this @rjanjua! One clarification question I wanted to ask is what does the lt 20 mean in the title?

rjanjua commented 5 months ago

Thanks for this @rjanjua! One clarification question I wanted to ask is what does the lt 20 mean in the title?

@sjrl Ah sorry, updated the title to make more sense - lt = less than.

sjrl commented 5 months ago

Ah sorry, updated the title to make more sense - lt = less than

Thanks! Does this mean that the upload of DeepsetCloudFile and DeepsetCloudFileBytes only works when you are uploading less than 20 at a time?

rjanjua commented 5 months ago

Ah sorry, updated the title to make more sense - lt = less than

Thanks! Does this mean that the upload of DeepsetCloudFile and DeepsetCloudFileBytes only works when you are uploading less than 20 at a time?

uploading non text files already works for batches larger than 20 files. When doing less than 20 files we use direct uploads via the API, but for larger batches we use the upload sessions. This was implemented in a previous PR, but it seems like the case of less than 20 files was missed somehow.

sjrl commented 5 months ago

Thanks for the information.

To follow-up on this point.

uploading non text files already works for batches larger than 20 files.

I'm still a little confused. Could you provide a code snippet of how when using a batch size larger than 20 files that we could directly upload using DeepsetCloudFiles? Or are you saying the DeepsetCloudFiles is only used when doing direct uploads?

I guess I'm coming from the perspective of using the the python function directly and not the CLI, so from what I can tell your function upload_bytes would allow me to upload an arbitrary number of DeepsetCloudFileBytes, which wasn't possible before right?

rjanjua commented 5 months ago

I created the class DeepsetCloudFileBytes only because the DeepsetCloudFile class has a field called 'text'. Technically you can populate that field with a bytes object if you so wish, but the naming is confusing. This is why I created the new class with a bytes field instead of a text field. This makes it clear to the user which to use where, even though the underlying way both are handled is essentially the same.

Previously the reason you couldn't upload a single pdf file, for example, was because the direct upload was explicitly checking that files were text files before uploading them, now we allow any file with an extension that is allowed in dC to be uploaded.

However, when uploading batches of files greater than 20 files, we were uploading via upload sessions, which was checking that the extension was one of the allowed dC extensions. For this reason you could have use upload_texts to upload a pdf file, but only if it was in a batch of more than 20 files.

To reiterate, I created the DeepsetCloudFileBytes class, and upload_bytes so that it makes more sense to a user when uploading non text files, even though it was technically possible to di this with upload_texts for more than 20 files.

sjrl commented 5 months ago

Ahh okay, thanks a lot for the in depth explanation @rjanjua!

rjanjua commented 5 months ago

Ahh okay, thanks a lot for the in depth explanation @rjanjua!

If you want to see a demonstration of what is happening in the problem version of the API check out this colab, if you remove 1 element from the list of files you will see that it does not work, but with 21 files it does.

github-actions[bot] commented 5 months ago

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  deepset_cloud_sdk
  models.py 34, 41, 83-84, 92
  deepset_cloud_sdk/_api
  files.py
  deepset_cloud_sdk/_s3
  upload.py
  deepset_cloud_sdk/_service
  files_service.py
  deepset_cloud_sdk/_utils
  constants.py
  deepset_cloud_sdk/workflows/async_client
  files.py 272-273
  deepset_cloud_sdk/workflows/sync_client
  files.py 198
Project Total  

This report was generated by python-coverage-comment-action