Closed: rjanjua closed this 5 months ago.
Thanks for this @rjanjua! One clarification question I wanted to ask: what does the "lt 20" in the title mean?
@sjrl Ah sorry, updated the title to make more sense: lt = less than.
Thanks! Does this mean that the upload of DeepsetCloudFile and DeepsetCloudFileBytes only works when you are uploading less than 20 at a time?
Uploading non-text files already works for batches larger than 20 files. When doing fewer than 20 files we use direct uploads via the API, but for larger batches we use upload sessions. This was implemented in a previous PR, but it seems the case of fewer than 20 files was missed somehow.
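Roughly, the dispatch looks like this (a simplified sketch of the behavior described above; the constant and helper names are illustrative, not the SDK's actual identifiers):

```python
# Simplified sketch of the batch-size dispatch; names are illustrative only.
import asyncio
from typing import List

DIRECT_UPLOAD_THRESHOLD = 20  # batches smaller than this use direct uploads


async def direct_upload(file: bytes) -> None:
    """Stand-in for a direct upload of one file via the API."""
    await asyncio.sleep(0)  # placeholder for the HTTP call


async def session_upload(files: List[bytes]) -> None:
    """Stand-in for an upload session that handles a whole batch."""
    await asyncio.sleep(0)  # placeholder for the session workflow


async def upload(files: List[bytes]) -> None:
    if len(files) < DIRECT_UPLOAD_THRESHOLD:
        # Small batches: direct uploads. Before this PR, this path only
        # accepted text files, which is the bug being fixed here.
        for file in files:
            await direct_upload(file)
    else:
        # Large batches: upload sessions, which already accepted any
        # extension allowed in deepset Cloud.
        await session_upload(files)
```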
Thanks for the information. To follow up on this point:

"Uploading non-text files already works for batches larger than 20 files."
I'm still a little confused. Could you provide a code snippet showing how, with a batch larger than 20 files, we could upload directly using DeepsetCloudFiles? Or are you saying DeepsetCloudFiles is only used for direct uploads?
I guess I'm coming from the perspective of using the Python function directly rather than the CLI. From what I can tell, your function upload_bytes would allow me to upload an arbitrary number of DeepsetCloudFileBytes, which wasn't possible before, right?
I created the DeepsetCloudFileBytes class only because the DeepsetCloudFile class has a field called text. Technically you can populate that field with a bytes object if you wish, but the naming is confusing. That is why I created the new class with a bytes field instead of a text field: it makes clear to the user which class to use where, even though both are handled in essentially the same way underneath.
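For example (field naming as described above; treat file_bytes and the exact constructor signatures as assumptions to verify against the SDK):

```python
# Illustrative only: assumes the bytes field is called file_bytes.
from deepset_cloud_sdk.models import DeepsetCloudFile, DeepsetCloudFileBytes

# A plain-text file: the text field name matches the content.
text_file = DeepsetCloudFile(name="notes.txt", text="some plain text")

# A binary file such as a PDF: putting bytes into a field called text would
# be misleading, hence the separate class.
with open("report.pdf", "rb") as f:
    pdf_file = DeepsetCloudFileBytes(name="report.pdf", file_bytes=f.read())
```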
Previously, the reason you couldn't upload a single PDF file, for example, was that the direct upload path explicitly checked that files were text files before uploading them. Now we allow any file with an extension that is allowed in dC to be uploaded.
However, when uploading batches of more than 20 files, we were uploading via upload sessions, which only checked that the extension was one of the allowed dC extensions. For this reason you could have used upload_texts to upload a PDF file, but only if it was in a batch of more than 20 files.
To reiterate, I created the DeepsetCloudFileBytes class and upload_bytes so that it makes more sense to a user when uploading non-text files, even though it was technically possible to do this with upload_texts for more than 20 files.
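So after this PR you can do something like the following (a sketch; the exact upload_bytes parameters, such as workspace_name, may differ, so check the SDK signature):

```python
# Sketch of the new entry point; treat the workspace_name parameter and the
# exact signature as assumptions to check against the SDK.
from pathlib import Path

from deepset_cloud_sdk.models import DeepsetCloudFileBytes
from deepset_cloud_sdk.workflows.sync_client.files import upload_bytes

files = [
    DeepsetCloudFileBytes(name=path.name, file_bytes=path.read_bytes())
    for path in Path("docs").glob("*.pdf")
]

# After this PR this works for any batch size, including fewer than 20 files.
upload_bytes(files=files, workspace_name="default")
```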
Ahh okay, thanks a lot for the in-depth explanation @rjanjua!
If you want to see a demonstration of what is happening in the problematic version of the API, check out this colab. If you remove one element from the list of files you will see that it does not work, but with 21 files it does.
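Roughly, the colab does something like this (an illustrative sketch, assuming your deepset Cloud credentials and workspace are configured via environment variables):

```python
# Illustrative sketch of the colab's demonstration; populating the text field
# with bytes is technically possible, as discussed above.
from deepset_cloud_sdk.models import DeepsetCloudFile
from deepset_cloud_sdk.workflows.sync_client.files import upload_texts

with open("sample.pdf", "rb") as f:
    pdf_bytes = f.read()

files = [DeepsetCloudFile(name=f"sample_{i}.pdf", text=pdf_bytes) for i in range(21)]

upload_texts(files=files)       # 21 files: upload session, PDF accepted
upload_texts(files=files[:20])  # 20 files: direct upload, rejected before the fix
```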
Coverage changes:

File                                               Lines missing coverage
deepset_cloud_sdk/models.py                        34, 41, 83-84, 92
deepset_cloud_sdk/_api/files.py
deepset_cloud_sdk/_s3/upload.py
deepset_cloud_sdk/_service/files_service.py
deepset_cloud_sdk/_utils/constants.py
deepset_cloud_sdk/workflows/async_client/files.py  272-273
deepset_cloud_sdk/workflows/sync_client/files.py   198

This report was generated by python-coverage-comment-action
Related Issues
Proposed Changes?
Add a new function in the sync and async clients, upload_bytes, that allows the upload of any supported format. Updated some internal implementations of the SDK to enable this without replicating the same code for different formats.
How did you test it?
Notes for the reviewer
Screenshots (optional)
Checklist