e2b-dev / E2B

Secure open source cloud runtime for AI apps & AI agents
https://e2b.dev/docs
Apache License 2.0

Add multi-file write support to the js and python sdks #451

Open 0div opened 1 month ago

0div commented 1 month ago

Description

Test

# Test js-sdk
cd packages/js-sdk
pnpm test

# Test python-sdk
cd packages/python-sdk
pnpm test
linear[bot] commented 1 month ago

E2B-954 Add a way to upload several files to a sandbox at once

changeset-bot[bot] commented 1 month ago

⚠️ No Changeset found

Latest commit: 86262f1bc62183b0ac969290463b1e5cc06587e6

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets. When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types.


ValentaTomas commented 1 month ago

I pushed an edit that should sketch a type cleanup.

There are two unfinished parts:

  1. It should be possible to typeguard this so there is no type assumption when dealing with the parameters.

    
    const { path, writeOpts, writeFiles } = typeof pathOrFiles === 'string'
      ? { path: pathOrFiles, writeFiles: [{ data: dataOrOpts }], writeOpts: opts }
      : { path: undefined, writeFiles: pathOrFiles, writeOpts: dataOrOpts }
    
    const blobs = await Promise.all(writeFiles.map(f => new Response(f.data).blob()))

2. The problem here was actually there even in the original SDK code — the `EntryInfo` type assumption works because the returned types are exactly of the shape of `EntryInfo`, but I think we should properly type these by creating a mapping. Also, this will not correctly return the array if I just passed an array with one file.

```ts
    return files.length === 1 ? files[0] : files
```

Some edits I did:

Also, your previous code was mostly correct; this is more of an improvement.
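
For reference, a minimal sketch of what point 1 could look like with an explicit type guard instead of a type assertion. `isWriteEntries`, `normalizeWriteArgs`, and `WriteOpts` are hypothetical names, not part of the SDK, and the guard only narrows the first argument; the data/opts argument still needs a cast unless the parameters are modeled as a correlated tuple.

```ts
// Sketch only (hypothetical helpers, not the PR's implementation):
// narrow the overloaded first argument with a type guard instead of `typeof`
// plus an implicit assumption about the remaining parameters.
type WriteData = string | ArrayBuffer | Blob | ReadableStream

type WriteEntry = {
  path: string
  data: WriteData
}

// Placeholder for the per-request options used elsewhere in the SDK.
type WriteOpts = { requestTimeoutMs?: number }

function isWriteEntries(value: string | WriteEntry[]): value is WriteEntry[] {
  return Array.isArray(value)
}

function normalizeWriteArgs(
  pathOrFiles: string | WriteEntry[],
  dataOrOpts?: WriteData | WriteOpts,
  opts?: WriteOpts
): { writeFiles: WriteEntry[]; writeOpts?: WriteOpts } {
  if (isWriteEntries(pathOrFiles)) {
    // Multi-file call: the second argument can only be the request opts.
    return { writeFiles: pathOrFiles, writeOpts: dataOrOpts as WriteOpts | undefined }
  }
  // Single-file call: the second argument is the file data.
  return {
    writeFiles: [{ path: pathOrFiles, data: dataOrOpts as WriteData }],
    writeOpts: opts,
  }
}
```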

ValentaTomas commented 1 month ago

I also think the path might be redundant as an API argument here, because we can just send the file with the filename as usual. (I read the response in the envd infra PR, so I'm not that sure here.)

Let's keep it in the envd API though, because one of its uses was that you can use it to ensure that the path people upload files to is fixed.

0div commented 1 month ago

@ValentaTomas I addressed your comments and added extra tests for some edge cases.

0div commented 1 month ago

One important thing to note, and I've added it as a code comment, is that we can't expect directories specified in the path of the multipart filename to be taken into consideration. I've tested it, and sure enough only the file name is used as the path; the rest of the path is stripped by the std lib's "mime/multipart" when calling

pathToResolve = part.FileName()
ValentaTomas commented 1 month ago

One important thing to note, and I've added it as a code comment, is that we can't expect directories specified in the path of the multipart filename to be taken into consideration. I've tested it, and sure enough only the file name is used as the path; the rest of the path is stripped by the std lib's "mime/multipart" when calling

pathToResolve = part.FileName()

Ok, this is a very good find — how do you think we should handle this? We want to be able to upload files with any path, but the stripping of paths might make sense to preserve, because it allows people to upload by link easily. This might require some changes to the envd.

0div commented 1 month ago

One important thing to note, and I've added it as a code comment, is that we can't expect directories specified in the path of the multipart filename to be taken into consideration. I've tested it, and sure enough only the file name is used as the path; the rest of the path is stripped by the std lib's "mime/multipart" when calling

pathToResolve = part.FileName()

Ok, this is a very good find — how do you think we should handle this? We want to be able to upload files with any path, but the stripping of paths might make sense to preserve, because it allows people to upload by link easily. This might require some changes to the envd.

If it's really important not to break the current API spec, we could add a custom field to the multipart form data and look for it with some changes to envd.
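
To make the workaround concrete, here is a rough sketch of what the client side could send. The `path` field name is made up for illustration; envd would need a corresponding change to read it instead of relying on the multipart filename, which the Go side reduces to its base name.

```ts
// Sketch of the proposed workaround (field names are hypothetical):
// put the full destination path in its own form field so the server does
// not depend on the multipart filename, which gets stripped to a base name.
const files = [
  { path: '/test_dir/test_file_2.txt', data: 'This is the second test file.' },
]

const form = new FormData()
for (const file of files) {
  form.append('path', file.path)
  const name = file.path.split('/').pop() ?? 'file'
  form.append('file', new Blob([file.data]), name)
}

// e.g. await fetch(uploadUrl, { method: 'POST', body: form })
```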

0div commented 1 month ago

@ValentaTomas I started by adding multi file write support for sandbox_sync to get your opinion on this approach before moving forward with _async.

Considerations

I went the OR route when it comes to typing, as function overloading is not natively supported in Python. That being said, with some non-negligible overhead it is kind of achievable, but I'm not enthralled by the idea, especially when thinking about the previous work generating the api-ref automatically: I have a feeling it wouldn't play well with it.

ValentaTomas commented 1 month ago

For the overload I think you can use the same system as we already have with https://github.com/e2b-dev/E2B/blob/beta/packages/python-sdk/e2b/sandbox_sync/process/main.py#L106

It should be the same thing, right?

ValentaTomas commented 1 month ago

I also suggest naming and exporting all the types (from both SDKs). What do you say about having:

ValentaTomas commented 1 month ago

I'm thinking about what to do when you try to invoke write for multiple files and provide an empty array.

Logically, you might want to notify the user that nothing was written, but throwing an error might not be optimal. If you are generating the files to write, you would need to explicitly check whether the array is empty; otherwise, you would get an error.

In contrast to this, isn't writing 0 files a valid operation? You will also get an array with 0 results, so everything is OK.

ValentaTomas commented 1 month ago

One important thing to note, and I've added it as a code comment, is that we can't expect directories specified in the path of the multipart filename to be taken into consideration. I've tested it, and sure enough only the file name is used as the path; the rest of the path is stripped by the std lib's "mime/multipart" when calling

pathToResolve = part.FileName()

Ok, this is a very good find — how do you think we should handle this? We want to be able to upload files with any path, but the stripping of paths might make sense to preserve, because it allows people to upload by link easily. This might require some changes to the envd.

If it's really important not to break the current API spec, we could add a custom field to the multipart form data and look for it with some changes to envd.

Yeah, I'm thinking that we should probably do this, because people are already using the Beta SDK.

0div commented 1 month ago

I'm thinking about what to do when you try to invoke write for multiple files and provide an empty array.

Logically, you might want to notify the user that nothing was written, but throwing an error might not be optimal. If you are generating the files to write, you would need to explicitly check whether the array is empty; otherwise, you would get an error.

In contrast to this, isn't writing 0 files a valid operation? You will also get an array with 0 results, so everything is OK.

From the perspective that there will likely be less control over which files, if any, are generated, your point makes sense. I will allow empty arrays.
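
For illustration, the expected behavior under this decision (assuming an existing Sandbox instance named `sandbox`, as in the snippets further down):

```ts
// Illustrative only: writing zero files is treated as a valid no-op,
// so generated file lists don't need a special empty check.
const results = await sandbox.files.write([])
console.log(results) // []
```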

mlejva commented 1 month ago

@ValentaTomas @0div

  1. I left a couple of notes about the DX.
  2. It's not completely clear to me what the usage looks like. @0div can you please share code snippets for JS and Python showing how to use the new API? I think especially in Python the naming of these parameters in the write method is confusing:
        path_or_files: str | List[WriteEntry],
        data_or_user: WriteData | Username = "user",
        user_or_request_timeout: Optional[float | Username] = None,
0div commented 1 month ago

@ValentaTomas @0div

  1. It's not completely clear to me what the usage looks like. @0div can you please share code snippets for JS and Python showing how to use the new API? I think especially in Python the naming of these parameters in the write method is confusing:
        path_or_files: str | List[WriteEntry],
        data_or_user: WriteData | Username = "user",
        user_or_request_timeout: Optional[float | Username] = None,

js-sdk

// legacy single-file write method signature
sandbox.files.write("path/to/file.txt", "This is a test file")

// new overloaded multi-file write method signature:
sandbox.files.write([
  { path: 'test_file_1.txt', data: 'This is the first test file.' },
  { path: '/test_dir/test_file_2.txt', data: 'This is the second test file.' },
])

see js-sdk test for more examples

python-sdk (sync & async)

# legacy single-file write method signature
sandbox.files.write("path/to/file.txt", "This is a test file")

# new overloaded multi-file write method signature:
sandbox.files.write([
  { "path": "test_file_1.txt", "data": "This is the first test file." },
  { "path": "/test_dir/test_file_2.txt", "data": "This is the second test file." },
])

see python-sdk test for more examples

mlejva commented 1 month ago

I added @jakubno as a reviewer to check the Python SDK as he's the one with the most Python experience on our team

mlejva commented 1 month ago

I would argue for splitting the methods; these are two different functions for the user. I think 90% of users will use the simple version, for which I like the current signature (no need for custom objects). The only downside is that there will be two ways for users to upload data. But if we name this multifile_upload or multifile_write it shouldn't be confusing.

I think we could even have just the multi-file support in both SDKs, because if you want to write a single file, it's just an array with a single item. The question in my head is whether it's better DX to have a single method with clear arguments where the user always needs to pass an array, even for a single file, or whether it's better to have separate functions for single-file and multi-file writes.

Alternatively, if we want to go Kuba's road with separate functions we could have:

  1. sandbox.files.write() - for a single file
  2. sandbox.files.write_multiple() - for multiple files

Same in JS?
jakubno commented 1 month ago

I usually don't like passing arrays.

When you pass an array, the function will return an array. Will it have only one element? Should I check that it has the element, or should I risk an index-out-of-bounds error?

mlejva commented 1 month ago

One more thing I want to discuss - WriteData in JS SDK.

The WriteData type is this:

export type WriteData = string | ArrayBuffer | Blob | ReadableStream

We're using it for example here:

export type WriteEntry = {
  path: string
  data: WriteData
}

WriteEntry is a single item in the multi-file write - sandbox.files.write([{ path: string, data: WriteData }])

I'm thinking that maybe we should get rid of this type alias and inline it directly like this:

export type WriteEntry = {
  path: string
  data: string | ArrayBuffer | Blob | ReadableStream
}

The reason would be to make it easier for users to immediately see what data can be in their type hints. Otherwise they need to do additional clicks to find what WriteData is in our codebase.

mlejva commented 1 month ago

I usually don't like passing arrays.

When you pass an array, the function will return an array. Will it have only one element? Should I check that it has the element, or should I risk an index-out-of-bounds error?

That's a good argument. I agree. I don't like dealing with arrays and checking the length either.

ValentaTomas commented 1 month ago

I also like the single upload method because it is something users already use a lot; requiring users to always pass an array didn't sit well with me.

For splitting the methods — I really don't want to end up with more methods for the same thing on the same level.

export type WriteEntry = {
  path: string
  data: string | ArrayBuffer | Blob | ReadableStream
}

I think this is the better solution here than the WriteData I originally proposed.

ValentaTomas commented 1 month ago

I usually don't like passing arrays. When you pass an array, the function will return an array. Will it have only one element? Should I check that it has the element, or should I risk an index-out-of-bounds error?

That's a good argument. I agree. I don't like dealing with arrays and checking the length either.

For the multifile upload, you have to pass an array though, right?

EDIT: If I understand it correctly, this is an argument for having the single-file upload there, right?

mlejva commented 1 month ago

I think this is the better solution here than the WriteData I originally proposed.

It would be great if we adopted this approach in general. With TypeScript it's a thin line between using types to hide some abstraction and over-using types.

mlejva commented 1 month ago

For splitting the methods — I really don't want to end up with more methods for the same thing on the same level.

I can maybe see us using the overloads in JS, but I really don't like this write signature in Python:

path_or_files: str | List[WriteEntry],
data_or_user: WriteData | Username = "user",
user_or_request_timeout: Optional[float | Username] = None,

It's not clear at all how to use these methods just from looking at the code. We should really aim to be as clear as possible from the code alone.

It creates many questions in my head: why are data and user mixed together? Why are user and request timeout mixed together? It's very unintuitive.

ValentaTomas commented 1 month ago

Just to clear up a possible confusion here, because when I checked the start of the discussion it might not be clear:

        path_or_files: str | List[WriteEntry],
        data_or_user: WriteData | Username = "user",
        user_or_request_timeout: Optional[float | Username] = None,

This signature should never be seen by the user, because the method is @overloaded, so the visible method signatures will be:

  1. This will be the default and selected first if you start typing:

    @overload
    def write(
        self,
        path: str,
        data: WriteData,
        user: Username = "user",
        request_timeout: Optional[float] = None,
    ) -> EntryInfo:
  2. You can switch to this one explicitly or implicitly by passing an array (because of duck typing) as the first argument or as the files argument.

    @overload
    def write(
        self,
        files: List[WriteEntry],
        user: Optional[Username] = "user",
        request_timeout: Optional[float] = None,
    ) -> List[EntryInfo]:
mlejva commented 1 month ago

Just to clear up a possible confusion here, because of the start of the discussion:

        path_or_files: str | List[WriteEntry],
        data_or_user: WriteData | Username = "user",
        user_or_request_timeout: Optional[float | Username] = None,

This signature should never be seen by the user, because the method is @overloaded, so the visible method signatures will be:

  1. This will be the default and selected first if you start typing:
    @overload
    def write(
        self,
        path: str,
        data: WriteData,
        user: Username = "user",
        request_timeout: Optional[float] = None,
    ) -> EntryInfo:
  2. You can switch to this one explicitly or implicitly by passing an array (because of duck typing) as the first argument or as the files argument.
    @overload
    def write(
        self,
        files: List[WriteEntry],
        user: Optional[Username] = "user",
        request_timeout: Optional[float] = None,
    ) -> List[EntryInfo]:

I see. I must have missed the @overload, sorry. What does the user option do? Is that new?

I'll leave @jakubno to add any notes if he has anything additional. As I said - he has the most Python experience.

jakubno commented 1 month ago

This signature should never be seen by the user, because the method is @overloaded, so the visible method signatures will be:

It ain't true, users can see the source code for python libraries and I often go there

ValentaTomas commented 1 month ago

The user is an optional argument that can be used to modify who you are making the filesystem operation as — if you need to delete a file owned by root or create a file that is owned by a different user than the default one.

We had a lot of confusion around the users/permissions, so having this explicitly visible and modifiable seems better than hiding it again.

EDIT: It is new in the sense that it is on all relevant methods in the Beta SDK.

ValentaTomas commented 1 month ago

This signature should never be seen by the user, because the method is @overloaded, so the visible method signatures will be:

It ain't true, users can see the source code for python libraries and I often go there

Well, I cannot argue with that. At that point isn't the usage clear?

EDIT: What I wanted to say — are you worried about the overload implementation being confusing to people inspecting the code?

mlejva commented 1 month ago

The user is an optional argument that can be used to modify who you are making the filesystem operation as — if you need to delete a file owned by root or create a file that is owned by a different user than the default one.

We had a lot of confusion around the users/permissions, so having this explicitly visible and modifiable seems better than hiding it again.

Is this only a filesystem thing? Or is it on all methods now? My immediate thought would be to expect something like this on every relevant method.

We had a lot of confusion around the users/permissions, so having this explicitly visible and modifiable seems better than hiding it again.

I'm pretty sure this won't solve it. People will still be confused and won't know what to pass there. The solution is that users shouldn't need to mess with permissions by default. The filesystem should be accessible to the default user. I had very relevant feedback on this from our hackathon:

Problem uploading data to sandbox and not being able to open the file because of permissions. Had to use sudo. Everything was in /home/user.

0div commented 1 month ago

Would the pythonic *args (non-keyword arguments) be a good middle ground here? And its equivalent in TS, function sum(...args) {}?

That way a dev can pass one or more args without using arrays.

ValentaTomas commented 1 month ago

Would the pythonic *args (non-keyword arguments) be a good middle ground here? And its equivalent in TS, function sum(...args) {}?

That way a dev can pass one or more args without using arrays.

I think it would be slightly problematic (more so in TS) because we also want to pass the opts for the request.

mlejva commented 1 month ago

plus don't we lose all the type hints?

ValentaTomas commented 1 month ago

Is this only a filesystem thing? Or is it on all methods now? My immediate thought would be to expect something like this on every relevant method.

Yes, it is on all relevant methods.

By default you don't need to mess with this, but when interacting with the filesystem, not having this exposed would mean you couldn't do some operations at all. That's why I chose to do less magic here — the user you pass is used for the operation; there is a default of user, but you can override it.

mlejva commented 1 month ago

Is this only a filesystem thing? Or is it on all methods now? My immediate thought would be to expect something like this on every relevant method. Yes, it is on all relevant methods.

By default you don't need to mess with this, but when interacting with the filesystem, not having this exposed would mean you couldn't do some operations at all. That's why I chose to do less magic here — the user you pass is used for the operation; there is a default of user, but you can override it.

What are the use cases when you need it? It's a simple parameter, but it opens a whole can of new questions like "how do I create a new user?" or "how do I switch to a different user when running a command?" that we'll need to provide answers for. That's why I'm not a big fan of this. It feels only halfway there because it's missing all the technical content and utilities around it.

ValentaTomas commented 1 month ago

plus don't we lose all the type hints?

The variadic argument could be typed, but generally, it should be the last argument. Otherwise, it won't work.
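
A small illustration of that constraint (the signatures are hypothetical, not the SDK's): TypeScript requires a rest parameter to be the last parameter, so a variadic files list leaves no natural place for the trailing request options.

```ts
type WriteEntry = { path: string; data: string }
type WriteOpts = { requestTimeoutMs?: number }

// Does not compile: "A rest parameter must be last in a parameter list."
// function write(...files: WriteEntry[], opts?: WriteOpts): WriteEntry[]

// Keeping both means the options have to come first, which reads poorly:
function write(opts: WriteOpts | undefined, ...files: WriteEntry[]): WriteEntry[] {
  void opts // placeholder; a real implementation would use the options
  return files
}

write(undefined, { path: 'a.txt', data: 'a' }, { path: 'b.txt', data: 'b' })
```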

mishushakov commented 1 month ago

Sorry to get in between, but for the Browserbase SDK I was in a similar situation: we ended up with a load_url method to load a single web page, load_urls to load multiple, and a load method to handle both (depending on the input).

mlejva commented 1 month ago

Sorry to get in between, but for the Browserbase SDK I was in a similar situation: we ended up with a load_url method to load a single web page, load_urls to load multiple, and a load method to handle both (depending on the input).

Thanks for the input. Would you do it the same way again?

I think I'm leaning increasingly towards two separate methods. I'm not sure about having a third one to potentially handle both the single and multi use cases.

mishushakov commented 1 month ago

Would do two methods. My thinking was (still is): I want a function to return X. If a function should return Y, I make a function that returns Y. A function that returns both X and Y is flaky

ValentaTomas commented 1 month ago

@mishushakov Why did you end up with three in the end?

mishushakov commented 1 month ago

Because it felt like, if we have the other two, why not have it also. But tbh, the whole thing was born out of necessity, because I didn't want to start a browser every time I wanted to fetch a URL; load_urls would connect once and do the same thing as load for multiple URLs. I actually prefer the Node way - you can do files.map > fs.writeFile > Promise.all(files).
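
For context, a minimal sketch of that "Node way", using plain node:fs; with the SDK the same pattern would map over sandbox.files.write calls instead.

```ts
// Map over the files and let Promise.all fan out the writes
// (run as an ES module so top-level await is available).
import { promises as fs } from 'node:fs'

const files = [
  { path: 'test_file_1.txt', data: 'This is the first test file.' },
  { path: 'test_file_2.txt', data: 'This is the second test file.' },
]

await Promise.all(files.map((f) => fs.writeFile(f.path, f.data)))
```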

ValentaTomas commented 1 month ago

Three sounds the worst :D

Let's do the two separate if everybody agrees:

  • write_file/writeFile
  • write_files/writeFiles

mlejva commented 1 month ago

Three sounds the worst :D

Let's do the two separate if everybody agrees:

  • write_file/writeFile
  • write_files/writeFiles

I agree about separate methods, but I'm not sure about the naming because there's already a files module:

sandbox.files.writeFile() / sandbox.files.write_file()
sandbox.files.writeFiles() / sandbox.files.write_files()

Maybe feels a little weird?

ValentaTomas commented 1 month ago

Alternative namespace solutions:

  • sandbox.fs.write_file (Not sure if it is clear that you should use this for filesystem)
  • sandbox.filesystem.write_file (This seems too long)
  • sandbox.write_file (Can we get rid of all namespaces without making everything confusing?)

EDIT: I actually feel that maybe not having namespaces lowers the mental difficulty for me when using the SDK — you just type what you want to do as opposed to having to think in which namespace the method should be.

EDIT 2: Also I would not even argue for overloaded write method without the files namespace, because without namespaces you need to name it more specifically like writeFile which tells you that it takes just one file.

0div commented 1 month ago

Alternative namespace solutions:

  • sandbox.fs.write_file (Not sure if it is clear that you should use this for filesystem)
  • sandbox.filesystem.write_file (This seems too long)
  • sandbox.write_file (Can we get rid of all namespaces without making everything confusing?)

EDIT: I actually feel that maybe not having namespaces lowers the mental difficulty for me when using the SDK — you just type what you want to do as opposed to having to think in which namespace the method should be.

EDIT 2: Also I would not even argue for overloaded write method without the files namespace, because without namespaces you need to name it more specifically like writeFile which tells you that it takes just one file.

Keeping a namespace or prefixing the method name could help disambiguate a bit, since that kind of I/O touches other Linux abstractions.

mishushakov commented 1 month ago

I'm going with no namespaces. With the Stripe SDK it makes sense to have namespaces, as they have different "products" and the products have different methods available. We only have one "product" (or class), which is the sandbox, so we should only have sandbox methods at the top level, e.g. sandbox.writeFile.

0div commented 1 month ago

I'm going with no namespaces. With the Stripe SDK it makes sense to have namespaces, as they have different "products" and the products have different methods available. We only have one "product" (or class), which is the sandbox, so we should only have sandbox methods at the top level, e.g. sandbox.writeFile.

The question then is: is this a single-product API or a VM (with its components) API?