feathersjs-ecosystem / feathers-blob

Feathers service for blob storage, like S3.
http://feathersjs.com
MIT License

Refactoring proposal based on AWS SDK and presigned URLs #96

Open claustres opened 1 year ago

claustres commented 1 year ago

feathers-blob currently relies on a model conforming to the abstract-blob-store interface. Most of the time it appears feathers-blob is used to store data in cloud object storages, e.g. AWS S3 or Google Cloud Storage, thus relying on e.g. https://github.com/jb55/s3-blob-store or https://github.com/maxogden/google-cloud-storage under-the-hood, which do not seem to be maintained anymore. For instance, the S3 module uses https://github.com/nathanpeck/s3-upload-stream as a dependency, which has been deprecated for a long time.

Now that most cloud providers provide an interface for their object storage similar to the one Amazon S3 offers, we wonder if refactoring feathers-blob directly using an up-to-date version of the AWS SDK wouldn't be the most relevant.

Moreover, a lot of issues indicate that file uploading/downloading is still a challenging task for most people who have to understand a lot of concepts like blobs, data uris, multipart support, the different client/server responsibilities, etc. In order to simplify this we could also rely on presigned URLs.
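For readers unfamiliar with the idea, here is a minimal sketch of what a presigned URL is: the server signs the operation, the object key and an expiry time, and the storage endpoint later checks that signature without needing a session. This is a simplified HMAC illustration and not the actual S3 Signature Version 4 algorithm; the secret, the host `storage.example.com` and the function names are all hypothetical.

```javascript
const crypto = require('node:crypto');

// Hypothetical secret known only to the signer and the storage endpoint.
const SECRET = 'demo-secret';

// Compute the signature over the method, the object key and the expiry time.
function sign(method, key, expires) {
  return crypto
    .createHmac('sha256', SECRET)
    .update(`${method}\n${key}\n${expires}`)
    .digest('hex');
}

// Server side: issue a URL that allows one operation on one key until it expires.
function presign(method, key, ttlSeconds, now = Date.now()) {
  const expires = Math.floor(now / 1000) + ttlSeconds;
  const signature = sign(method, key, expires);
  return `https://storage.example.com/${encodeURIComponent(key)}?expires=${expires}&signature=${signature}`;
}

// Storage side: accept the request only if the signature matches and is not expired.
function verify(method, url, now = Date.now()) {
  const parsed = new URL(url);
  const key = decodeURIComponent(parsed.pathname.slice(1));
  const expires = Number(parsed.searchParams.get('expires'));
  const given = Buffer.from(parsed.searchParams.get('signature') || '');
  const expected = Buffer.from(sign(method, key, expires));
  const valid = given.length === expected.length && crypto.timingSafeEqual(given, expected);
  return valid && Math.floor(now / 1000) <= expires;
}
```

The point of the pattern is that the client can then send the file straight to the storage endpoint with that URL, so the file content never has to pass through the application server.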

At Kalisio we have already started an effort with something able to replace feathers-blob based on this proposal. It looks like this:

Let us know what you think about that, notably if feathers-blob handles other use cases that would not benefit from this proposal. Otherwise, any help is welcome to upgrade this module.

fratzinger commented 1 year ago

Oh yes please!!! We have two instances. One legacy server is running on a windows vm and thus we use fs with feathers-blob. But the newer server runs on aws with AWS S3. We use https://github.com/jb55/s3-blob-store with multer which really is a pain because it uses too much memory. We need to switch to presigned URLs asap. But we also need to support the legacy windows fs-blob-store but I can get around that with app.get/app.set easily.

I'm highly interested in your workflow! If you can share something in the short run it would be highly appreciated! Even maybe privately and maybe we can work out a way to make it more generic and publicly available. I would love to see this baked into feathers-blob somehow!

We have a database table that stores additional information about uploads. In fact we use a public /uploads service that redirects to internal /uploads-blobs and /uploads-db depending on the request. We also use https://github.com/fratzinger/feathers-casl to decide who has access to which uploads and operations.

DaddyWarbucks commented 1 year ago

I like where this is going! And I agree that upload/download is difficult in feathers for all the reasons listed. Feathers is a JSON server and 99% of its tools, ecosystem, community are driven around that so trying to fit in multipart, etc is difficult. And I agree that most users are using some 3rd party service, and anyone not using a third party service is probably familiar enough with these concepts to comfortably roll their own solution. But, with that said, this rewrite may deserve to be a totally new package because it is so different.

I have not used feathers-blob in the past and have used my own implementation with Cloudinary. So my only concern with this new approach would be its tight coupling with AWS. So something more generic would be better IMO. Maybe something like

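// Pseudocode: sign/upload/download stand for whatever the chosen provider SDK exposes.
const service = blobService({
  // Return a signed URL (or signed params) for the requested operation.
  sign: (params) => {
    return cloudinary.sign(params);
    // return aws.sign(params);
    // return somethingElse.sign(params);
  },
  // Send the content to the provider using the signed URL.
  upload: (signedUrl, base64) => {
    return cloudinary.upload(signedUrl, base64);
    // return aws.upload(signedUrl, base64);
    // return somethingElse.upload(signedUrl, base64);
  },
  // Fetch the content from the provider using the signed URL.
  download: (signedUrl) => {
    return cloudinary.download(signedUrl);
    // return aws.download(signedUrl);
    // return somethingElse.download(signedUrl);
  }
});

app.use('/uploads', service);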

That leaves all the provider setup and config to the user.
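To make the idea above concrete, here is a runnable sketch of such a factory. Everything in it is an assumption made for illustration: `blobService` is not an existing package, the `{ key, base64 }` payload shape is invented, and `memoryProvider` is an in-memory stand-in for a real provider SDK. Only the `create`/`get` method names follow the usual Feathers service convention.

```javascript
// Hypothetical factory: all provider-specific work is injected as functions.
function blobService({ sign, upload, download }) {
  return {
    // create() asks the provider for a signed upload URL, then sends the content.
    async create(data) {
      const signedUrl = await sign({ operation: 'upload', key: data.key });
      await upload(signedUrl, data.base64);
      return { key: data.key };
    },
    // get() asks the provider for a signed download URL, then fetches the content.
    async get(key) {
      const signedUrl = await sign({ operation: 'download', key });
      const base64 = await download(signedUrl);
      return { key, base64 };
    }
  };
}

// In-memory stand-in for a real provider, only here to make the sketch runnable.
function memoryProvider() {
  const objects = new Map();
  return {
    sign: ({ operation, key }) => `memory://${operation}/${key}`,
    upload: (signedUrl, base64) => {
      objects.set(signedUrl.replace('memory://upload/', ''), base64);
    },
    download: (signedUrl) => objects.get(signedUrl.replace('memory://download/', ''))
  };
}
```

With this shape, switching providers means passing a different set of three functions, and the service itself never imports a provider SDK.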

claustres commented 1 year ago

@DaddyWarbucks Thanks for the feedback. Using the AWS SDK does not mean a tight coupling to AWS, as their S3 API has become a de facto standard. We already use the AWS SDK to connect to three different providers on our side: AWS, OVH and Scaleway. Currently OVH does not yet support CORS, for instance, so a proxy is required in your app. Any feedback on AWS SDK compatibility with other cloud providers is welcome.

I agree that it might be better to start a whole new module. At the present time our implementation is really simple: it reads the file content as a buffer, then sends it to the signed URL from the frontend (or the backend if using a proxy). So we read the entire content in memory, which is not appropriate for large files, but is probably sufficient for most use cases such as image/docs upload. Any insight on using presigned URLs with multipart upload is also welcome to improve this.
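As a rough illustration of what a multipart approach changes, the sketch below only plans the byte ranges of the parts, so that each part can be read and sent separately instead of loading the whole file. It relies on the S3 rule that every part except the last must be at least 5 MiB; the function name `planParts` and the returned shape are hypothetical, and no upload is performed here.

```javascript
// S3 requires every part except the last one to be at least 5 MiB.
const MIN_PART_SIZE = 5 * 1024 * 1024;

// Split a file of `totalSize` bytes into numbered [start, end) byte ranges.
function planParts(totalSize, partSize = MIN_PART_SIZE) {
  if (partSize < MIN_PART_SIZE) {
    throw new RangeError('partSize is below the 5 MiB minimum');
  }
  const parts = [];
  for (let start = 0, partNumber = 1; start < totalSize; start += partSize, partNumber++) {
    // `end` is exclusive, so it can be passed directly to Blob.slice(start, end).
    const end = Math.min(start + partSize, totalSize);
    parts.push({ partNumber, start, end });
  }
  return parts;
}
```

In a browser, each range could be turned into a part with `file.slice(start, end)` and sent to its own presigned URL, so only one part needs to be in memory at a time.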

We will soon share a skeleton of our implementation.

cnouguier commented 1 year ago

In the future, a potential solution to implement multipart upload is presented in this article: https://blog.logrocket.com/multipart-uploads-s3-node-js-react/

mdartic commented 1 year ago

Thanks for your proposal.

Actually, we are using feathers-blob with either a file system or s3-based storage (and we use aws-sdk under the hood too). We use minio as we host all of our infra with open source software. (if interested, code is available here)

The backend part is quite easy, thanks to this package. We only need to choose the fs/s3 storage depending on the config the user set in our project.
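A minimal sketch of that kind of config-driven choice is shown below. It is not the code of the project mentioned above, and it deliberately uses a simplified `put`/`get` interface rather than the real abstract-blob-store API. The `memory` backend is a stand-in for an S3-based store, and the `type`/`root` config keys are hypothetical.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// File system backend: one file per key under `root` (flat keys only in this sketch).
function fsStore(root) {
  fs.mkdirSync(root, { recursive: true });
  return {
    put: (key, buffer) => fs.writeFileSync(path.join(root, key), buffer),
    get: (key) => fs.readFileSync(path.join(root, key))
  };
}

// In-memory backend, standing in for an S3-based store in this sketch.
function memoryStore() {
  const objects = new Map();
  return {
    put: (key, buffer) => { objects.set(key, Buffer.from(buffer)); },
    get: (key) => objects.get(key)
  };
}

// Pick the backend from the configuration; callers only see put/get.
function createStore(config) {
  switch (config.type) {
    case 'fs': return fsStore(config.root);
    case 'memory': return memoryStore();
    default: throw new Error(`Unknown store type: ${config.type}`);
  }
}
```

Because both backends expose the same two methods, the rest of the application does not need to know which one the configuration selected.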

The frontend is not so nice, as I think we don't send data directly to the provider in case of s3, but we send it first to the API. We don't use presigned URL, it could be easier to do. We can do better.

We also use a proxy mechanism with nginx + feathers to check whether the user is allowed to download an s3 object. If the user has the authorization from the API, we let the user access the s3 object.

We think we can do better, the proxy routes upload/download are an interesting way, and clearly more robust / secure than our nginx implementation.

If the package could still be compatible with a file system, that would be nice!

claustres commented 1 year ago

I think that trying to create a module handling all use cases will probably fail due to the underlying complexity. Relying on presigned URLs is really a different beast than relying on a fs storage. The whole workflow is actually different.

We plan to soon share a feathers-s3 module specifically focused on this to avoid any confusion. It would then still be possible to use feathers-blob alongside it to store data in a local file whenever required.

claustres commented 1 year ago

Maybe an option to support filesystem-based storage with the same API is to use https://github.com/minio/minio. Any feedback on this tool would be useful (does it work fine with the AWS SDK, does it support presigned URLs, etc.?).

bil-ash commented 1 year ago

"I think that trying to create a module handling all use cases will probably fail due to the underlying complexity. Relying on presigned URLs is really a different beast than relying on a fs storage. The whole workflow is actually different.

We plan to soon share a feathers-s3 module specifically focused on this to avoid any confusion. It would then still be possible to use feathers-blob alongside it to store data in a local file whenever required."

Looking forward to feathers-s3 because the current feathers-blob is indeed problematic. I plan to use minio on a storage VPS (which is not used much now) because minio does support presigned URLs: https://min.io/docs/minio/linux/integrations/presigned-put-upload-via-browser.html

fratzinger commented 1 year ago

I've started working on https://github.com/Artesa/feathers-file-stream

It's not designed to be using presigned urls at all but it uses streams.

Prerequisites

I'm curious, if you're not working with presigned urls yet, how do you handle files with feathers-blob? I use express with multer v1 with memoryStorage which is the first bottleneck. Even if I use diskStorage, then feathers-blob needs the data as Buffer or uri. Both are in memory, I guess.

Do you know a way not to store the files in memory?

@artesa/feathers-file-stream

Yesterday I learned about multer@next, which creates a ReadableStream for the files. Using a stream reduces the memory consumption. But how do you work with the stream on the feathers layer (hooks) and pass it to s3 or elsewhere? @artesa/feathers-file-stream does exactly that. It takes a stream and uploads it. Currently fs and @aws-sdk/client are supported.
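The memory argument for streams can be sketched with Node's built-in stream utilities alone. This is not the @artesa/feathers-file-stream implementation: `countingSink` is a hypothetical destination that stands in for a storage client and only counts bytes, and `storeStream` is an invented helper name.

```javascript
const { Readable, Writable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Hypothetical destination standing in for a storage client:
// it counts what it receives and never keeps the chunks.
function countingSink() {
  const stats = { bytes: 0, chunks: 0 };
  const sink = new Writable({
    write(chunk, _encoding, callback) {
      stats.bytes += chunk.length;
      stats.chunks += 1;
      callback();
    }
  });
  return { sink, stats };
}

// Forward a source stream to a destination chunk by chunk.
// pipeline() handles backpressure and propagates errors from either side.
async function storeStream(source, sink) {
  await pipeline(source, sink);
}
```

Only the chunk currently in flight is held in memory, which is the difference from passing a whole Buffer or data URI to the service.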

It's not bound to multer. Really anything that produces a stream can work. I would love to see a koa version or even a socket.io version to be transport independent. I also would love to see a minio service.

It's at a really early stage and not even published to npm yet. But seeing this valuable discussion here, I'm more than curious what you peeps think about it.

Even though it's not using presigned urls, do you think this is a thing worth investigating more? Or am I going the wrong way?

claustres commented 1 year ago

Dear community and contributors to this issue @fratzinger @bil-ash @mdartic @DaddyWarbucks, we are pleased to introduce you to https://github.com/kalisio/feathers-s3. Before releasing a first version we welcome feedback from anyone willing to test.

We hope it will provide a more flexible and reliable way to manage upload/download of possibly large files in a Feathers app.

bil-ash commented 1 year ago

Great work. Now waiting to get a big storage VPS as part of black friday deals, set up minio on it and then use feathers-s3.

claustres commented 1 year ago

@bil-ash It would be great to have feedback on somewhat "replacing" filesystem storage with minio + feathers-s3.

palmtown commented 1 year ago

Hello Community,

Thank you for the work you've done on feathers-blob, it has been helpful in getting my development off the ground. Recently, as shown in issue https://github.com/feathersjs-ecosystem/feathers-blob/issues/97 I discovered that uploading to S3 using AWS v3 SDK and feathers-blob triggers an error as described therein.

Having stated that, I would recommend "refactoring feathers-blob directly using an up-to-date version of the AWS SDK" as I believe it would "be the most relevant." Why? As you stated, "most of the time it appears feathers-blob is used to store data in cloud object storages, e.g. AWS S3." This is further proof, in short, that feathers-blob has evolved. In my use case, it's a one-stop solution for many storage options which allows me to use one package that solves many requirements.

In conclusion, why "wouldn't" feathers-blob continue to evolve, and why not now?

Thank you.

palmtown commented 1 year ago

Hello Community,

In addition to my last reply, I did take a look at https://github.com/kalisio/feathers-s3 and it's a good solution for certain use cases. The keywords being "certain use cases." feathers-blob provides many solutions in one, akin to an all-in-one printer that has print, scan, copy, fax, etc., whereas feathers-s3 is simply print. Nothing at all wrong with that; however, as mentioned, it's for certain use cases.

Moreover, I read many of the comments. All good dialog. Whatever the outcome, great job thus far!

claustres commented 1 year ago

Hi @palmtown and thanks for the feedback, it is always good to know about use cases. IMHO feathers-s3 does the same things that you can do with feathers-blob, when using an object storage, but in a better and more flexible way. You can use "single part upload" to transfer a whole object at once (like feathers-blob does) but you can also use "multi part upload" to split large objects into parts and avoid using too much memory (unlike feathers-blob). Both modes can work using your backend as a proxy (like feathers-blob does) or directly between your frontend and the object storage (unlike feathers-blob).

So the key difference for me is the underlying interface we support:

As a consequence feathers-s3 has a more narrow field of application; I assume that's what you are referring to when saying "certain use cases". Typically, you will not be able to directly read/write from/to a file system or a google drive. https://github.com/minio/minio could be a solution for the former (I would love to hear from people about it), but we have no solution yet for the latter.

Refactoring feathers-blob directly using an up-to-date version of the AWS SDK will mean deprecating the abstract blob store interface. In that case, I would say it's already done and called feathers-s3. The main interest I would see in keeping feathers-blob up-to-date is to maintain the abstract blob store interface, but in that case what should be refactored directly using an up-to-date version of the AWS SDK is the S3 blob store modules, not feathers-blob (again, feathers-blob does not use the AWS SDK except in tests, as it relies on the abstract blob store interface). On my side I can see two key problems about this:

palmtown commented 1 year ago

Hello @claustres ,

Thank you for your response. It makes vivid the dynamics involved in determining the best possible decision. The key difference for me, which you've already mentioned, is that "feathers-s3 has a more narrow field of application." Specifically, feathers-blob affords me the luxury of not having to implement https://github.com/minio/minio, while maintaining the ability to perform local operations as well as cloud-based ones (e.g. AWS S3)--simple and powerful.

As I understand, feathers-blob provides a store abstraction, in addition to integrating with other stores (e.g. AWS SDK v2). On the other hand, feathers-s3 is a client (per se) that will connect to services leveraging an S3 compatible API--no store abstraction. Having this knowledge, and in my case, I need both, a local store, as well as accessing AWS S3, under one feathers service--feathers-blob does this now.

In response to "IMHO feathers-s3 does the same things that you can do with feathers-blob," I agree and disagree. I agree it provides client features for accessing other stores; however, I disagree that it provides a store, hence your mentioning of "when using an object storage".

One disadvantage of using another object store such as https://github.com/minio/minio is introducing an additional point of failure. MinIO runs as a standalone server. Currently, feathers-blob is integrated into feathersjs and our current monitoring and failover blankets feathers-blob as part of the larger feathersjs deployment--no standalone environment needed.

Additionally, while multi-part upload is a great feature, and fits other use cases, it is not a feature I need. The reason is that my files are small.

Having stated that, other concerns come about. You commented that "most of the time it appears feathers-blob is used to store data in cloud object storages, e.g. AWS S3 or Google Cloud Storage, thus relying on e.g. https://github.com/jb55/s3-blob-store or https://github.com/maxogden/google-cloud-storage under-the-hood, which do not seem to be maintained anymore." The concern is that they "do not seem to be maintained anymore." I'm dealing with npm dependency errors right now, and don't need any more. This is where I believe feathers-s3 comes in.

What do you think about replacing the aforementioned packages with feathers-s3? Meaning, integrating both feathers-s3 and feathers-blob. @DaddyWarbucks shed some light on one approach. The goal would be to keep the store abstraction, which is working great, yet provide a way to extend feathers-blob so that other packages like feathers-s3 can be integrated.

claustres commented 1 year ago

Dear @palmtown, I think it could be a good idea to keep feathers-blob up-to-date w.r.t. the AWS SDK by relying on feathers-s3 under the hood, as it seems to provide some benefits that feathers-s3 alone cannot provide. I will try to dedicate some time to working on this but without any ETA, as I currently have other priorities, so any help is welcome to speed things up. As a first step we should make the feathers-s3 service implement the abstract blob store interface and add the associated test harness. Then, we could update the S3 tests in feathers-blob to ensure it works as expected.

claustres commented 1 month ago

So far there has been no progress on this. I propose to close the issue unless somebody else would like to tackle it.

From our point of view https://github.com/kalisio/feathers-s3 is now full-featured with presigned URLs, multi-part upload, etc., which makes compliance with the abstract-blob-store interface too constraining: it would only work for specific use cases. Moreover, almost all blob store modules seem to no longer be maintained. We haven't received any feedback on using feathers-s3 with https://github.com/minio/minio, but we believe it should provide an alternative for people who do not want to rely on external/remote cloud services. Last but not least, the current way of deploying web apps / services with tools like managed k8s clusters makes the use of a low-level file-based system for apps less relevant.