expressjs / multer

Node.js middleware for handling `multipart/form-data`.

Using temp files in new api is not the right approach #495

Open · devconcept opened this issue 7 years ago

devconcept commented 7 years ago

The new v2 api makes some breaking changes to the way the old api used to work. One of the most notable changes is the removal of the storage engines and the exposure of a stream to be manipulated directly.

Although this may seem like an advance, in my opinion it is actually a step back from the flexibility that v1 offers (at least in the way it is drafted now).

I found some problems with how files are handled, starting with the use of the fs-temp module as an intermediary.

Here is a small summary of the side effects of this decision:

- If I want to store a file in a given folder now I have to manually write all the code required for this because I have no way to configure it. This was trivial in v1. The extra work required seems like a lot compared to just writing a string with a path.
- The temporary files will be left in the filesystem. A lot of developers don't know how to properly deal with streams or files; even when they do, they will probably forget to delete the copy after piping the file. This only adds to the extra work required to simply "store a file".
- If I don't want to use the filesystem as my storage I still need to wait for all the I/O to finish before I can store my files elsewhere (the cloud, a database, etc.). Imagine this for a lot of large files.

In summary, this change consumes a lot of resources just to be able to tell you the file size and the mime type. Maybe I'm missing something here, but there might be another way to deal with this that does not imply stripping the api of features.

I think that "storing a file" should not be more difficult than just create the stream, handling the success and the error and calling pipe.

Maybe a way to tackle this problem is to accept a writable stream per file and pipe to it automatically, or to expose the busboy streams for consumption directly, but writing an extra temporary file inside a fixed folder will definitely cause more problems than it solves.

I wish the new api looked more like this:

```js
const upload = multer('/uploads');
```

or maybe this:

```js
const upload = multer({
    stream: (req, file) => {
        return writableStream();
    }
});
```

or both; the point is that I should still be able to store my files easily and efficiently anywhere I want.
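
As a rough sketch of how I picture that second variant being used (the `stream` option, and the `file.originalName` field it reads, are part of this proposal and not an existing multer API):

```js
const fs = require('fs')
const path = require('path')
const express = require('express')
const multer = require('multer') // hypothetical version with a `stream` option

const app = express()

// Proposed behaviour: multer pipes each incoming file into whatever writable
// stream the application returns here, so the bytes go straight to their
// destination without an intermediate temp file.
const upload = multer({
  stream: (req, file) => {
    // This could just as well be a cloud SDK upload stream or a database stream.
    return fs.createWriteStream(path.join('/uploads', file.originalName))
  }
})

app.post('/profile', upload.single('avatar'), (req, res) => {
  res.end('uploaded')
})

app.listen(3000)
```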

LinusU commented 7 years ago

Hey, sorry for the super delayed answer to this. I'll try to address some of your points here and hopefully we can have a further discussion :)

> If I want to store a file in a given folder now I have to manually write all the code required for this because I have no way to configure it. This was trivial in v1. The extra work required seems like a lot compared to just writing a string with a path.

This is true. I think that it would be trivial to build this functionality on top of multer though.
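
Something along these lines should be enough to get the old "save everything to a folder" behaviour back (just a sketch; the `file.stream`/`file.originalName` property names and the middleware factories follow the v2 draft as I understand it and may still change):

```js
const fs = require('fs')
const path = require('path')
const { pipeline } = require('stream')

// Sketch of a small middleware layered on top of multer: it copies every file
// that multer has already parsed into a directory of the application's
// choosing and then hands control to the next handler.
function saveToFolder (dir) {
  return (req, res, next) => {
    const files = req.files || (req.file ? [req.file] : [])
    if (files.length === 0) return next()

    let pending = files.length
    let failed = false

    for (const file of files) {
      const dest = fs.createWriteStream(path.join(dir, file.originalName))
      pipeline(file.stream, dest, (err) => {
        if (failed) return
        if (err) { failed = true; return next(err) }
        if (--pending === 0) next()
      })
    }
  }
}

// Usage, placed right after the multer middleware in the route chain:
// app.post('/upload', upload.any(), saveToFolder('/uploads'), handler)
```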

I looked at a lot of other APIs when coming up with this one and this seems to be how PHP, Django, etc. are doing it.

Personally I think that just storing the files in a folder without any further processing is quite uncommon. It also requires you to move stuff out of the upload directory after validating the files, instead of moving files in after they have been validated. This means that there could potentially be unvalidated files being exposed to other parts of the system for a short time, or, if the process is ended at the wrong time, forever.

> The temporary files will be left in the filesystem. A lot of developers don't know how to properly deal with streams or files; even when they do, they will probably forget to delete the copy after piping the file. This only adds to the extra work required to simply "store a file".

If you look at the implementation you'll see that this is not actually the case. In fact, the file is unlinked right after it's opened, which means that the file will be removed by the operating system as soon as the fd is closed, so even if the process is force closed the file won't be left over...

> If I don't want to use the filesystem as my storage I still need to wait for all the I/O to finish before I can store my files elsewhere (the cloud, a database, etc.). Imagine this for a lot of large files.

This was the initial reasoning I had when designing the v1 API. But what I have realised is that it's very uncommon for anyone to want to actually upload the file straight to a cloud service without doing any inspection/validation of the file. In fact, we couldn't even see the file size before starting to upload it to the cloud, so when large files were uploaded we would start saving them in the cloud, and then have to abort when we realised that the file was too large. This is potentially costly and is something that we want to avoid.

> In summary, this change consumes a lot of resources just to be able to tell you the file size and the mime type.

Actually, you'll be provided with the file size right away, and the mime-type will now be detected by looking at the file so you'll get both the browser sent mime-type and the detected one.
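
So a route can do all of its checks before the bytes ever leave the temp file, roughly like this (the property and method names follow the current v2 draft as I understand it and might still change):

```js
const fs = require('fs')
const express = require('express')
const multer = require('multer') // v2 draft

const app = express()
const upload = multer()

app.post('/photos', upload.single('photo'), (req, res) => {
  const file = req.file

  // The full size is already known because the upload has been received.
  if (file.size > 5 * 1024 * 1024) {
    return res.status(413).end('file too large')
  }

  // Both the type the browser claimed and the type detected from the actual
  // contents are available, so spoofed extensions can be rejected here.
  if (file.detectedMimeType !== 'image/png') {
    return res.status(415).end('only PNG uploads are accepted')
  }

  // Only validated files ever reach the final destination.
  file.stream
    .pipe(fs.createWriteStream('/photos/' + Date.now() + '.png'))
    .on('finish', () => res.end('stored'))
})

app.listen(3000)
```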

> Maybe a way to tackle this problem is to accept a writable stream per file and pipe to it automatically, or to expose the busboy streams for consumption directly, but writing an extra temporary file inside a fixed folder will definitely cause more problems than it solves.

I've been thinking about that approach as well, but in the end I felt that it would be better to just use Busboy directly at that point. It felt too complicated for most users...

A potentially interesting way to go is to have a lower-level API, one that works more like busboy and would give you complete control. Then we could build Multer on top of that api...

devconcept commented 7 years ago

Here are some thoughts on the subject

> Personally I think that just storing the files in a folder without any further processing is quite uncommon. It also requires you to move stuff out of the upload directory after validating the files, instead of moving files in after they have been validated. This means that there could potentially be unvalidated files being exposed to other parts of the system for a short time, or, if the process is ended at the wrong time, forever.

Following your reasoning, here are some facts that it doesn't consider:

If I want to store files in a folder, I have to:

  1. Read the request
  2. Store the file in the temp folder (Why?)
  3. Move it to upload folder
  4. Process it
  5. Move it to its final destination.

The fact that the extra step (number 2) is abstracted by your api only makes it worse, because there is no way for me to get around it.

Don't you think it would be better to have this instead?

  1. Read the request
  2. Store the file in the upload folder
  3. Process it
  4. Move it to its final destination.

Here are some facts that you need to consider for that extra step:

It requires double the storage space for any application. This is because moving a file requires you to have space for both the file being read and the file being written. Even when you delete the temporary file afterwards, if the files are large and there is a lot of traffic you still need that space to guarantee that file storage does not fail.

This is a huge problem for any successful application trying to scale, because it increases hardware costs and processing time (which grows with every step in the system), pushing people to consider using Busboy directly and cutting out the middle-man.

Even with only four steps, you cannot assume that I will use the file system as my intermediary storage. I might upload the file to the cloud, a database or whatever; process it; and then move it within that storage or to a different one, none of them being the file system. This gives me freedom of choice in how to deal with my files.

Continuing on the subject of freedom: you based your api on frameworks like Django, which are known to be monolithic, while Expressjs and the Nodejs community are famous for being the other way around.

If I am a consumer of your module and you preserve the path functionality (so I don't have to move the files manually myself), upgrading to a new version will be easier, because the new api is similar to the old one for the majority of applications. Almost everyone surely stores files in a given folder and continues with their application logic after that.

Also consider this: if I'm trying to debug my application and I have to look in the OS temporary folder, my uploaded files will sit alongside every other garbage file in my system. Wouldn't it be easier if I only had to look in a folder that I designated for this, so I can isolate the cause quickly? It makes debugging really hard.

What if I have an app that automatically cleans temporary files at regular intervals?

> If you look at the implementation you'll see that this is not actually the case. In fact, the file is unlinked right after it's opened, which means that the file will be removed by the operating system as soon as the fd is closed, so even if the process is force closed the file won't be left over...

Maybe I didn't explain this one correctly. What I meant was that, if I am a new user of Nodejs, I could use your module to learn how to store a file, and once I have mastered streams (which could take some time) I could further customize file handling for my application. The point of this statement is that this module is already serving a wide audience (from newbie to expert), so there is no point in making the api incredibly hard to use for any real-life application. It would be wise to continue targeting all audiences, because doing so adds almost no implementation effort.

As an observation: when I decided to help solve the problems the new api raised, I forked and downloaded your module, installed all the dependencies and ran its tests. The first thing I noticed was that an enormous number of files were created every time I ran those tests, and those files remained even after I closed the IDE. While this could be a bug or some other easily solved detail, you should consider that this could happen to your users too. There are a lot of things that can prevent a file from being deleted (antivirus, locks, etc.).

This is the code you mention:

```js
result.files.forEach(function (file) {
    file.stream = fs.createReadStream(file.path)

    file.stream.on('open', function () {
        fs.unlink(file.path, function () {})
    })

    appendFile(file)
})
```

You don't handle the error, and there is no way for your users to do so. You also assumed that I will use the stream you provide, while the truth is that I can take the path and manipulate the file with a different streaming module. I think this is the cause of the bug: I never opened the stream you provide, therefore the file is never deleted. The point of the new api is to allow the use of any streaming module, so this only works against that goal.
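
For example, a consumer that follows the pattern below, which seems perfectly reasonable under the new api, never reads the stream you created (the `/archive` path and the `originalName` field are only illustrative):

```js
const fs = require('fs')
const express = require('express')
const multer = require('multer') // v2 draft

const app = express()
const upload = multer()

// The handler ignores the provided `file.stream` and opens its own stream
// from `file.path` instead, e.g. because some other library wants a path.
app.post('/upload', upload.single('doc'), (req, res, next) => {
  fs.createReadStream(req.file.path)
    .pipe(fs.createWriteStream('/archive/' + req.file.originalName))
    .on('finish', () => res.end('done'))
    .on('error', next)

  // `req.file.stream` is never consumed here; per the point above, this is
  // the path on which the temp file can end up never being removed.
})

app.listen(3000)
```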

> Actually, you'll be provided with the file size right away, and the mime-type will now be detected by looking at the file so you'll get both the browser sent mime-type and the detected one.

I wrote a PR with a proposal that allows you to learn the file size and the mime-type without that extra step, while still giving you a readable stream to further customize the file. This proves that you don't really need that step for anything.
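
This is not the PR's actual code, but a sketch of the underlying technique: a small transform stream can report the mime-type from the first bytes and the total size once the data has flowed through, while the consumer still receives an ordinary readable stream and nothing is ever written to a temp folder.

```js
const { Transform } = require('stream')

// Wrap a busboy file stream: sniff the mime-type from the first chunk, count
// bytes as they pass through, and report both when the stream ends. The
// caller keeps a plain readable stream to pipe wherever it wants.
function inspect (source, onDone) {
  let size = 0
  let detectedMimeType = null

  const tap = new Transform({
    transform (chunk, encoding, callback) {
      if (detectedMimeType === null) {
        // Minimal magic-byte check for illustration only; a real
        // implementation would use a proper detection library.
        const pngMagic = Buffer.from('89504e470d0a1a0a', 'hex')
        detectedMimeType = chunk.slice(0, 8).equals(pngMagic)
          ? 'image/png'
          : 'application/octet-stream'
      }
      size += chunk.length
      callback(null, chunk)
    },
    flush (callback) {
      onDone({ size, detectedMimeType })
      callback()
    }
  })

  return source.pipe(tap)
}
```

The wrapped stream can then be piped straight into a folder, the cloud or a database; the size and type information arrives without any temporary file ever having been created.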

There is something you are missing here. Middlewares are configurable building blocks that can be reused. Using Busboy directly is actually harder, more error-prone, and creates more boilerplate than using Multer with 2 or 3 configuration properties. You can still create a new api that is efficient, reusable and easy to use for everyone. Just take a look at the proposed Readme.

Maybe you don't like this solution or want to use other naming conventions, but I found that this approach allows me to use any stream implementation without having to store files twice; I can still know the size and the mime-type, and I can reuse the middleware in multiple locations.

> A potentially interesting way to go is to have a lower-level API, one that works more like busboy and would give you complete control. Then we could build Multer on top of that api...

Building everything from scratch is sometimes known to be a mistake. There is nothing wrong with your lower-level dependency (in my humble opinion). You must remember that the goal of the new api is to allow the use of any stream implementation, so changing the internal details doesn't add or remove anything toward achieving that goal.