Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0
714 stars 132 forks source link

Add client-side avatar upload image size validation #3114

Closed cblanken closed 3 months ago

cblanken commented 7 months ago

A simple proof-of-concept addressing issue #3100. At the moment, it just pops an alert() if the target file is more than 1MB and stops the form from submitting. image

If there's a different preferred method to display client-side notifications or any other corrections I should make, please let me know and I'll implement them.

Yorwba commented 7 months ago

This solution certainly gets the job done! However, you're hardcoding the error message so it can't be translated. That definitely has to change, e.g. by PHP-interpolating the string into an HTML attribute within the edit_profile.ctp template.

If you want to display the validation error in a slightly prettier way, you can also have a look at register.ctp and register.ctrl.js for inspiration.

But that's entirely optional. Otherwise, in addition to making the message translatable, I'd only rename avatar.upload.js to edit_profile.js so it's clearer which template the script belongs to, and use IDs to locate the file and submit inputs instead of your current solution with querySelector.

jiru commented 7 months ago

Thank you for this PR! 🙂

I have not tried your branch yet, but from your commit I guess the file size check won't work if one uses keys to navigate the form and submit it using the Enter key. I think you want to listen the submit event instead of click.

cblanken commented 7 months ago

I have not tried your branch yet, but from your commit I guess the file size check won't work if one uses keys to navigate the form and submit it using the Enter key. I think you want to listen the submit event instead of click.

Interestingly, it does actually still work when tabbing and using the Enter key. I wonder if Chrome automatically falls back to the "click" event for that :thinking:?

Of course, you're right though. It should definitely be listening for the form "submit". I'll fix it up.

cblanken commented 7 months ago

This solution certainly gets the job done! However, you're hardcoding the error message so it can't be translated. That definitely has to change, e.g. by PHP-interpolating the string into an HTML attribute within the edit_profile.ctp template.

If you want to display the validation error in a slightly prettier way, you can also have a look at register.ctp and register.ctrl.js for inspiration.

But that's entirely optional. Otherwise, in addition to making the message translatable, I'd only rename avatar.upload.js to edit_profile.js so it's clearer which template the script belongs to, and use IDs to locate the file and submit inputs instead of your current solution with querySelector.

Thank you, I was going to ask how best to go about localization as I wasn't really sure where to start. I'll look into this.

cblanken commented 7 months ago

@Yorwba I've applied the fixes you described as far as I can tell.

I'd like to make the error notification prettier at some point, but I know next to nothing about AngularJS, so I may have to come back to it later. I don't think I'll be able to setup anything like register.ctp and register.ctrl.js until I have time to dig into AngularJS. Although, I'd prefer not to waste time wtih it if there's going to be a migration to something else like mentioned in issue #3105.

jiru commented 7 months ago

@cblanken I tried your branch and it did not work as excepted. I tried uploading various image size and quickly figured out the correct value is lower than 1024*1024. I tried crafting image files using this command dd if=/dev/zero count=1 bs=1047437 of=~/images/1MB.png (of course the image contents are invalid but I can see if such file produces an nginx 413 error or an image decoding error.

So it turns out the nginx size check is not on the uploaded file but on the entire HTTP body request (which happens to contain a form which happens to contain a file). For example, the above dd command works with the empty "contributor" user profile, but if you type something in the profile description it produces a 413.

I don’t think we can predict the size of the request body. The only solution I can think of is to use different-enough values for nginx client_max_body_size and the form file check so that in practice users never get 413s.

cblanken commented 7 months ago

I don’t think we can predict the size of the request body. The only solution I can think of is to use different-enough values for nginx client_max_body_size and the form file check so that in practice users never get 413s.

Ideally, couldn't the upload image form be isolated from the rest of the profile form? That should make the upload size consistent since it would only be the image_size + form_size which should be static when there aren't any other inputs attached like the user's description.

jiru commented 7 months ago

For example, the above dd command works with the empty "contributor" user profile, but if you type something in the profile description it produces a 413.

Sorry that was a mistake. The profile form and image upload form are separate.

The problem is that the browser is serializing form contents into an HTTP request in a way we cannot predict. For example this is a capture of the HTTP body of an upload of a 100 bytes image file consisting of y bytes.

-----------------------------16848676312674021315335614183
Content-Disposition: form-data; name="_method"

POST
-----------------------------16848676312674021315335614183
Content-Disposition: form-data; name="_csrfToken"

d392b13f5852570d35f7ca01050c55abc524078dd1123b38feb84552bd2361c95b6167e9bd20cea79dea1fade8a3b2aab5ae54a8b9046cd6a8abf7ae462b1f5c
-----------------------------16848676312674021315335614183
Content-Disposition: form-data; name="image"; filename="100.png"
Content-Type: image/png

yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
-----------------------------16848676312674021315335614183
Content-Disposition: form-data; name="_Token[fields]"

ec7b528d9b2bfa733e3707d5debc534da25661b2%3A
-----------------------------16848676312674021315335614183
Content-Disposition: form-data; name="_Token[unlocked]"

-----------------------------16848676312674021315335614183
Content-Disposition: form-data; name="_Token[debug]"

%5B%22%5C%2Fen%5C%2Fuser%5C%2Fsave_image%22%2C%5B%22image.name%22%2C%22image.type%22%2C%22image.tmp_name%22%2C%22image.error%22%2C%22image.size%22%5D%2C%5B%5D%5D
-----------------------------16848676312674021315335614183--

The total request body size is 1238 bytes (it uses CRLF). I don’t think we can really predict the boundary text of -----------------------------16848676312674021315335614183, the file name etc.

jiru commented 7 months ago

I don’t think we can really predict

Well it turns out there is a way:

var myForm = document.getElementById("upload-avatar-form");
formData = new FormData(myForm);
const response = new Response(formData);
response.blob().then(blob => console.log(blob.size));

1238
jiru commented 7 months ago

Also I would like to recommend avoiding hardcoding the value 1024*1024, you can use a meaningful variable name and maybe add a comment mentioning nginx client_max_body_size variable.

cblanken commented 7 months ago

I've added a check for the form size. At the moment it's 1MB + 4KB since I wasn't sure if we wanted to bump up the client_max_body_size to allow for 1MB avatar images or reduce the MAX_IMAGE_BYTES such that it fits within the default client_max_body_size. The alert uses the same error message as the image size check. I think it would be confusing to have a distinction to the user. All they need to know is the image they submitted is too large.

I also changed up the image size check so it triggers immediately on upload to the form so the user doesn't have to add the image and click the "Upload" button. The alert just pops up immediately, and the form is reset automatically so the user can't accidentally submit it if it's too big.

jiru commented 6 months ago

Thank you! Your changes look good. Maybe some refactoring can be done around the 3 alert() calls and preventDefault()?

If we are to increase client_max_body_size, I believe it should be much larger, but I am not sure what’s a reasonable value, and I’m not sure about security implications. Until we figure this out, you can make your code assume client_max_body_size is 1024*1024.

cblanken commented 6 months ago

While refactoring this, I decided to remove the secondary detection for the image size. We only really care about the total form size anyway, and the error message to the user will be the same regardless. This makes it much simpler and there's no need to wait for the form submission or use preventDefault().

If you think there's a real need for a separate check for the exact file size, let me know and I can work it back in, but I think this solution accomplishes the same goal in a simpler manner.

jiru commented 6 months ago

Thank you! I think it’s fine this way. :+1:

By the way, why the catch() code? What kind of exceptions does it catch?

I tested your code with a file of size 1047437 bytes, and I discovered it sometimes did not trigger the alert. The reason is that the boundary string (as in multipart/form-data; boundary=---------------------------273766397821774083071161053295) used by Firefox is based on an integer that varies in size. Sometimes it’s 55-char long, sometimes it’s 57. The boundary string appears 6 or 7 times in the request, adding up to a potential 10-20 bytes difference in size each time. If the boundary string used during the check and during the actual form submission are different, it can result in a 413 error. While it’s unlikely to happen in real life, maybe we can take 100 bytes off or so, just to be safe.

cblanken commented 6 months ago

By the way, why the catch() code? What kind of exceptions does it catch?

I had assumed the response.blob() might throw an exception, but evidently not. See Response.blob(). Nothing else in there can throw an exception either, so I'll remove it :+1:

I tested your code with a file of size 1047437 bytes, and I discovered it sometimes did not trigger the alert. The reason is that the boundary string (as in multipart/form-data; boundary=---------------------------273766397821774083071161053295) used by Firefox is based on an integer that varies in size. Sometimes it’s 55-char long, sometimes it’s 57. The boundary string appears 6 or 7 times in the request, adding up to a potential 10-20 bytes difference in size each time. If the boundary string used during the check and during the actual form submission are different, it can result in a 413 error. While it’s unlikely to happen in real life, maybe we can take 100 bytes off or so, just to be safe.

Wow, that's strange. I couldn't get an image to the exact size to confirm, but I see what you mean when looking at a packet capture. It might be possible to rework the form to submit the exact form data (same boundary strings included) via JavaScript if that sounds like something worth doing. Otherwise, I think you're right. The small padding should be good enough.

Anyway, I've added a padding of 100 bytes like you mentioned. :+1:

2colours commented 6 months ago

I think going by the client-side, preventive approach doesn't mean the nginx threshold cannot be increased at all. It would be clearer if the client-side code didn't try to account for the nitty-gritty details of the nature of that threshold. It could simply be that the code checks for 1MB exactly and the nginx server accepts considerably larger (say, 1,5 MB or 2 MB) payloads technically.

jiru commented 6 months ago

@2colours Sure, but this solution has maintenance benefits. It saves us from having to think about it again in the future. Next time we’ll decide to increase client_max_body_size, we can just copy this value into the JS code without ever having to wonder what’s a correct "considerably larger number".

2colours commented 6 months ago

@jiru

we can just copy this value into the JS code without ever having to wonder what’s a correct "considerably larger number"

This is not true. The "considerably larger number" is hardcoded into the client as 100 bytes and it's not foolproof, it just "seems to work". This has no conceptual advantage over deciding on a number that serves a similar purpose on server side. Moreover, if that number is like 1MB, it's uncomparably less probable some serializing will run out of the HTTP size.

cblanken commented 5 months ago

Is there anything else I can do to make this PR ready? Or does it require some more discussion to decide on an ideal client_max_body_size for the server to better handle different payload sizes?

jiru commented 3 months ago

@2colours I think you misunderstood the purpose of the 100 bytes. If you think it is not foolproof, you are welcome to open a new issue describing a specific case when it does not work. I believe the advantage is not to be considered in terms of conception but in terms of code maintenance cost.

@cblanken It’s alright. Sorry it took so long to merge!

2colours commented 2 months ago

If you think it is not foolproof, you are welcome to open a new issue describing a specific case when it does not work.

@jiru I think you have this backwards. It's wrong until proven right, not the other way around. All that I can see here is that the data serialisation is unpredictable and that "100 bytes or so" will just do.

I believe the advantage is not to be considered in terms of conception but in terms of code maintenance cost.

Why not both? It feels that what I said just went unnoticed.

There is a hack applied at client side and we are specifically trying to cover a corner case. Now we seem to believe that all files that pass the client-side check will be accepted by the server. What is certain, though, is that somebody can compress a file to a valid size that will be already refused at client side, regardless whether it would be currently accepted by the server or not (in some cases it would be, in some others it would overflow).

An equally simple and maintainable solution, as opposed to the arbitrary "padding", would be to make the "padding" at server side only (that's one hardcoded value swapped for one hardcoded value). There are two immediate benefits to this:

The only disadvantage of this approach is that you can't do calculations in nginx config files (you can even create variables, sigh, just not do math on them). If the limit has to be changed (how often do you think this would happen, by the way?), indeed somebody has to do the math offline, rather than directly rewrite the value. Do note, though, that this is still one change both in the client and in the server, just like with the client-side hack; the only difference is that the client-side hack uses Javascript to make this one calculation.

Of course one could further iterate and try to mitigate this disadvantage but I feel it's already overstated. Having to make a calculation that can be clearly indicated in a comment, every 5 years, and on the other hand you have something that is more flexible and actually does what it communicates.