valums commented 11 years ago

Hey,

It would be really great to find a way to calculate and send the hash of the file from the client to avoid uploading something that already exists on the server.

If this is possible for large files, it could save a lot of time for some users. And it would feel amazing.

I guess JS with FileReader would be too slow as of now, but it's worth checking anyway. If you see any mentions of native hashing functions included in the newest version, please let me know.

Cheers, Andrew

rnicholus commented 10 years ago

Seems possible, but only in browsers that support FileReader. We'll look into this more in a future release.

rnicholus commented 10 years ago

As far as I know, hash calculation time is proportional to the size of the file to be hashed. There is nothing we can do about this, as far as I can tell.

Another concern is running out of browser memory when reading the file before calculating the hash. This, we can probably deal with by splitting files into small chunks and feeding them into an MD5 calculator, chunk by chunk, until we are done. Most likely, a library such as SparkMD5 would be imported into Fine Uploader to handle hash determination. SparkMD5 adds some useful features (such as the ability to calculate a hash in chunks) on to fairly well-known md5.js script written by Joseph Myers. Unless I completely misunderstand the MD5 algorithm and the SparkMD5 source, this should allow us to ensure that we do not exhaust browser memory while calculating a file's hash.

It may be useful to calculate the hash in a web worker as well, so we can free up the UI thread.

rnicholus commented 9 years ago

related to #848

shehi commented 8 years ago

+1: I desperately need this feature in my CMS as well. Priorly I was using JumpLoader Java applet for upload purposes, and it had this feature - it could send the hash of overall file at the start of upload, and once server said "we don't have it, send it", it sent file in chunks and during every chunk request it also hashed the chunks. I see this issue is like 3 years old - is there any plans to implement this feature please? Thanks.

rnicholus commented 8 years ago

@shehi Funny you mention this - I already had to implement file hashing client-side in order to support version 4 signature support in Fine Uploader S3 (see #1336 for details). However, that only hashes small chunks, and not an entire file in one go. So, pushing this a bit further and calculating the hash of an entire file client-side is certainly doable, but I think more work is needed for larger files to prevent the UI thread from locking up.

In order to complete this feature, I expect to use V4 content body hashing as a model, but expand upon that code as follows:

[ ] Hash the file in a web worker wherever possible. Where not possible, we'll need to consider a spinner or something similar.
[ ] Determine if there is a more efficient way to hash an entire file, especially one that is multiple GB.
[ ] Determine which hashing algorithm to use. I would expect to use the one with least overhead.
[ ] Setup a new option that includes an endpoint to send the hash.

shehi commented 8 years ago

I hate this. I need to practice programming in advanced JS and NodeJS :( So out of loop here regarding the jargon.

I could actually work around this requirement with less subtle tactic: You said you can hash small chunks. That provided, I could use that to record the hash of first 2-3 chunks in database, and next time someone uploads a file, their chunk hashes could be cross checked against database, alongside with matching total file size. Same hash of first 100 kilobytes, for the file with exactly the same size = I think these odds can't be beaten. What you think? Of course, I also check file magic signatures for their real types, so there exists that parameter for cross-checking as well. I use TrID for this purpose.

http://mark0.net/soft-trid-deflist.html

rnicholus commented 8 years ago

Small chunk hashing is restricted to Fine Uploader S3, and only when generating version 4 signatures. I feel that this feature needs more input from other users before I consider implementing it. Some questions I have include:

Should we hash the whole file, or only a piece, or n pieces/bytes?
What hashing algorithm should we use?

These are questions that should be answered based on user input, as I want to create a solution that is generally usable.

shehi commented 8 years ago

Understandable. As I said, partial (as in few chunks) hashing should do the trick. And since it's small amount of data that needs hashing, who cares what algorithm is used? We are not hashing everything anyway - will be quick. Hashing could both be based on the number of chunks (instead of certain amount of bytes) or vice versa - in either case, that overall amount of data hashed should be limited for evident performance reasons.

rnicholus commented 8 years ago

As I said, partial (as in few chunks) hashing should do the trick

I'm not convinced that this is the correct approach, and am apprehensive about codifying this as the solution to this problem in a future version of Fine Uploader.

since it's small amount of data that needs hashing, who cares what algorithm is used

The server and client must use the same algorithm, otherwise the hashes will never match

shehi commented 8 years ago

It's either complete hashing, or it isn't. You already have been apprehensive regarding to implement the former, due to performance reasons alongside with some security restrictions certain browsers enforce. I don't think anyone would hesitate in doing that otherwise.

The other remaining approach is something limited, partial. And in this scenario you have plenty of data to cross-check file identity against:

File size
File magic signature (first few bytes of binary files and of certain text files are always the same)
Hashes

First two options we can check easily, even in server side. The latter can be achieved if you hash and record certain predetermined byte-ranges of files and check against those records during subsequent uploads. I believe with these 3 types of data, file identity can accurately be determined.

But there remains one problem: certain files, mostly media files, may have so called "header information", a metadata at the end of the file (please correct me if I am mistaken). Video files and image files with metadata are good examples (have to check and ascertain the location of metadata in those files though, not sure). Two different files, even with same type and magic signature can also have same trailing metadata bytes. That makes it hard to rely on this particular method.

No matter what you devise though, I believe toggle-able bad feature is always better than no feature. You can receive more inputs from community if people toggle half-baked feature on and experiment with it. Your call of course. But like this, this issue will sit here for more years to come :)

khoran commented 8 years ago

There are two features here that would be valuable to me. First is just to compute a checksum (md5 would be fine) for each chunk and send it along with the chunk. This way I can detect corruption during the upload right away and request that that chunk be re-sent. Secondly, sending the whole-file checksum upon successful completion of a file upload would allow me to verify on the server that all the chunks made into the right places in the file and give an end-to-end verification that everything was done correctly. Using SparkMD5, you could compute the overall checksum one chunk at a time while the file is uploading, so that very little extra time would be spend at the end.

rnicholus commented 8 years ago

The per-chunk checksum is already being calculated to support S3 v4 signatures, though it's not being used anywhere else at the moment. If each chunk is hashed, there isn't a good reason to re-hash the entire file as well, since this will add considerable overhead to the process, especially with very large files. As long as you combine the chunks in the order specified, the file is fine.

khoran commented 8 years ago

Is there currently a way to use the per chunk checksum when using a traditional server (not S3)? That would be valuable to me. You are correct, the overall file checksum is not strictly required, just a little extra paranoia. I'd be happy if I could use the per chunk checksum with a traditional server. Thanks.

rnicholus commented 8 years ago

Is there currently a way to use the per chunk checksum when using a traditional server (not S3)?

No, but it likely wouldn't be very difficult to integrate this into the traditional endpoint uploader.

rnicholus commented 7 years ago

This is something I'm looking into now. I don't see this being a feature implemented inside the Fine Uploader codebase. Instead, a couple small changes will be needed to make it possible to write some code that makes this possible using the existing Fine Uploader API and event system. I'll tackle this by making the required changes to Fine Uploader and then I'll write up a small integration example (probably with updates to an existing server-side example as well) that will make it easy for anyone using Fine Uploader in their project to benefit from this feature. My plan is outlined below.

Duplicate file detection on upload

Usable with any project that uses Fine Uploader.
Big win for large files.
2 possible ways to implement this, with "plan A" being the ideal.

For both plans, consider the following:

On my MacBook Pro, it takes 5 seconds to hash a 200 MB file in the browser. It will probably take less than a second to ask the server if that hash exists elsewhere. So, about 6 seconds. In either plan, a successful upload must include the client-side hash, which must be stored in the DB for future duplicate detection. If the 200 MB file is a duplicate and we uploaded it anyway, it would take 7 minutes to needlessly upload that same file on my home internet connection (which is quite fast). So, if the file is a duplicate, this 7 minute upload will be skipped entirely.

Also understand that changes to Fine Uploader will be minimal. The hashing and server communication is something that integrators will take on. I'll provide a simple example implementation as part of this issue.

Plan A

Start uploading the file immediately and start the hashing/duplicate detection process at the same time. Then cancel the upload once the file has been found to be a duplicate. The time to hash and ask the server to run a duplicate check does not adversely affect the upload time, in case the file is not a duplicate. The hypothesis here is that this is the ideal approach in terms of conserving user time.

Tasks:

[x] Update Fine Uploader to accept a "reason" for a cancel API call. A file canceled with a reason (such as "duplicate") will remain visible in Fine Uploader UI. Ideally the cancel message would be displayed as status in the upload card. The onCancel event will include the passed reason in this case, so other UI implementations (such as React Fine Uploader) can also keep the file representation visible, indicating the passed reason. These are the only changes to Fine Uploader.
[ ] Generate hash of a large file by breaking it into chunks and hashing each chunk: Blob.slice, FileReader, ArrayBuffer, and SparkMD5. You must do this in your own project.
[ ] Send request to server endpoint w/ calculated hash. If duplicate, cancel upload w/ message to display on file card. You must do this in your own project.
[ ] After (if) upload completes, send a request to the server w/ the file hash. Server must save this hash with the file record for future duplicate file queries. You must do this in your own project.

Plan B

Check for a duplicate first. Reject if duplicate, otherwise start upload. Since the upload is delayed until hashing an duplicate detection is complete, this will add 6 seconds to the 7 minute file upload.

Tasks:

[ ] Update Fine Uploader to optionally display rejected files. A duplicate file will be rejected in onSubmit, but we want to user to see the file in the upload UI anyway. Ideally the rejection message would be displayed as status in the upload card. These are the only changes to Fine Uploader.
[ ] Observe onSubmit callback in Fine Uploader. At this point, check to see if the file is a duplicate. You must do this in your own project.
[ ] Generate hash of a large file by breaking it into chunks and hashing each chunk: Blob.slice, FileReader, ArrayBuffer, and SparkMD5.
[ ] Send request to server endpoint w/ calculated hash. If duplicate, reject w/ message to display on file card. You must do this in your own project.
[ ] If file is not a duplicate, do not reject and include the hash as a parameter with the file. Server must store this hash alongside the file record for future duplicate file queries. You must do this in your own project.

shehi commented 7 years ago

Indeed, Plan-A sounds more reasonable.

On 02-10-2016 06:03, Ray Nicholus wrote:

/duplicate detection

stayallive commented 7 years ago

Wouldn't this allow anyone to add a file to their own uploads as long as the hash is known of any file on the server? I see how this is probably a bit of a stretch since if you have the hash of a file you probably have the file already too, but it might be something to take in consideration.

rnicholus commented 7 years ago

Wouldn't this allow anyone to add a file to their own uploads as long as the hash is known of any file on the server

I'm not sure I follow. This is simply a check to determine if a file exists on the server, given its hash. If it does exist, then the file is not uploaded. Can you explain the issue you are seeing a bit more?

stayallive commented 7 years ago

Well if I upload file foo.docx with hash 123 (both examples ofcourse). And given there are multiple users, then another user could simply send 123 as a hash faking a upload with that hash and gain access to my foo.docx.

However as I mentioned above this might be a extremely low "risk" since if I know the hash of foo.docx I probably already have access to it. Dropbox uses similair techniques to optimize their storage but generates hashes server side making them secure for users spoofing hashes.

shehi commented 7 years ago

Yea, Alex has a point. This being client-side tech, there are plenty of ways to spoof the data being sent. Nevertheless, we should have this feature for those who are willing to opt for it.

On 02.10.2016 23:07, Alex Bouma wrote:

Well if I upload file |foo.docx| with hash |123| (both examples ofcourse). And given there are multiple users, then another user could simply send |123| as a hash faking a upload with that hash and gain access to my |foo.docx|.

However as I mentioned above this might be a extremely low "risk" since if I know the hash of |foo.docx| I probably already have access to it. Dropbox uses similair techniques to optimize their storage but generates hashes server side making them secure for users spoofing hashes.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/FineUploader/fine-uploader/issues/585#issuecomment-250992470, or mute the thread https://github.com/notifications/unsubscribe-auth/AAn5Ac4YKFENcpVckoVMbtirBNujXiMbks5qwA8ZgaJpZM4AW9EM.

stayallive commented 7 years ago

It's definitly an awesome feature to have. It might be possible to upload a small block of the file (say max. 1mb) and also save that hash server side so that if the file is already on the server the client can prove it has the file by uploading a small portion so that the section can be validated server side. Although that might add too much complexity but makes if more usable for multi-user systems. Although I'm not a security expert so that solution could be as insecure as the original.

rnicholus commented 7 years ago

another user could simply send 123 as a hash faking a upload with that hash and gain access to my foo.docx.

How would they "gain access"? As I said before, if the file hash exists, then the file simply isn't uploaded. No one is provided access to anything.

Regardless, it's up to you to implement the feature however you want. This is really not a "feature" of Fine Uploader. It won't be baked into the library. My example will follow the plan described a few posts back.

rnicholus commented 7 years ago

This being client-side tech, there are plenty of ways to spoof the data being sent.

A spoofed hash doesn't harm anyone other than the uploader, as their file simply won't be uploaded. At least, that's how I would implement this feature.

stayallive commented 7 years ago

Ah. I totally read that part wrong. I thought this would be part of FineUploader and totally missed that. Sorry about that :)

rnicholus commented 7 years ago

The only planned change to Fine Uploader is that which is documented in "plan A" above (see the first "task"). The rest will be an integration which will be demonstrated as described in the same plan.

stayallive commented 7 years ago

Well a spoofed hash cannot be detected by the server without the file correct? So if I know the hash of a file on the server I could link it to my user or something.

But This is ONLY in case of a multi-user environment or shared storage. For a single repository of files this is ofcourse irrelevant since there are no access controls in place.

But all of this is offtopic for the described changes. Sorry for misreading.

rnicholus commented 7 years ago

Well a spoofed hash cannot be detected by the server without the file correct? So if I know the hash of a file on the server I could link it to my user or something.

Sorry, I'm really not following your logic at all. What does "link it to my user" mean? How would you do this?

rnicholus commented 7 years ago

I suggest that, if you do end up implementing this feature into your own project, you not simply serve up files without checking for appropriate permissions. The file hashing feature described here isn't really relevant to this discussion of security, since it's not meant to be anything more than a hash comparison.

stayallive commented 7 years ago

Well as I said this is only a concern in a multi user environment. Consider Dropbox for example. They use file hashes to deduplicate the files uploaded to them on the filesystem. If those hashes are client side they could possibly be spoofed by a client telling I have a file with hash foo without having the file (since it's all client side this could easily be done). Server finds the hash in the database and confirms it exists and completes the upload and "links the file" on the server to the spoofed upload of the client.

I thought this was laying the groundwork for inplementing the client side of this mechanism that could lead to the above potentionally insecure server implementations as mentioned above. So yes, this is not a concern for the changes described above.

rnicholus commented 7 years ago

Ah, I see, you had a very specific implementation in mind. But this is not something that any client-side library could ever prevent against. The best defense against this is to ensure you never blindly serve sensitive resources. Instead, a permission check server-side of some sort is prudent.

rnicholus commented 7 years ago

At this point the only changes to Fine Uploader planned to make duplicate file detection possible are represented by the above 3 commits.

shehi commented 7 years ago

@rnicholus , it is not about the weakness of your implementation, it's about client-side. Now imagine this scenario: Imagine some leaker, from within a company, leaks the hash of a file (e.g. hash = "123"). Mal-intended user, who receives this hash, manipulates client-side into believing that they are uploading a file with hash "123". The server side, seeing hash "123" already exists in its storage(s), cancels upload and just grants additional ownership access to that mal-intended user. Your code, being an open source solution, can easily be altered to make these hacks - I think we both can agree on this one. So, how can we make sure that we stop this potential threat of breach?

I just did quick brainstorming and I concluded sending the hash of random byte ranges from the file being uploaded to the server side, like 10 of such byte-ranges (including the head range of 0-100 and tail range of 0-100 for initial magic signature verification) would be a strong protection against such breach. Server side, receiving the hash alongside with these byte-ranges could do additional verification. The would be thief can't possibly fake this unless they have the actual file.

Now I don't know if it is possible to extract certain byte-ranges from the file, using client-side Jscript. If it is, good for us. Otherwise, it will be a challenge.

Additionally, all these sub-features (byte-range extraction etc) should be configurable (e.g. how long ranges should be, how many of them should be extracted etc).

P.S.: When I first +1'ed this feature, I already explained why I did so. My reasons were clearly for multi-user environment. And I strongly believe this implementation of yours should address such environments as well, otherwise it will be a half-baked solution many of us won't be able to use due to security implications.

rnicholus commented 7 years ago

@shehi I want you to understand that you are advancing a straw man argument. Hopefully you will be convinced after reading my message below. Either way, I'd like this discussion to move back towards the specific items in "plan A" from this point forward instead of going on and on about a specific flawed implementation of this that will never be part of Fine Uploader.

The server side, seeing hash "123" already exists in its storage(s), cancels upload and just grants additional ownership access to that mal-intended user.

Simple: don't do this.

We're going in circles at this point. I've already mentioned, several times, that Fine Uploader isn't going to internally implement the hashing or duplicate file detection code. You must do this. And you are free to take any security precautions you see fit when you do this based on the nature of your project. Fine Uploader will only be modified to allow a file to be canceled with a "reason". That's it. Nothing more. The hashing and server communication piece will be demonstrated as an integration point. I'll likely write an article and provide some sample code. I've updated my initial post above to make this even more clear as well.

If your project is coded such that it blindly serves up resources without appropriate permission checks, nothing can be done client side to fix that.

Once again, Fine Uploader will not generate the file hashes itself or contact your server to determine if a file is a duplicate, based on the hash. This is entirely up to you. As with anything, keep appropriate security in mind as you code.

shehi commented 7 years ago

Ray, I totally understand you. All I asked for, were some facilities so that I could actually pull off a secure application at the end. The reason I gave you that scenario sample was for you to understand under which circumstances I would use it, that's all. So question is, would you, as a developer of this awesome package, give such capabilities to us? I understand that you are coding a capacity to your suite, for it to be able to handle "cancellations due to certain reasons/promises"; what I am trying to say is that, there exists certain scenarios your suite could be extremely useful, if it could do what it already does with a little bit of extra juice (in this case, hashes of random byte-ranges from uploaded file, to be sent along with full hash, to enable us to create awesome solutions - and open source ones at that). For this, do I need to create an extra feature-request issue, or is it a "won't/can't do" case?

On 03-10-2016 15:06, Ray Nicholus wrote:

@shehi https://github.com/shehi I want you to understand that your are advancing a straw man argument. Hopefully you will be convinced after reading my message below. Either way, I'd like this discussion to move back towards the specific items in "plan A" from this point forward instead of going on and on about a specific flawed implementation of this that will never be part of Fine Uploader.
The server side, seeing hash "123" already exists in its
storage(s), cancels upload and just grants additional ownership
access to that mal-intended user.
Simple: don't do this.

We're going in circles at this point. I've already mentioned, several times, that Fine Uploader isn't going to internally implement the hashing or duplicate file detection code. /You/ must do this. And you are free to take any security precautions you see fit when you do this based on the nature of your project. Fine Uploader will /only/ be modified to allow a file to be canceled with a "reason". That's it. Nothing more.

If your project is coded such that it blindly serves up resources without appropriate permission checks, nothing can be done client side to fix that.

Once again, Fine Uploader will not generate the file hashes itself or contact your server to determine if a file is a duplicate, based on the hash. This is entirely up to you. As with anything, keep /appropriate/ security in mind as you code.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/FineUploader/fine-uploader/issues/585#issuecomment-251089881, or mute the thread https://github.com/notifications/unsubscribe-auth/AAn5AR6gxKmaDCLO6R6y-GyjByvdgtK1ks5qwO_DgaJpZM4AW9EM.

shehi commented 7 years ago

Wait, did I misread it? So FineUploader isn't going to calculate the whole hash in parallel to uploading, and cancel when server tells it "hey stop, we already have this one"?

On 03-10-2016 15:06, Ray Nicholus wrote:

@shehi https://github.com/shehi I want you to understand that your are advancing a straw man argument. Hopefully you will be convinced after reading my message below. Either way, I'd like this discussion to move back towards the specific items in "plan A" from this point forward instead of going on and on about a specific flawed implementation of this that will never be part of Fine Uploader.
The server side, seeing hash "123" already exists in its
storage(s), cancels upload and just grants additional ownership
access to that mal-intended user.
Simple: don't do this.

We're going in circles at this point. I've already mentioned, several times, that Fine Uploader isn't going to internally implement the hashing or duplicate file detection code. /You/ must do this. And you are free to take any security precautions you see fit when you do this based on the nature of your project. Fine Uploader will /only/ be modified to allow a file to be canceled with a "reason". That's it. Nothing more.

If your project is coded such that it blindly serves up resources without appropriate permission checks, nothing can be done client side to fix that.

Once again, Fine Uploader will not generate the file hashes itself or contact your server to determine if a file is a duplicate, based on the hash. This is entirely up to you. As with anything, keep /appropriate/ security in mind as you code.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/FineUploader/fine-uploader/issues/585#issuecomment-251089881, or mute the thread https://github.com/notifications/unsubscribe-auth/AAn5AR6gxKmaDCLO6R6y-GyjByvdgtK1ks5qwO_DgaJpZM4AW9EM.

rnicholus commented 7 years ago

Fine Uploader isn't going to internally implement the hashing or duplicate file detection code. You must do this. Please read my initial plan above before you comment any further.

shehi commented 7 years ago

My apologies, I totally missed (funny part upcoming) the bold text of "You must do this in your own project." :) Man, I am supposed to read the bold part first... (or did you add those parts a few mins ago?)

rnicholus commented 7 years ago

In fairness, I added/bolded that a few comments back to clarify the goals of this "feature". This was clear to me initially, but it seemed I did not convey this clearly to others after seeing comments from you and @stayallive.

Fine Uploader's API and event system will be modified to make it easier for integrators (such as yourself) to implement a duplicate file detection workflow as described earlier. In fact, I've already made all planned changes to Fine Uploader in the 585 branch (these will be part of 5.12.0). See the 3 commits a few posts above for details.

I'll likely also provide a sample integration and detail it in an article (with usable code). It's possible that this "sample integration code" will be further generalized into a standalone library that can be plugged into Fine Uploader (or some other library) to make duplicate file detection even easier to implement.

I'm not writing the hashing and duplicate file detection code as part of the Fine Uploader library for the following reasons:

This would make Fine Uploader larger and more complex. It's already quite large and exceptionally complex.
It really doesn't need to be part of the core library. Going forward, features like this should be integrated by example or standalone generic libraries.

shehi commented 7 years ago

I totally agree and thank you for all the hard work you are doing!

khoran commented 7 years ago

Great to see this being worked on. I was wondering though if you were still planning to add the per-chunk checksum feature I had mentioned previously (Feb 22)? I know that feature is available with an S3 backend, but it would be useful to me to have it for the traditional server as well. My use case is people in togo, nigeria, and other parts of the world that have very poor internet connections. They are uploading large files, usually greater than 2 GB. Network corruption is very common from these areas. So it's a real pain to spend several days (in some cases) uploading the large file only to find that the checksum doesn't match at the end. So checksuming each chunk would allow me to catch the corruption right away and re-send the chunk. I could probably compute the checksum in the onUploadChunk function, but is there any way currently to send the checksum to the server along with the chunk? If not, that would be great to have.

Thanks.

Kevin

rnicholus commented 7 years ago

wondering though if you were still planning to add the per-chunk checksum feature I had mentioned previously

I don't recall this feature. Do you have an issue number? At this point, my hands are full with maintaining/supporting Fine Uploader, working on this duplicate file detection case, working on React Fine Uploader, and a number of other projects, and this is in addition to my 9-5 work. So, I don't see the feature you speak of making it into Fine Uploader anytime soon unless someone else contributes the changes. Also, instead of baking this into Fine Uploader, I would most likely mandate that Fine Uploader instead be modified to make this possible. In other words, make the onUploadChunk callback accept a Promise as a return value and allow for per-chunk parameters to be specified via a new API method or a non-breaking update to an existing one.

khoran commented 7 years ago

It was a message on this same issue, just scroll up (https://github.com/FineUploader/fine-uploader/issues/585#issuecomment-187515189). That sounds like a good solution. I'll make a new issue for this request. Thanks.

On 10/03/2016 10:32 AM, Ray Nicholus wrote:

wondering though if you were still planning to add the per-chunk
checksum feature I had mentioned previously
I don't recall this feature. Do you have an issue number? At this point, my hands are full with maintaining/supporting Fine Uploader, working on this duplicate file detection case, working on React Fine Uploader, and a number of other projects, and this is in addition to my 9-5 work. So, I don't see the feature you speak of making it into Fine Uploader anytime soon unless someone else contributes the changes. Also, instead of baking this into Fine Uploader, I would most likely mandate that Fine Uploader instead be modified to make this possible. In other words, make the |onUploadChunk| callback accept a |Promise| as a return value and allow for per-chunk parameters to be specified via a new API method or a non-breaking update to an existing one.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/FineUploader/fine-uploader/issues/585#issuecomment-251170698, or mute the thread https://github.com/notifications/unsubscribe-auth/ABRzuPAjG03wGnSSKNB1zXHM9pMDIp0Dks5qwTwhgaJpZM4AW9EM.

rnicholus commented 7 years ago

Changes in this case have been released as 5.12.0-alpha.

chadwackerman commented 7 years ago

Just stumbled onto this and wanted to point out the WebCrypto API. It's widespread in browsers and even works in Safari. Also, MD5 is dead in 2016. It's so bad WebCrypto doesn't support it. Any client-side solution for this has holes (you always need to verify on the server) but no reason to double down with a dumb hash, too. Use SHA-256 and get on with life.

https://github.com/taher435/web-crypto-api-file-hash

rnicholus commented 7 years ago

Md5 is perfectly adequate for this feature, and in fact will make life much easier for integrators who want to backfill their DBs with hashes for files already stored in S3. Since AWS already stores the md5 hash for each file in the object's etag header, it's easy and efficient to capture hashes for all files as part of the initial integration effort.

rnicholus commented 7 years ago

After further thought, I think it's important to maintain a zero tolerance policy for rude comments. So, @chadwackerman, I have removed your most recent comment and banned you from the repository. I encourage you to read https://github.com/jonschlinkert/maintainers-guide-to-staying-positive#help-or-do-no-harm before you interact with any project/maintainer going forward.

rnicholus commented 7 years ago

A common theme here is a misunderstanding of the context in which MD5 is used here. Summary: it's not used for security. Instead, we're using it to identify content. Linus Torvalds explains the difference in a detailed post on the recent SHA1 collision attack.

dsoprea commented 6 years ago

+1

Thoughts on future steps?

rnicholus commented 6 years ago

I created a functional prototype on https://github.com/FineUploader/fine-uploader/tree/585-duplicate-file-detection which I used for an internal demo. Cleanup, tests, etc are still needed.

headdab commented 6 years ago

I would love this feature too. Any updates? status?

FineUploader / fine-uploader

Calculate and send the hash of the file on the client to avoid uploading something that already exists #585

Duplicate file detection on upload

For both plans, consider the following:

Plan A

Tasks:

Plan B

Tasks: