cryptomator / cryptomator

Cryptomator for Windows, macOS, and Linux: Secure client-side encryption for your cloud storage, ensuring privacy and control over your data.
https://cryptomator.org
GNU General Public License v3.0

Drive File Stream Quota Management #1093

Open whitephoenix117 opened 4 years ago

whitephoenix117 commented 4 years ago

I believe this will fit the requirements for a bug. I apologize ahead of time for the length.

Note: as I am not a Google engineer, I am taking some liberties with how exactly Drive File Stream works, based on my observations.

Edit: Added Case 3

Background

Google Drive File Stream (Drive FS) allows you to "stream" files from Drive without having them synchronized locally. This is very helpful for managing disk space.

Normally, when a file is accessed via Drive FS, Google's "magic" lets you download only the specific portions of a file that are required for the task at hand, similar to how a physical disk only accesses the sectors of data needed. For example, if you are viewing a 90-minute video file, Drive FS will only download the blocks related to the 2 minutes that your player is buffering locally. This also works when seeking to an arbitrary point within the video. Drive FS will then continue to download blocks as requested by the OS, just like a traditional disk.

Typically, when a file is accessed by "streaming", Drive FS will only create a single download request for a series of blocks from the disk. As more blocks are needed, it will "resume" this download until the needed blocks have been downloaded and provided to the OS. This resume process will repeat as needed.

OK, so what's the problem?

Quotas.

For "security reasons" Google limits the number of "download requests", and if you exceed the limit they ban you for 24 hours. For the same reasons, Google does not publish exactly what these limits are.

Cryptomator breaks Drive FS's ability to "resume" downloads, creating a massive number of requests to Google's servers, which will result in you getting banned.

Things I have noticed that trigger excessive download requests to google

- Browsing/scanning a large vault (1 download per file, folder, etc. to get basic metadata)
- Searching a vault (same reason as above)
- Opening/consuming large files


Google will ban you

Windows 10 64-bit, Cryptomator 1.4.15

Steps to Reproduce

Case 1

  1. Open a vault that is not stored locally (online only) or cached by Drive FS
  2. Check your download log from google
  3. You will see a unique download request for every file/ directory you browsed inside the vault.
  4. Get banned by google

Case 2

  1. Create an .ISO file container and fill it with a bunch of stuff, let's say 20,000 pictures.
  2. Store the ISO in the vault in Drive FS
  3. Ensure the ISO is in "online only" mode (no local downloaded/ cached copy)
  4. Mount the ISO using the mount tool of your choice
  5. Start a "slide show" viewing a new picture every few seconds
  6. Check your download log from google
  7. You will see a unique download request for each picture being shown, despite them being in a single file container (.ISO file)
  8. Get banned by google

Case 3

  1. Load a video file into the vault
  2. Ensure the video is in "online only" mode (no local downloaded/ cached copy)
  3. Play the video
  4. Check your download log from google
  5. You will see a large number of individual download requests for the same file
  6. Get banned by google

Edit: Here is an example of a Google access log. You can see there are multiple download requests per second for the same file, totaling ~1,400 in 20 minutes.

Expected Behavior

- 1 download "request" for each file accessed
- Metadata management/local caching to prevent file explorer activities from triggering a download for each file/directory in the vault

Actual Behavior

Many many many download requests for each file accessed

Reproducibility

Always

Additional Information

Can provide on request

overheadhunter commented 4 years ago

Interesting observation, thanks for sharing this.

Case 1: Accessing multiple files (i.e. during a search operation)

I don't believe there is really anything we can do about it. Cryptomator is just the middleman between the process accessing files and the underlying file system. If a process decides to not just look at metadata but actually read from files, Cryptomator has to obey and access the corresponding ciphertext file, thus triggering a download.

Case 2: Accessing multiple blocks within a single large file:

Of course there is no way to tell the underlying file system to "resume a download", since there is no API for this. However, we can investigate what our access pattern looks like.

It should be:

open file
read file
read file
read file
close file

It should not behave like:

open file
read file
close file

open file
read file
close file

open file
read file
close file
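The two access patterns above can be sketched in Python. This is only an illustration, not Cryptomator code: `CountingOpener` is a hypothetical stand-in for a streaming backend, and it assumes (as the report suggests, but nothing in this thread confirms) that every open of a file counts as one download request.

```python
import os
import tempfile


class CountingOpener:
    """Hypothetical backend stand-in: counts one 'request' per open()."""

    def __init__(self, path):
        self.path = path
        self.opens = 0

    def open(self):
        self.opens += 1
        return open(self.path, "rb")


def read_keep_open(opener, block_size, blocks):
    """Desired pattern: open once, read several blocks, close once."""
    parts = []
    with opener.open() as f:
        for _ in range(blocks):
            parts.append(f.read(block_size))
    return b"".join(parts)


def read_reopen_each_time(opener, block_size, blocks):
    """Undesired pattern: open, seek, read one block, close, for every block."""
    parts = []
    for i in range(blocks):
        with opener.open() as f:
            f.seek(i * block_size)
            parts.append(f.read(block_size))
    return b"".join(parts)


def demo():
    # Write a small temporary file so the sketch runs anywhere.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4096))
        path = tmp.name
    try:
        kept, reopened = CountingOpener(path), CountingOpener(path)
        same_data = (read_keep_open(kept, 1024, 4)
                     == read_reopen_each_time(reopened, 1024, 4))
        return same_data, kept.opens, reopened.opens
    finally:
        os.remove(path)
```

Both functions return the same bytes; only the number of opens differs (one versus one per block). Whether Drive FS really maps opens to download requests this way would have to be checked against the access log.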

Components we have to look at: fuse-nio-adapter (linux/mac), dokany-nio-adapter (win) and cryptofs. @cryptomator/libraries

whitephoenix117 commented 4 years ago

Case 1:

Would it be possible to have an encrypted local DB (SQLite?) that could store basic file information, to try to manage the impact of this? Of course this would need to be optional and treated more like a cache, to handle all the syncing-related issues and conflicts.

overheadhunter commented 4 years ago

How would you define "basic file information"?

For metadata like file name, size and modification date it is already not required to download a file.

infeo commented 4 years ago

I suggest transforming this issue into a feature request and optionally opening a bug report to investigate the behaviour mentioned by @overheadhunter.

Cryptomator is first of all designed to access locally stored files. In this case this wouldn't be a problem if a requested file is downloaded as a whole when it is needed, because then you can make as many filesystem calls as you want.

~As far as i know cryptofs splits up read&write operations in chunks of a certain sizes (@overheadhunter please correct me if i'm wrong). If you want to read a big file as a whole, there will be a lot of single read operations in cryptofs, but in the end you get your file. Depending on the application, An example using fictional values:~

~Even if these values would be real, for todays hardware due to optimization not a problem when everything is stored locally.~ With Drive File Stream, however, you only get the chunk that you actually want to read:

Normally when a file is accessed via drive stream google's "magic" will allow you to download only the specific portion of a file that are required for the task needed,

This means that for each call a request is sent to the server, and each request counts toward the quota.

~I don't know the exact chunk size. But by design we can't improve this situation except by allowing to use a different chunk size.~

So, the crucial fact here is the number of filesystem calls. I know from the dokany-nio-adapter that for big files a lot of read requests are made. Another example:

The basic dokany mirror example is used to mirror a directory which contains a file of size ~310.148 MB. When I copied this file to another location, 424 calls to the ReadFile function were made. The dokany-nio-adapter is, as the name suggests, an adapter to fit the Dokan API to cryptofs. Therefore, in this example at least 424 read calls would be made to the Drive File Stream driver. If and how this driver caches things is beyond my knowledge, but let us assume there is no optimization and all calls are translated into web requests. Comparing this number with the provided Drive File Stream log, this may well be the case.

Edit: Updated due to direct comment below.

overheadhunter commented 4 years ago

As far as i know cryptofs splits up read&write operations in chunks of a certain sizes (@overheadhunter please correct me if i'm wrong). If you want to read a big file as a whole, there will be a lot of single read operations in cryptofs, but in the end you get your file.

This is not entirely true. CryptoFS creates a file channel when it is asked to create one. It closes it when it is asked to close it. Between those two events the requester can read from the file. This is normal I/O behaviour for any process.

The only thing CryptoFS does, is reading a bit more than requested, as it needs whole chunks in order to do the MAC checks. Due to chunk buffering, it won't read things twice, unless cache eviction happens.
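The behaviour described here (reading whole chunks and buffering them so nothing is read twice until eviction) can be sketched as follows. This is a rough illustration under stated assumptions, not CryptoFS's actual implementation: the class name `ChunkedReader`, the `fetch_chunk` callback, the chunk size, and the LRU eviction policy are all hypothetical, and MAC verification is left out.

```python
from collections import OrderedDict


class ChunkedReader:
    """Serves byte ranges by loading whole chunks and caching the most recent ones."""

    def __init__(self, fetch_chunk, chunk_size, cache_capacity):
        self.fetch_chunk = fetch_chunk        # hypothetical backend: chunk index -> bytes
        self.chunk_size = chunk_size
        self.cache_capacity = cache_capacity
        self.cache = OrderedDict()            # chunk index -> bytes, in LRU order
        self.backend_reads = 0

    def _chunk(self, index):
        if index in self.cache:
            self.cache.move_to_end(index)     # mark as most recently used
            return self.cache[index]
        self.backend_reads += 1               # cache miss: whole chunk is read
        data = self.fetch_chunk(index)
        self.cache[index] = data
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)    # evict the least recently used chunk
        return data

    def read(self, offset, length):
        if length <= 0:
            return b""
        out = bytearray()
        first = offset // self.chunk_size
        last = (offset + length - 1) // self.chunk_size
        for index in range(first, last + 1):
            chunk = self._chunk(index)
            base = index * self.chunk_size
            start = offset - base if index == first else 0
            end = min(len(chunk), offset + length - base)
            out += chunk[start:end]
        return bytes(out)


def demo():
    payload = bytes(range(256)) * 4           # 1024 bytes of sample data
    chunk_size = 256

    def fetch(i):
        return payload[i * chunk_size:(i + 1) * chunk_size]

    reader = ChunkedReader(fetch, chunk_size, cache_capacity=2)
    a = reader.read(10, 20)                   # loads chunk 0
    b = reader.read(100, 50)                  # served from the cached chunk 0
    reads_after_two = reader.backend_reads
    reader.read(300, 10)                      # loads chunk 1
    reader.read(600, 10)                      # loads chunk 2, evicts chunk 0
    reader.read(10, 20)                       # chunk 0 must be loaded again
    return a == payload[10:30], b == payload[100:150], reads_after_two, reader.backend_reads
```

In this sketch two small reads inside the same chunk cost one backend read, and a repeat read only hits the backend again after the chunk has been evicted.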

whitephoenix117 commented 4 years ago

@infeo "Let's say cryptofs splits up read operations into chunks of 32 KiB. Then a 1 GiB (= 1,048,576 KiB) file needs a fantastic 32,768 calls to be read."

I am not sure how much you can vary the chunk size, but depending on the use case it can take a long time to get banned by Google, up to a few hours. If you could reduce the request count by 10x, this might be enough not to hit Google's limits.
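The arithmetic behind these figures can be checked with a short Python sketch. It assumes, purely for illustration, that every read call turns into exactly one request; the 32 KiB value is the fictional one from the quoted example, and `read_calls` is a hypothetical helper name.

```python
KIB = 1024
GIB = 1024 ** 3


def read_calls(file_size, read_size):
    """Number of reads needed to cover file_size bytes (ceiling division)."""
    return -(-file_size // read_size)


calls_32k = read_calls(GIB, 32 * KIB)     # read size from the quoted example
calls_320k = read_calls(GIB, 320 * KIB)   # a read size ten times larger
```

With 32 KiB reads this gives 32,768 calls for a 1 GiB file; with reads ten times larger it drops to 3,277, which is the roughly 10x reduction mentioned above.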

@overheadhunter "For metadata like file name, size and modification date it is already not required to download a file."

I am rapidly approaching the limit of my technical expertise. I mean whatever data is needed for Windows Explorer to list the files in a directory, perform a search, or for another application to do a library scan. This could vary greatly depending on the use case. Perhaps it could include the last accessed blocks of a file up to a certain size limit; hopefully this would be enough to keep certain requests local to the PC.

To generalize: I only use Drive FS as a sync/backup tool, but as the world continues to move to the cloud I would expect more and more providers to adopt this streaming model, especially for enterprise. All providers would likely have these request caps to prevent abuse. Cryptomator's ability to support this type of streaming use case will likely become more and more relevant as time goes on.

infeo commented 4 years ago

Cryptomator breaks Drive FS's ability to "resume" downloads, creating a massive number of requests to Google's servers, which will result in you getting banned.

What I can imagine is that Drive File Stream uses certain system features. In Windows, the filesystem can determine in some cases whether a file is used by another program. Maybe Drive File Stream also has this ability and can continue streaming a file. If it were just some basic caching mechanism, it could detect that the same file is read twice.

infeo commented 4 years ago

@whitephoenix117 Can you make similar tests with the dokany mirror example? It would be interesting if this application using the windows API also quickly hits the limit.

I added the log of my test run with it, and it can be seen that the reads are mostly consecutive.

whitephoenix117 commented 4 years ago

@infeo Yes. I should be able to do some testing tonight

I have tried copying files directly from the vault to a local location using windows explorer. This is completed with a single download request to google.

In this case it only triggers a single download request to google and the file transfer rate is limited by your internet bandwidth, or whatever your system bottleneck is for places with fast internet.

whitephoenix117 commented 4 years ago

@infeo

I think I got this correct, but I couldn't figure out how to get the debug version of Dokan to log. From the google end it doesn't appear that it worked.

Here is the chain of virtualization levels Drive FS --> Cryptomator --> Dokan Mirror

The file was a video, accessed through the M:/ directory. I played the first 2 minutes. Here is the Google access log:

[image: Google access log]

whitephoenix117 commented 4 years ago

What I can imagine is that Drive File Stream uses certain system features. In Windows, the filesystem can determine in some cases whether a file is used by another program. Maybe Drive File Stream also has this ability and can continue streaming a file. If it were just some basic caching mechanism, it could detect that the same file is read twice.

According to their Open source attribution Drive FS uses Dokan/ FUSE too.

infeo commented 4 years ago

Ohh, I'm sorry I was not totally clear. 🙈

I meant trying the mirror example without Cryptomator. Cryptomator uses Dokan to provide an unencrypted view of your vault (the mounted drive). Mirror any directory on your File Stream drive and access it by, e.g., streaming a movie file.

Here are brief instructions for using it. Presumably, since you use Cryptomator 1.4.15, Dokan is already installed.

  1. Open a terminal and navigate to the Dokan installation: cd "C:\Program Files\Dokan\DokanLibrary-1.3.1\sample\mirror\"
  2. Start the Dokan mirror example. The following command mirrors a directory on your file stream drive mounted on M:\ with debug output enabled and redirected to the file dokanMirror.log on your Desktop: .\mirror.exe /r G:\oogle\File\Stream\Dir /l M /d /s > > %userprofile%\Desktop\dokanMirror.log
  3. Stream the file (e.g. movie)
  4. End the program by hitting Ctrl+C at least twice in the terminal window.
  5. Upload the log here.

Regarding the Open Source Attribution: Interesting! But I think stacking these drivers on top of each other should not cause a problem.

whitephoenix117 commented 4 years ago

@infeo OK, it looks like there were still a lot of access requests using the mirror, but I'm not sure there were as many as when using Cryptomator directly.

whitephoenix117 commented 4 years ago

@infeo

Anything else I can do to help with troubleshooting for this?

infeo commented 4 years ago

Not that i know. This feature is not very high on the prio list, so don't expect results soon.

whitephoenix117 commented 4 years ago

Not that i know. This feature is not very high on the prio list, so don't expect results soon.

Thanks. I understand you set your priority based on impact and number of affected users, and this is not very high. Let me know if there is anything I can do to contribute.

whitephoenix117 commented 4 years ago

I'm not sure this is especially useful for troubleshooting, since the integration is completely different, but it appears mountainduck https://mountainduck.io/ is a workaround for this issue. I am currently doing some more testing to confirm.

dosentmatter commented 3 years ago

@whitephoenix117, did you reach a conclusion on mountainduck? Does it let you stream files? Does it have quota issues?

whitephoenix117 commented 3 years ago

From what I can tell, MountainDuck uses a different block size (not sure if this is the correct terminology) while streaming the encrypted data, such that it only sends 1 Google API request every few seconds as opposed to several per second. For my use case this seems sufficient to avoid triggering any quota issues. However, I am not sure this is a "fix"; it is perhaps more of a band-aid.

P.S. Mountainduck has its own quirks. For managing on-line vs offline files/ sync it's not as good as the 1st party google software. I don't have confidence to rely on it for uploading files. Only for streaming (reading) them.

Alex E Mena
