I very much share the concern that it needs to be clear to site authors that the number of buckets isn't something they need to be particularly concerned about, lest they bias towards using a small number of buckets and defeating the benefits of multiple storage buckets.
Also, it's not clear from the text here if this is intended to be an implementation hard limit or some type of quota mechanism?
At a higher meta level, it seems like it would be great to have some guidance about what's the right size for a bucket. For example, if writing an offline music application, there are some pretty clear hierarchy levels at which to make the cut.
My intuition is that per-album/playlist is the right balance.
More general guidance for when there isn't as clear domain alignment would be that buckets make sense as soon as the amount of data we're talking about reaches 10 MiB. This might be an alternate means of dealing with the bucket limit issue. We define buckets to take up a minimum quota usage of 10MiB (or other value) and that therefore you may be limited in how many buckets you can create by the quota granted to your origin through implicit and explicit user interaction.
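For concreteness, here's a rough sketch (not part of any proposal) of how a site could reason about such a per-bucket minimum charge, using the existing navigator.storage.estimate(); the 10 MiB figure is just the value floated above:

```js
// Back-of-envelope sketch: if each bucket carried a minimum quota charge of
// 10 MiB, the number of buckets an origin could create would fall out of the
// quota it has already been granted.
const MIN_BUCKET_CHARGE = 10 * 1024 * 1024; // assumed 10 MiB minimum per bucket

async function maxBucketsUnderMinimumCharge() {
  // navigator.storage.estimate() resolves to { usage, quota } for the origin.
  const { usage = 0, quota = 0 } = await navigator.storage.estimate();
  return Math.floor((quota - usage) / MIN_BUCKET_CHARGE);
}
```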
Thank you for the thoughts @mkruisselbrink & @asutherland ! Added some of my personal thinking here, but would be awesome to get alignment on this point 🙂
An implementation supporting 2 buckets is likely quite different from an implementation supporting 1000 buckets.
That's a good point... thanks for pointing this out. Anything too low seems like it would disincentivize API usage, but too many would also create a poor experience. Instinctively I think 10 seems like a reasonable limit, but I'm open to thoughts.
if this is intended to be an implementation hard limit or some type of quota mechanism?
My initial intent was to have a hard limit, so an origin won't be able to abuse the API by creating thousands of buckets, which could affect the performance of other sites.
At a higher meta level, it seems like it would be great to have some guidance about what's the right size for a bucket.
That's a good point. In your example of the music application, my personal thought is that buckets would be divided into bigger groups, split by function and importance / expected lifetime: a bucket for the user's personal playlists that are in heavy rotation (which you'd prefer never be evicted), a bucket for this week's recommended playlists (which may expire after a week), etc. Something with a completely different function, like analytics, would have its own separate bucket that can be deleted/evicted independently.
But at the same time I also wouldn't want to add something that would disincentivize its usage. Whether by hard limit or quota mechanism, I'm curious how many buckets you'd expect an origin to be able to have at any one time?
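A rough sketch of the split described above; the bucket names are made up, and the option names (durability, expires) follow the explainer's draft shape rather than any settled API:

```js
async function setUpMusicBuckets() {
  // Heavy-rotation personal playlists: data we'd prefer never be evicted.
  const playlists = await navigator.storageBuckets.open('user-playlists', {
    durability: 'strict',
  });

  // This week's recommended playlists: fine to lose after a week.
  const recommended = await navigator.storageBuckets.open('weekly-recommendations', {
    expires: Date.now() + 7 * 24 * 60 * 60 * 1000,
  });

  // Analytics lives in its own bucket so it can be deleted/evicted independently.
  const analytics = await navigator.storageBuckets.open('analytics');

  return { playlists, recommended, analytics };
}
```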
We define buckets to take up a minimum quota usage of 10MiB (or other value) and that therefore you may be limited in how many buckets you can create by the quota granted to your origin through implicit and explicit user interaction.
How do you see the creation limit being expressed in this scenario? Do you see it erroring on bucket creation once it has been reached?
An implementation supporting 2 buckets is likely quite different from an implementation supporting 1000 buckets.
That's a good point... thanks for pointing this out. Anything too low seems like it would disincentivize API usage, but too many would also create a poor experience. Instinctively I think 10 seems like a reasonable limit, but I'm open to thoughts.
One of my take-aways from discussions in the ServiceWorkers WG was that teams within a company that operate sub-sites on a single origin may not operate under a global coordination scheme. Making developers worry about how to divvy up a resource of which there are potentially only 10 seems like it would encourage people not to use buckets except in very exceptional cases.
My initial intent was to have a hard limit, so an origin won't be able to abuse the API by creating thousands of buckets, which could affect the performance of other sites.
I think having buckets have a minimum quota cost seems like a more dynamically scalable situation than a hard limit, while addressing scenarios where a site might try and use buckets as a means of data storage that isn't charged against quota. If a user really wants a site to use 100 GiB of storage, should that site be limited to the same number of buckets as a random site the user has never visited before?
That's a good point. In your example of the music application, my personal thought is that buckets would be divided into bigger groups, split by function and importance / expected lifetime: a bucket for the user's personal playlists that are in heavy rotation (which you'd prefer never be evicted), a bucket for this week's recommended playlists (which may expire after a week), etc. Something with a completely different function, like analytics, would have its own separate bucket that can be deleted/evicted independently.
But at the same time I also wouldn't want to add something that would disincentivize its usage. Whether by hard limit or quota mechanism, I'm curious how many buckets you'd expect an origin to be able to have at any one time?
Expect? Unsure. Want? (quota usage) / 10MiB.
My goal at this time would be for the browser to have maximally granular choices to make about bucket discarding under storage pressure. An origin that has 2x 2 GiB buckets plus 1x analytics bucket (which it has an interest in heavily gaming to ensure it never gets cleared) doesn't provide a lot of options, especially as access patterns would most likely touch every bucket during every session. An origin that has 40x 100 MiB buckets, each of which could likely have an accurate MRU date associated with it, would be amazing, because it lets a naive bucket-discarding algorithm make a lot more clear-cut, less risky decisions.
How do you see the creation limit being expressed in this scenario? Do you see it erroring on bucket creation once it has been reached?
I think that in general sites will fall into 2 categories: sites that don't actively pay attention to quota (the common case), and sites that are very aware of quota.
For the first, common case... as the origin asks for more buckets that exceed the quota we're willing to give it, we'd start discarding buckets from the origin. In the lead-up to discarding the origin's own buckets, this might involve discarding some buckets from other origins first.
For a site that's very aware of quota, we'd potentially have a couple of events we might be able to tell it about, such as a "bucket-discarding" event: you call waitUntil() on this event with a promise that you'll resolve when you're done with the cleanup, and then we'll re-evaluate the most recent openBucket() call. (Note that this would never be used to wake up an origin and tell it to clean itself up; I believe there is consensus that we would never wake up a ServiceWorker to give it an opportunity to respond to storage pressure, because that would be the worst time to do it, has privacy implications, and undercuts any motivation for sites to use buckets responsibly.) Our handling would otherwise be the same as the common case, except we'd fire the "bucket-discarding" event and potentially wait for it to finish.
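To make that flow concrete, here's a sketch of what such a handler could look like; the "bucket-discarding" event name and shape are entirely hypothetical and exist in no spec:

```js
// Hypothetical service-worker-style handler for the flow described above.
self.addEventListener('bucket-discarding', (event) => {
  event.waitUntil(
    (async () => {
      // Free whatever is cheapest to lose; once this promise resolves, the
      // browser would re-evaluate the most recent openBucket() call.
      await caches.delete('prefetched-recommendations');
    })()
  );
});
```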
One of my take-aways from discussions in the ServiceWorkers WG was that teams within a company that operate sub-sites on a single origin may not operate under a global coordination scheme. Making developers worry about how to divvy up a resource of which there are potentially only 10 seems like it would encourage people not to use buckets except in very exceptional cases.
I don't recall these partners showing interest in using buckets for isolation between product teams. My impression is they don't have too much trouble doing that with database naming conventions, etc. Maybe it would make them a bit less concerned about using too much disk, but they seem more concerned about user impact there and less about impacting another team. Finally, I think there are cross-product integrations that would want everything in the same quota bucket to avoid some data disappearing, etc.
Edit: Note, the service worker discussions took place because they didn't have an equivalent method of isolation to our database naming, etc.
It's a question we could ask them more directly, though. @ayuishii, what do you think?
The hypothetical I was thinking of was more like a team thinking: "If I risk using buckets but some other sub-site has used up some of the very finite allowed number of buckets, then my sub-site can break, so I just won't use buckets." I'm very confident in teams being able to prefix their storage names to avoid conflicts. (Also, I was thinking of comments by non-Googlers.)
It would definitely be interesting to hear what those partners think about the possibilities of having the ability to create a lot of buckets, especially from sites that are only intermittently used. I would expect sites that see daily usage and/or are continually opened in pinned tabs to not need to worry about bucket discarding due to storage pressure and not want to deal with the overhead of using a bunch of buckets. I would expect intermittently used sites are more likely to be interested in more graceful/granular discarding.
It'd also be interesting to know, if using a ton of buckets isn't appealing, whether letting a bucket opt in to Cache-granularity discarding would be an acceptable trade-off. Continuing with the offline music player scenario and @ayuishii's bucket usage proposal, it would meet my idealized granular quota dreams if the "recommended playlists" bucket used a separate Cache for each of these playlists, while the site could still use CacheStorage.match() to avoid dealing with the partitioning. There has been some discussion in https://github.com/w3c/ServiceWorker/issues/863 in this area, albeit more focused on per-Response LRU eviction.
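A minimal sketch of that per-playlist Cache idea, shown against the default storage area's caches for simplicity (names are illustrative):

```js
// One Cache per recommended playlist, so each playlist could in principle be
// discarded on its own.
async function cacheRecommendedPlaylist(playlistId, trackUrls) {
  const cache = await caches.open(`recommended-playlist-${playlistId}`);
  await cache.addAll(trackUrls);
}

// CacheStorage.match() searches every Cache in the storage area, so lookups
// don't need to know which playlist a track was stored under.
function findCachedTrack(trackUrl) {
  return caches.match(trackUrl);
}
```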
All that said, I'm open to the idea that in reality multiple storage buckets might only be used like conceptual fire safes/lock-boxes/flight data recorders.
Ah, sorry. I misinterpreted your concern as folks wanting a separate bucket for every product.
How many buckets per origin do you expect to be reasonable in practice? I think we are trying to reason about per-bucket overhead and we might come to different conclusions for less than 10 buckets-per-origin vs 1000s of buckets-per-origin. For example, do you have a separate database internally for every bucket vs a single database that has a column for bucket-id, etc.
My primary concern is the UX related to quota. If origins use a lot of moderately sized buckets, it becomes easy to grant origins quota incrementally and to reclaim it incrementally, without user involvement and without sites constantly appearing to forget everything on a device with limited storage. Ideally we will adjust our storage implementation to whatever provides the best UX for this and can survive the reality of the usage patterns of the web.
Do other browsers have documentation about their existing or planned quota management strategies, particularly expected behavior when operating with limited storage and whether user prompting is/will be involved? (Edit: To be clear, my ongoing plan has been to pin all my hopes on multiple storage buckets.)
Thanks for raising concerns here. I agree it would be valuable to gather more developer feedback on how the API will be used. The API is very much still in its early stages, and we wouldn't want to move forward if the design doesn't match the use cases. I'm thinking an Origin Trial would be a good opportunity to gather this feedback. Does this sound reasonable?
Do other browsers have documentation about their existing or planned quota management strategies, particularly expected behavior when operating with limited storage and whether user prompting is/will be involved?
I haven't been able to find any documentation from other browsers for their quota management strategies. This page is the best resource I'm aware of for quota management comparisons.
I found the Chrome Web Storage and Quota Concepts doc while reading some other Chrome proposals which nicely characterizes current LRU data-clearing (which is also what Firefox currently uses). Many thanks to the authors of that doc and kudos on the many explanatory diagrams!
There's a good amount of quality discussion here covering the intended use cases of buckets in general. I opened #60 to focus on just limiting the number of buckets used (which might or might not be exposed via something like maxCount). @asutherland wdyt?
As far as some of the other ideas here, such as firing events when a site asks for a bucket and there's no room, I think the simplest thing to do for now is to stick with what we have --- QuotaExceededError when a site tries to store and there's no more room. The site can clean itself up and try to create a bucket again on its own, just as it can free up space and then try to store more things in IDB on its own. But if there is demand for this kind of thing in the future, we can consider extending the API later.
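A sketch of that clean-up-and-retry pattern; the bucket names are illustrative, and whether open() rejects with QuotaExceededError here is the behavior being discussed rather than settled spec text:

```js
async function openBucketWithCleanup(name) {
  try {
    return await navigator.storageBuckets.open(name);
  } catch (err) {
    if (err.name !== 'QuotaExceededError') throw err;
    // Free up space on our own terms, then try once more.
    await navigator.storageBuckets.delete('stale-prefetch');
    return navigator.storageBuckets.open(name);
  }
}
```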
We also see #44 as a topic of much interest.
I've replied on https://github.com/WICG/storage-buckets/issues/60. I'd actually seen the comment when it was made and had begun composing a response but lost it in tab bankruptcy; apologies!
As far as some of the other ideas here, such as firing events when a site asks for a bucket and there's no room, I think the simplest thing to do for now is to stick with what we have --- QuotaExceededError when a site tries to store and there's no more room. But if there is demand for this kind of thing in the future, we can consider extending the API later.
Yeah, I wouldn't worry about adding new events at this point. Also, my sketch there is along the lines of @wanderview's corruption reporting proposal and we'd want to integrate with that.
Closing as we've decided not to expose maxCount. We can re-visit if there is a request for this in the future. In the meantime, we'll plan to throw a QuotaExceededError when a site tries to create too many buckets, and add some text to the explainer and/or spec.
Thanks for the discussion!
Allow user agents to decide a maximum number of Storage Buckets for an origin.
A maxCount attribute will inform developers of the maximum number of buckets an origin is allowed to have at any one time.
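For reference, the shape this issue was proposing (ultimately not exposed, per the closing comment above) would have looked roughly like:

```js
// Hypothetical attribute; maxCount was never added to the API.
const maxBuckets = navigator.storageBuckets.maxCount;
console.log(`This origin may hold up to ${maxBuckets} buckets at one time.`);
```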