Feature Request: field masks (select method)

merlinnot commented 7 years ago

Firebase Product: firestore

I'd love to see field masks behavior mimicked from https://cloud.google.com/nodejs/docs/reference/firestore/0.8.x/Query#select

As far as I know it's not a back-end limitation. If that's the case, could you please tell me if it is somewhere on the roadmap?

wilhuff commented 7 years ago

Thanks for your request!

There aren't any limitations in the backend that prevent us from implementing this. However, the mobile and web SDKs are different enough from the REST and server SDKs that this is hard to get right for all cases.

While it's straightforward enough to send a document mask the mobile/web clients are designed to cache server data and work while offline. Things get very complicated when the same document is accessed with different projections.

Firestore clients strive to provide a strong consistency model and a key part of that model is that every document you read is the same regardless of which query contained it. This model allows for seamlessly executing different queries even when the network isn't available.

Consider an app presenting a list of books, populated by a query that selected just title and publicationDate.

Now consider what happens when your user goes offline and then tries to see only their unread books, i.e. rerunning the query with title, publicationDate where readState == 'UNREAD' . At this point since we never fetched the readState we can't satisfy the query.

While it's true that some developers can plan ahead for this, we want queries to work by default rather than being broken until you happen to test exactly these kinds of cases. Most people will test fully online or fully offline but won't have the resources to force on/offline transitions between all queries to validate that different projections don't interact poorly.

A relatively simple workaround if you really want something like this now is to split your documents around the projections you want to make. For the example above: store the title, publicationDate and unread status in a document at the top-level of a collection and put details in a subcollection.

We may be able to revisit this in the future, but at this point have no plans to do so.

merlinnot commented 7 years ago

Thank you for the comprehensive answer. I appreciate the time you spent writing this.

merlinnot commented 6 years ago

@wilhuff May I kindly ask to reopen this issue since I’ve came across a legitimate use case in a real-world scenario?

I have an application which allows users to chart various data. Usually they don’t chart the same data for the same period of time twice. Docs store ~120 parameters, users usually need 2-3. Given the number of documents they might request, the load time is unacceptable.

Currently I have to write a cloud function which gets the docs using select and simply returns the result. It’s not being cached anyway.

Since the biggest problem is with caching, can we enable this method in the SDK omitting the cache? The result in my use case would be exactly the same, but development would be easier.

wilhuff commented 6 years ago

That's essentially the use case we had in mind when considering this before. The difficulty is that we haven't come up with a way of distinguishing within the API that certain actions are unsafe if you want to be able to work offline and haven't rationalized the interaction between these features and the local cache.

"We may be able to revisit this in the future, but at this point have no plans to do so" is still where I'd put this issue.

Thaina commented 6 years ago

@wilhuff I still want this tobe opened. And while you present the consistency of data. The actual use case is mostly about get and we could put the mask API there in the meantime, better than nothing

On the contrary to onSnapShot, get api don't need consistency and we have many usage around it especially on admin monitoring that we would need to query large amount of document but might need just only some field

So I propose

query.get({ documentMask : ["field0","field1.abc"] });

And it's fine if we don't have onSnapshot. But I would like to add that the existence of select API has more use case and lower our monthly cost which more precious than your safeness. The people who care about safety should not use select api and that's all

Actually any times we query more than 50 documents it's time when we do some batch operation which not really need offline consistency. So whenever we specified select it should just be ignored when offline. The point where we need offline the most is when we get at most 30 owned documents per device

Offline capability actually contradict what firestore being best at, a realtime online database

wilhuff commented 6 years ago

While I respect this point of view our problem as API designers is that there's no clearly one right way to do this now. For your use case it makes sense to have a select API that's online-only but for many others it doesn't. For example a common use case we hear is being able to drive a list view from a select, with the expectation that detail views based on the whole item be available as well. In this case it makes no sense to make the list view online only.

For now our recommendation remains to split your documents around the different views you want to display and consider moving batch work that needs select into cloud functions or backend tasks.

jsayol commented 5 years ago

Hi there. I'd like to revisit this.

It's possible I'm not seeing the bigger picture here and I might be missing something but, as it stands right now, I don't see any reason why this shouldn't be implemented.

The way I see it, the behavior of Query.select() should be analogous to that of Query.where(). With the current implementation, what happens during a normal filtered query when you encounter a cache miss while offline? Well, exactly the same should happen if you have applied a field mask and that results in a cache miss.

Now consider what happens when your user goes offline and then tries to see only their unread books, i.e. rerunning the query with title, publicationDate where readState == 'UNREAD' . At this point since we never fetched the readState we can't satisfy the query.

That's exactly what I mean. If it's a cache miss, it's a miss and that's it. What would be the behavior if the cache is completely empty and you perform that same query while offline? Then the same should happen here.

For example a common use case we hear is being able to drive a list view from a select, with the expectation that detail views based on the whole item be available as well.

I don't think it makes sense to prevent a feature like Query.select() just because some developers might not use it correctly. If they only want to show a few fields in the list view but they expect the whole document to be available, then they should be requesting the whole document without a field mask.

Firestore clients strive to provide a strong consistency model and a key part of that model is that every document you read is the same regardless of which query contained it.

The way I see it, when you retrieve data with a field mask you're not reading documents, but only some of their fields. If it makes implementation cleaner, the documents retrieved from the backend with a field mask applied could be stored in the cache using a different key than the one that would be used by the whole document.

One possible implementation would be by adding a field mask property to DocumentKey

https://github.com/firebase/firebase-js-sdk/blob/97cc3f89283d942a99c5c3e56e29eee331a0f5df/packages/firestore/src/model/document_key.ts#L21

And then modifying RemoteDocumentCache implementations to take that into account:

https://github.com/firebase/firebase-js-sdk/blob/97cc3f89283d942a99c5c3e56e29eee331a0f5df/packages/firestore/src/local/indexeddb_remote_document_cache.ts#L103

function dbKey(docKey: DocumentKey): DbRemoteDocumentKey {
  const pathArray = docKey.path.toArray();

  if (docKey.hasFieldMask()) {
    // Adding "__mask__" means this wouldn't be a valid real path
    pathArray.push('__mask__', docKey.fieldMaskHash());
  }

  return pathArray;
}

It's a crude approach and probably not the best one, but you get the idea.

Anyway, I really think this would be a very useful feature overall. Like I said before, though, I might be missing something here so feel free to offer any other insights or limitations.

Cheers :)

merlinnot commented 5 years ago

Revisiting this issue might not be a bad idea.

I always have to take it into account when I design schemas. Most usually I split collections into two: one with “basic” and one with “additional” data.

It results in a slightly more complicated and expensive (both in terms of money and computational power) solutions. I have to merge it on the front end side, which can be tricky when you have two queries you listen to with real time updates, I have to sync it myself, query both collections on the backend side, ...

It generates quite some issues.

mikelehen commented 5 years ago

Thanks for the additional feedback! You two have sparked some new discussions on this inside the team. :-)

I'm definitely sympathetic to the use case, and I'm hopeful we're able to support this in some form in the future. But we'll need to come up with clear behavior (and hopefully a sane implementation) to deal with the presence of partial documents in the cache, possibly even partial and full versions of the same document, possibly even retrieved at different points in time. So we need to figure out how to store these partial documents, how to maintain whatever consistency guarantees we end up wanting (in the presence of multiple versions of the same document), and how to execute queries against them.

FWIW- I think we're probably much more okay with showing the user inconsistent query results (where documents are missing or were retrieved at different versions; but each document is consistent) than we are with showing inconsistent document contents (where fields are missing or were retrieved at different versions). So I think we'll end up choosing different set of trade-offs wrt to consistency guarantees.

Oh, and unfortunately there's actually another obstacle. The backend doesn't support field masks in listens yet (it only works for one-off queries). I think this shouldn't be too hard to get added, but it may take some time.

Anyway, thanks for the feedback! Keep it coming. We'll hopefully tackle this at some point, but there's enough uncertainty in the design and too many higher priorities for us to tackle it in the near-term, I'm afraid.

jsayol commented 5 years ago

These are good points, I overlooked those issues when considering this. Sorry about that!

Some are easy to solve but others might be a bit tougher.

deal with the presence of partial documents in the cache, possibly even partial and full versions of the same document, possibly even retrieved at different points in time

That shouldn't be allowed to happen. When receiving a full document, the cache entries for any corresponding partials should be immediately invalidated, since those are no longer needed. A query with a field mask can be fulfilled by the full document.

It gets a bit trickier when a partial document is retrieved. In that case, the cache should be checked for its corresponding full document. If one is present, keeping both could create consistency problems, with different queries returning different data for the same documents. The overzealous approach would be to immediately invalidate the cached full document. That would provide consistency but it might be discarding perfectly good cached data. A better approach, although more computationally intensive, would be to compare the fields of the retrieved partial with those present in the full cached document. If they match (deeply) then there's no need to cache the partial, and if they don't then the full document's cache is invalidated, since at this point we know for sure that its data has become stale.

This last approach could become an issue with queries that retrieve from the backend consistently large documents and/or in large numbers. A gentler approach might be to always cache partial documents and then only go through this cache invalidation process when they're subsequently queried. Plus maybe also during any garbage collection cycles, although I haven't really looked at how that's done in the firestore sdk so not sure if that's a good option.

Then there's also the case where multiple partial documents are retrieved with different field masks, which is trickier. ~~I think the best option here is to simply cache all of them.~~ (Edit: A similar cache invalidation process should be done if there's any overlap between the fields of the partials when the values of those fields are different. If there's no overlap or the overlapping fields have the same value, then it's ok to keep both partials.) Trying to combine them is a big no-go, since the fact that they were retrieved at different points in time means their contents might be incompatible (from the user's point of view) even if they have some of the same fields. For querying, see my comment to the next point.

[Not ok] with showing inconsistent document contents (where fields are missing or were retrieved at different versions).

Definitely. A partial document should only be returned as part of a query that has the exact same field mask applied to it as when it was retrieved, or a subset of it. And that should also include any fields used for conditions (where()) and ordering.

The backend doesn't support field masks in listens yet (it only works for one-off queries). I think this shouldn't be too hard to get added, but it may take some time.

One option would be to throw an error if someone calls onSnapshot() on a query that has had a field mask applied to it, but that might create a bad developer experience. From their point of view, it wouldn't make sense why a query would work with get() but not with onSnapshot().

An alternative would be to allow it, but instead of applying the field mask when retrieving the data, simply retrieve whole documents and then apply it locally when creating the DataSnapshot. The problem though is that this might trick users into thinking they're sending much less data over the wire than they actually are, which could lead to unexpectedly-high bills. So probably not a good idea.

The best option would be to simply wait until the backend supports it. If that ends up not being possible for whatever reason, then this could be revisited.

mikelehen commented 5 years ago

Just to follow up on this, I pretty much agree with all of your points / suggestions. :-)

I'll add that the SDK already has a notion of "limbo document resolution" for when it knows that it's version of a cached document disagrees with the backend. So we could decide that when we get a partial document that's inconsistent with an already-cached full document, then we perform a limbo resolution on the full document so that our cache is consistent again. I'm not sure if this is the best approach (depending on the use case, it could result in a lot of extra lookups), but it's an option to consider.

At a high-level, I think it's probably possible to support select() in a reasonable way, but as this thread demonstrates, there's a lot of complexity and several design trade-offs to be decided upon. So it's definitely not an "easy" feature to implement. And so right now we are focusing in other areas that we see as more pressing. But hopefully we'll be able to tackle select() in the future. Sorry that it probably won't happen in the super near-term.

Thaina commented 5 years ago

@jsayol It's fine to have select method only on get. We need it. Please reopen this

mikelehen commented 5 years ago

Reopening this since we would like to expose this functionality somehow, but to recap my last post:

At a high-level, I think it's probably possible to support select() in a reasonable way, but as this thread demonstrates, there's a lot of complexity and several design trade-offs to be decided upon. So it's definitely not an "easy" feature to implement. And so right now we are focusing in other areas that we see as more pressing. But hopefully we'll be able to tackle select() in the future. Sorry that it probably won't happen in the super near-term.

Nilegfx commented 5 years ago

I understood "some" of the complexity you mentioned in this post, the rest I didn't understand at all, however, I wonder, how come to a feature like this is not considered in the first place giving that it is more important to have it in clients more than server SDK!

As a client, I am more concerned to reduce my payload than the server.

the question now, is there any update regarding this feature, or am I missing it in the documentation?

mikelehen commented 5 years ago

@Nilegfx It was definitely considered to be implemented for the mobile clients, but the complexity (as explained above) made it infeasible, while there was very little complexity on the server SDKs, so we implemented it there.

Unfortunately, there is no update. This is still something we'd like to try to tackle, but given the complexity, it's not at the top of our priority list right now. Sorry!

olegdunkan commented 5 years ago

It must be implemented!!! It will saves our money. It was one of the first question that raised when I start to use firebase. I want to preview, for example, collection with more then five fields in each document in it but I need only one in preview. Of course I can split each document and using collection group but for some users I have to show in list some fields for other some other fields and combinations might be any.

hboylan commented 4 years ago

I understand there is a lot of complexity involved in supporting the .select() query method to return a specific subset of document fields. I'm happy to see that the issue is still open after all this time as it would be a valuable feature to many firestore users. 🙏

rhuffy commented 4 years ago

+1 on this. For our use case, we store personal information in user documents and would like to display some information like name and profile photo to all users, but only display contact information like email address and phone number to "friends". Even if the fields are selectively displayed by our front-end, it is quite easy to find the raw document objects in the Chrome network log.

A cloud function wrapper is our immediate solution to avoid restructuring our database, but a built-in select() method that applies a mask server-side would be much appreciated.

i0n- commented 4 years ago

+1. In the meantime, Let's move this to a backend server request.

cbazza commented 4 years ago

+1 much needed as #3235

kimroen commented 3 years ago

For reference, there is a similar issue for iOS here that was closed and locked before this one was reopened: https://github.com/firebase/firebase-ios-sdk/issues/1372

jttaynton commented 3 years ago

+1 for this. Sent in a feature request for this previously. Would be really great for web app performance where not concerned with offline data caching to be able to select only what is needed and not have to manage different document structures.

jthegedus commented 3 years ago

+1 for this. Sent in a feature request for this previously. Would be really great for web app performance where not concerned with offline data caching to be able to select only what is needed and not have to manage different document structures.

@jttaynton For that use case you should consider using the Firestore Lite package which is a much smaller bundle and built specifically for one-time data fetching without streaming. Use Firestore Lite in your API layer and cache the response.

jokasimr commented 3 years ago

If I understand correctly the Firestore admin SDK does not cache documents. Would implementing this only for the admin SDK require handling the same caching problems as in the client case?

Would be amazing to be able to

Promise.all(
    largeListOfDocuments.map(id =>
        db.collection('collection-containing-big-documents')
            .select('only-important-field')
            .get(id))
);
// ... do something with the 'only-important-field's

instead of fetching every field just to throw them away later.

rucebee commented 3 years ago

Nice solution for web to use REST API

I ended up using Firestore REST API with the query string mask.fieldPaths=none to avoid downloading all the fields: https://firestore.googleapis.com/v1/projects/myproject/databases/(default)/documents/profiles/myusername?mask.fieldPaths=none 404 errors tell me there is no such document.

https://github.com/firebase/firebase-js-sdk/issues/2309#issuecomment-559432432 https://firebase.google.com/docs/firestore/use-rest-api

grind-t commented 3 years ago

If the SDK is limiting its functionality, then this is a clear sign that it is taking over too much. I think it makes sense to add this feature to the Lite SDK and give developers the ability to implement caching on their own.

jttaynton commented 3 years ago

+1 for this. Sent in a feature request for this previously. Would be really great for web app performance where not concerned with offline data caching to be able to select only what is needed and not have to manage different document structures.

@jttaynton For that use case you should consider using the Firestore Lite package which is a much smaller bundle and built specifically for one-time data fetching without streaming. Use Firestore Lite in your API layer and cache the response.

Essentially what we did... moving to a Node/Express API and using Admin SDK so we can fetch using .select() from there. This is a better approach anyway but would have been nice in the past to have that .select() on the client. :)

evilandfox commented 2 years ago

I think this restriction won't be removed while it forces the client to make more ~~expensive~~ reads

firebase / firebase-js-sdk

Feature Request: field masks (select method) #212