Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License
4.57k stars 2.79k forks source link

GROUP BY and aggregations for Azure Cosmos DB SQL API client library for Python #25868

Open James-Beckwith opened 2 years ago

James-Beckwith commented 2 years ago

Is your feature request related to a problem? Please describe. I want to be able to send queries with groupings and aggregations to automatically summarize the information in my cosmos instance table for various, scriptable instances via the Azure Cosmos DB SQL API client library for Python. The field I want to group by is variable and is one of a list of items (all of which I want to group by). Doing this via the portal is possible but slow and tedious and hard to extract the results into a meaningful form. I would like to script a loop whereby I change the group by item, run the query, and store the results after some post-processing in python.

Describe the solution you'd like I can send a query to the query_items method of a containerProxy class which includes GROUP BY and COUNT statements and the query is run against the cosmos db and results returned.

Describe alternatives you've considered Not using cosmos db.

Additional context This is the same as https://github.com/Azure/azure-sdk-for-python/issues/20497 which is closed without being resolved and without any update in almost a year.

ghost commented 2 years ago

Thank you for your feedback. This has been routed to the support team for assistance.

catalinaperalta commented 2 years ago

Thanks for reaching out @James-Beckwith! Adding @simorenoh to help out with this issue.

simorenoh commented 2 years ago

Hi @James-Beckwith, thank you for filing this feature request and for using our SDK!

Optimizations to our querying capabilities (like these) are present in our backlog, but we have not been able to get to it yet. I'll make sure to keep you posted as we progress and as any updates develop on my end.

ghost commented 2 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @simorenoh, @gahl-levy, @pjohari-ms, @JericHunter.

Issue Details
**Is your feature request related to a problem? Please describe.** I want to be able to send queries with groupings and aggregations to automatically summarize the information in my cosmos instance table for various, scriptable instances via the Azure Cosmos DB SQL API client library for Python. The field I want to group by is variable and is one of a list of items (all of which I want to group by). Doing this via the portal is possible but slow and tedious and hard to extract the results into a meaningful form. I would like to script a loop whereby I change the group by item, run the query, and store the results after some post-processing in python. **Describe the solution you'd like** I can send a query to the query_items method of a containerProxy class which includes GROUP BY and COUNT statements and the query is run against the cosmos db and results returned. **Describe alternatives you've considered** Not using cosmos db. **Additional context** This is the same as https://github.com/Azure/azure-sdk-for-python/issues/20497 which is closed without being resolved and without any update in almost a year.
Author: James-Beckwith
Assignees: simorenoh
Labels: `feature-request`, `Cosmos`, `Service Attention`, `Client`, `customer-reported`, `cosmos-v4-python`
Milestone: Backlog
w0ut0 commented 1 year ago

Is there somebody that wrote a wrapper/package with some more advanced support for CosmosDB in Python (GROUP BY, PATCH,..)?

MaxTh0ma1s commented 1 year ago

Was trying to run some group by queries today and was perplexed to see that it not supported via python.

Hopefully this will be fixed in an upcoming release asap!

miqm commented 1 year ago

@xiangyan99 @simorenoh any update on ETA?

simorenoh commented 1 year ago

@miqm Thank you for showing interest in this feature. We currently have an ETA for September for this to be rolled out, I will keep updating any issues referring to this as we develop the changes.

lucasdasilva0404 commented 1 year ago

News this issues?

stepbot commented 1 year ago

We are now coming up on 2 years since the original ticket #20497 was opened. Is there any progress to report? Are we still aiming to get this by September?

MarcelKuhn commented 1 year ago

How can this be possible? Doesn't that belong as a top priority? I have a very large cosmosDB running, without group by the DB is not usable for me.

Is there an update?

masonperalta commented 1 year ago

Hi @simorenoh, are there any updates with this? I still see GROUP BY as being listed under limitations in 4.5.1 (2023-09-12). Thanks in advance!

MaxTh0ma1s commented 10 months ago

Hello @simorenoh,

Any sign of a fix on this issue?

Do MS genuinely ever plan to support cosmos group bys via this python API?

LeszekKalibrate commented 8 months ago

2024 AD and still not implemented

simorenoh commented 8 months ago

Hi everyone, hope you've all been doing well. Like you all say, unfortunately this has yet to be put together - a lot of things have changed for us in the team since I last provided an update.

This feature has been in our internal conversations at the very least on a monthly basis, during which time we have had several asks from customers and leadership to work in some of our most recently released features we were missing instead (hierarchical partitioning, transactional batch, made some documents for bulk workarounds, etc) and as such we have had to re-prioritize this.

Like I did mention, however, we have discussed this feature internally - and I can share some of the conclusions of these.

The development of the feature is not only non-trivial, but doing a group by query by anything other than partition key would require shuffling documents from the service to the client - and only then we would be able to group the documents. This would require high client resources (memory), would be very expensive RU-wise (since we do the aggregation client-side we'd have to fetch a lot of additional documents from the service) and would have high latency because the documents would need to be shuffled on the client like previously mentioned.

Long story short, we have not forgotten this, but we have been unable to get to it - we apologize for the delays, and as usual I will make sure to keep this thread updated as we reach conclusions on what we should do. Thank you all for being so patient and attentive.

macnotes commented 8 months ago

simorenoh, I've been awaiting this because it could save my app a ton of time and tons of data transfer. I'd gladly pay for the RUs in exchange for the efficiency.

That said, I really appreciate that you took the time to share this update and your thoughtful explanation. Thank you.

MaxTh0ma1s commented 8 months ago

But groups bys are currently possible via cosmos web front end using a sub query?

What can't the python api at least offer same option to pass in a query string containing the same sub query?

stepbot commented 8 months ago

Really appreciate the update here.I want echo max here though and say that there is some lack of clarity to me as to why this is possible on data explorer/javascript? But not in python?Would be great to get a better understanding of the differences between the two. Is that already happening clientside somehow, even though the group by is in the query string? I mean if the takeaway message is it can’t be done server side then that’s worth knowing. I have written client side joins that serve the purpose but they are both not in the query strings and they are slow as you mentioned. It works but it isn’t great.If there won’t be/isn’t anything that can be done to improve them server side then asking for this really feels pretty pointless and mainly serves as a quality of life improvement in terms of being able to make a single request rather than two consecutive requests. Stephan BotesOn Jan 31, 2024, at 4:58 PM, Max Thomas @.***> wrote: But groups bys are currently possible via cosmos web front end using a sub query? What can't the python api at least offer same option to pass in a query string containing the same sub query?

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: @.***>

LeszekKalibrate commented 8 months ago

Yeah.. that's the question - if CosmosDB DataExplorer can handle it, why not Python? SELECT c.deleted, COUNT(1) AS Count FROM c where c.tenantId ='1' and c.type='Car' GROUP BY c.deleted

Response: [ { "deleted": true, "Count": 1 }, { "deleted": false, "Count": 181255 } ]

simorenoh commented 8 months ago

I am not entirely positive on what the discussion is with the use of the subquery you are all mentioning - would be happy to find out more about this if you all have some samples or information I can use. If you are talking about doing something like this top rated answer though: https://stackoverflow.com/questions/45965437/documentdb-error-cross-partition-query-only-supports-value-aggreatefunc-for

Where you might wrap your query from something like this:

'SELECT top 5 c.person_name,max(c.running_speed) FROM c where c.place = "winner" group by c.person_name'

To something like this:

'SELECT top 5 VALUE m FROM (SELECT c.person_name,max(c.running_speed) FROM c where c.place = "winner" c. group by c.person_name) as m'

Yes, something like that might run and return results in the Python SDK - but it still faces the problems that I previously mentioned. Running a group by query in this manner does not guarantee that you will receive all of the results that you are looking for from the backend - that is to say, it might "run" but it does not necessarily "work". If the results are spread throughout more than one partition, the aggregation would still be missing on the client side.

I know there's some remarks we point out here with limitations that are present to group by in general too, not sure if these apply to what you all are talking about though: https://learn.microsoft.com/azure/cosmos-db/nosql/query/group-by#remarks

If anyone would like to provide more information on this I'd be more than happy to learn more as well.

simorenoh commented 8 months ago

@LeszekKalibrate the changes were made to the JS SDK (that serves as the code behind the Data Explorer) back when there was limited funding in the Python SDK - as such the Python SDK has been lagging behind its other counterparts for some time.

github-actions[bot] commented 5 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @AbhinavTrips @bambriz @pjohari-ms @simorenoh.

maxiedaniels commented 5 months ago

It's wild that these basic queries aren't supported by the python library.

github-actions[bot] commented 1 month ago

Hi @James-Beckwith, we deeply appreciate your input into this project. Regrettably, this issue has remained unresolved for over 2 years and inactive for 30 days, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.