[Open] subashsivaji opened this issue 3 years ago
Hi,
There is currently no support for the partition elimination using the connector.
I'm marking it as a feature request. @billgib /@euangms can provide more details about future feature development work.
Thanks, Tim
@billgib /@euangms Any ETA on this? Due to this limitation we had to fall back to the Azure Data Factory mapping data flows CDM connector, but we wanted to use spark-cdm-connector for our client.
We are currently going through our next planning cycle; this request is tracked as 978616.
I second subashsivaji's request. This would be a great feature for big data platforms where the data volume for a given entity is high, even though only records from the past year or so get modified. In that case, we end up reading the complete data of the entity and then filtering the dataframe for further processing. This becomes a performance bottleneck as the data grows, so it would be great if this could be implemented as a feature.
As I interpret the documentation, there is room for conjecture that the CDM object doesn't need to actually contain the data but is a wrapper over Delta Lake. Is this true? If so, what compute resources process the queries?
@SQLArchitect Spark CDM connector currently doesn't support Delta. We built a workaround to use CDM (schema only) with Delta Lake for our SaaS product.
@TissonMathew I would love the opportunity to speak with you about this. Whether or not it supports Delta, what is providing the compute? Is it the M engine running on a multi-worker-node Spark cluster?
Is there any news on this Delta feature? We would also like to use it.
These are not Delta Lake specific.
Hello, I wanted to follow up: is there any news on this partition elimination feature? It would be helpful if we could specify which partitions to load and exclude the rest. @bissont
Hello, I would also very much like to hear whether there are any plans to implement partition elimination.
We are using Databricks Runtime 6.6 (Spark 2.4.5) with com.microsoft.azure:spark-cdm-connector:0.18.1.
When reading data from a CDM folder, how do we do partition elimination using spark-cdm-connector? For example, I have a contact entity/table in a CDM folder which contains 2 million records.
In non-CDM scenarios, Spark prunes partitions on its own: if the data is written partitioned by a column, a filter on that column skips whole partitions at read time.
Using the CDM folder structure and spark-cdm-connector, when we do the following, every read scans all 2 million records, loads them into a dataframe, and only then filters on modifiedon. This is of course not scalable. Are there any alternatives, or a better way to do partition elimination? Please suggest.
Please see the plan: there is no partition elimination, just a scan followed by a filter.