Bertverbeek4PS / bc2adls

Exporting data from Dynamics 365 Business Central to Azure data lake storage or MS Fabric lakehouse

MIT License

60 stars, 22 forks

Dataflow partitioning #14

Closed: Bertverbeek4PS closed this issue 1 year ago

Bertverbeek4PS commented 1 year ago

Currently, data is not deliberately partitioned in the dataflow. Partitioning based on a unique identifier (systemid + company) can reduce data shuffling between worker nodes and reduce execution time.

Original PR: https://github.com/microsoft/bc2adls/pull/108
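
To illustrate the idea, here is a minimal Python sketch of hash partitioning on the combined key. It is not the Synapse dataflow itself; the row layout, the `version` field, and the function names are hypothetical and exist only for this illustration. The point it shows: once every row with the same (systemId, company) key lands in the same partition, a key-based step such as keeping the latest version of each record can run inside each partition without moving rows between workers.

```python
import zlib
from collections import defaultdict


def partition_index(system_id: str, company: str, num_partitions: int) -> int:
    """Map a (systemId, company) key to a partition number.

    zlib.crc32 is used because it is stable across processes, unlike
    Python's built-in hash() for strings.
    """
    key = f"{system_id}|{company}".encode("utf-8")
    return zlib.crc32(key) % num_partitions


def partition_rows(rows, num_partitions):
    """Group rows so that equal keys always share a partition."""
    partitions = defaultdict(list)
    for row in rows:
        idx = partition_index(row["systemId"], row["company"], num_partitions)
        partitions[idx].append(row)
    return partitions


def latest_per_key(partition):
    """Keep the highest-version row per key, looking only at one partition."""
    latest = {}
    for row in partition:
        key = (row["systemId"], row["company"])
        if key not in latest or row["version"] > latest[key]["version"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample rows; "version" stands in for whatever ordering
# column a real export would carry.
rows = [
    {"systemId": "a1", "company": "CRONUS", "version": 1},
    {"systemId": "a1", "company": "CRONUS", "version": 2},
    {"systemId": "b2", "company": "CRONUS", "version": 1},
    {"systemId": "a1", "company": "Contoso", "version": 1},
]

partitions = partition_rows(rows, num_partitions=4)

# Each partition is processed independently; no cross-partition lookup is needed.
deduplicated = [row for part in partitions.values() for row in latest_per_key(part)]
```

Without key-based partitioning, rows for the same record could sit in different partitions, and the engine would have to shuffle them together before a step like `latest_per_key` could produce a correct result.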

Bertverbeek4PS commented 1 year ago

@Arthurvdv can you have a look?

Arthurvdv commented 1 year ago

I've applied this to the Synapse pipelines in my test environment and didn't notice any significant improvement. That could be due to the low volume of data; a larger data set might show more of a difference. Then again, I didn't encounter any issues, so including this shouldn't break anything and could only be beneficial. Looks good!

Bertverbeek4PS commented 1 year ago

OK, thanks for trying it out. If you approve the pull request, it will go in 😄