MicrosoftLearning / mslearn-fabric

This repository hosts Microsoft Fabric content on Microsoft Learn.
https://microsoftlearning.github.io/mslearn-fabric/
MIT License

#10-ingest-notebooks.md Optimize Delta table writes task does not make sense #120

Closed — HoergerL closed this issue 4 days ago

HoergerL commented 2 months ago

Module: Ingest data with Spark and Microsoft Fabric notebooks

Lab/Demo: 10 - Ingest data with Spark and Microsoft Fabric notebooks

Task: Optimize Delta table writes

Step: 00

Link to Lab Instructions: https://github.com/MicrosoftLearning/mslearn-fabric/blob/main/Instructions/Labs/10-ingest-notebooks.md#optimize-delta-table-writes

In my opinion, the task for optimizing Delta table writes doesn't make sense. The results of the "Create a Fabric notebook and load external data" steps are cached, so re-executing the same steps will be a lot faster regardless of which Spark config settings we set. Besides that, the Spark config settings the task wants to showcase are enabled by default anyway: https://learn.microsoft.com/en-us/fabric/data-engineering/delta-optimization-and-v-order?tabs=pyspark So in the end, the code in "Optimize Delta table writes" does exactly the same thing as the code from "Create a Fabric notebook and load external data".
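One way to confirm this inside a Fabric notebook is to query the session configuration directly. A minimal sketch — the config key names are taken from the linked Microsoft docs and may differ between Fabric runtime versions:

```python
# Sketch: inspect the current values of the two write-optimization settings
# in a running Fabric notebook session ("spark" is the session object that
# Fabric notebooks provide). Key names follow the linked
# delta-optimization-and-v-order docs and are assumptions, not verified
# against every runtime version.
print(spark.conf.get("spark.sql.parquet.vorder.enabled"))             # V-Order
print(spark.conf.get("spark.microsoft.delta.optimizeWrite.enabled"))  # optimized write
```

If both return `true`, setting them again in the "Optimize Delta table writes" step is a no-op.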

Instead, it would probably make more sense to compare the already-optimized code from "Create a Fabric notebook and load external data" with a version where we set the Spark config optimization parameters to false after restarting the session, so that the results are actually comparable.
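A sketch of that comparison setup: after restarting the Spark session, explicitly disable both optimizations before rerunning the original write code. The config key names are assumptions based on the linked docs:

```python
# Sketch: in a fresh Fabric notebook session, turn off the two write
# optimizations so the following write can be timed against the default
# (optimized) run. Key names are assumptions taken from the linked
# delta-optimization-and-v-order docs.
spark.conf.set("spark.sql.parquet.vorder.enabled", "false")
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "false")

# ...then rerun the unchanged write code from "Create a Fabric notebook
# and load external data" and compare the two run times.
```

Restarting the session first also avoids the caching effect described above, so the timing difference reflects only the config change.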

Besides that, my understanding of the two Spark config optimization parameters is that writing takes roughly 15% longer when they are enabled, so the statement "Now, take note of the run times for both code blocks. Your times will vary, but you can see a clear performance boost with the optimized code." is not accurate — the optimized code should actually take longer to write. The performance boost should only show up when we read the data.