Taekyoon opened this issue 5 months ago
Hello @Taekyoon! I apologize for the delayed response. You've made an excellent point about the importance of Spark job configuration. I agree that it is crucial to provide clear guidance on resource management and cost implications based on default settings. We will strive to offer more detailed guidelines on this matter. Although we aim to update the documentation by mid-April, please be aware that there might be a slight delay 😅. However, we will do our best to expedite the process. Thank you for your valuable input! We look forward to your continued interest and advice on dataverse. Have a great day.
I recommend documenting these settings separately for two cases: one for cloud and one for local. Developers can get confused by settings that differ between environments. Also, when developers run locally, their environments vary widely, so the docs should describe the local test environments they cover. Including this in the docs would make it easier for them to use :)
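To illustrate the split suggested above, here is a minimal sketch of environment-specific Spark profiles. The property keys are standard Spark configuration names, but the "local"/"cloud" split and all the values are illustrative assumptions, not dataverse's actual defaults.

```python
# Sketch: environment-specific Spark settings, as the comment suggests.
# Values are placeholders, not dataverse defaults.
SPARK_PROFILES = {
    "local": {
        # Laptop-friendly profile: local master, modest memory.
        "spark.master": "local[4]",
        "spark.driver.memory": "4g",
    },
    "cloud": {
        # Cluster profile, e.g. on YARN; resources tuned per job.
        "spark.master": "yarn",
        "spark.driver.memory": "8g",
        "spark.executor.instances": "10",
        "spark.executor.memory": "16g",
        "spark.executor.cores": "4",
    },
}

def spark_conf_for(env: str) -> dict:
    """Return the Spark properties for the given environment name."""
    if env not in SPARK_PROFILES:
        raise ValueError(f"unknown environment: {env!r}")
    return dict(SPARK_PROFILES[env])
```

A dict like this could then be applied via `SparkSession.builder.config(...)`; keeping the two profiles side by side in the docs would make the cloud/local differences explicit.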
dataverse version checks
Location of the documentation
Setting Configuration
Documentation problem
When developers run a Spark job, the executor and driver settings are very important. Costs and execution time vary depending on how many executors are used and how much memory each consumes. For deduplication especially, the number of executors and their memory allocation is critical when processing a huge dataset.
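The resource footprint described above can be sketched with a small helper that totals what a job will request from the cluster. The 10% memory-overhead factor mirrors Spark's default `spark.executor.memoryOverheadFactor` for JVM jobs; the function itself is an illustrative sketch, not part of dataverse.

```python
def cluster_footprint(num_executors: int,
                      executor_mem_gb: float,
                      executor_cores: int,
                      driver_mem_gb: float,
                      memory_overhead_frac: float = 0.10) -> dict:
    """Estimate total cores and memory a Spark job will request.

    memory_overhead_frac approximates Spark's default executor
    memory overhead (0.10 of executor memory for JVM jobs).
    """
    per_executor_mem = executor_mem_gb * (1 + memory_overhead_frac)
    return {
        "total_cores": num_executors * executor_cores,
        "total_memory_gb": round(
            num_executors * per_executor_mem + driver_mem_gb, 1),
    }
```

For example, 10 executors with 16 GB and 4 cores each, plus an 8 GB driver, request 40 cores and roughly 184 GB once overhead is counted, which is the kind of number the docs could surface per default configuration.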
Suggestion
The docs need to explicitly show how developers can control executor resources, and how much a run will cost under the default settings.
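One way the docs could surface default-setting costs is a rough estimator like the sketch below. The per-vCPU and per-GB hourly prices are hypothetical placeholders; real numbers depend on the cloud provider, instance type, and region.

```python
# Hypothetical unit prices, placeholders only -- substitute your
# provider's actual rates.
HOURLY_PRICE_PER_VCPU = 0.05   # assumed USD per vCPU-hour
HOURLY_PRICE_PER_GB = 0.005    # assumed USD per GB-hour

def estimate_job_cost(num_executors: int,
                      executor_cores: int,
                      executor_mem_gb: float,
                      hours: float) -> float:
    """Rough cost of running the executors for the given duration."""
    vcpu_cost = num_executors * executor_cores * HOURLY_PRICE_PER_VCPU
    mem_cost = num_executors * executor_mem_gb * HOURLY_PRICE_PER_GB
    return round((vcpu_cost + mem_cost) * hours, 2)
```

A table in the docs mapping each default configuration to an estimate like this would let developers see the cost implications before launching a deduplication job.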