UpstageAI / dataverse


📝 [Docs] - Guides to use Spark Job #44

Open Taekyoon opened 5 months ago

Taekyoon commented 5 months ago

dataverse version checks

Location of the documentation: Setting Configuration

Documentation problem

When developers run a Spark job, the executor and driver settings are very important. Depending on how many executors are used and how much memory each consumes, both the cost and the execution time will differ. For deduplication in particular, the number of executors and the amount of memory are critical when processing a huge dataset.

Suggestion

The docs should explicitly show how developers can control executor resources, and how much the default settings will cost.
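
For illustration, here is a minimal sketch of the kind of explicit resource configuration the guide could show, using the standard PySpark `SparkSession` builder. The property names are standard Spark settings, but the app name and all values are placeholders for illustration, not dataverse's actual defaults:

```python
from pyspark.sql import SparkSession

# Standard Spark resource properties. The values below are placeholders,
# NOT dataverse defaults: more executors and memory shorten deduplication
# time on a huge dataset, but increase cost.
spark = (
    SparkSession.builder
    .appName("dataverse-dedup-example")        # hypothetical app name
    .config("spark.executor.instances", "8")   # how many executors to launch
    .config("spark.executor.memory", "16g")    # memory per executor
    .config("spark.executor.cores", "4")       # cores per executor
    .config("spark.driver.memory", "8g")       # driver memory (only takes
                                               # effect if set before the JVM
                                               # starts, e.g. via spark-submit)
    .getOrCreate()
)
```

Documenting the default for each of these properties, together with a rough cost estimate per default run, would address the point above.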

41ow1ives commented 5 months ago

Hello @Taekyoon! I apologize for the delayed response. You've made an excellent point about the importance of Spark job configuration. I agree that it is crucial to provide clear guidance on resource management and cost implications based on default settings. We will strive to offer more detailed guidelines on this matter. Although we aim to update the documentation by mid-April, please be aware that there might be a slight delay 😅. However, we will do our best to expedite the process. Thank you for your valuable input! We look forward to your continued interest and advice on dataverse. Have a great day.

Taekyoon commented 5 months ago

I recommend documenting these settings separately for two environments: one for cloud and one for local. Developers can get confused by settings that differ between environments. Also, when developers run on local environments, their setups may vary, so the docs should describe the local test environment as well. If these contents are included in the docs, it would be easier for them to use :)
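
For example, the guide could present the two environments side by side. The following is a rough sketch under that assumption; the function names, app names, and values are hypothetical, and in Spark's local mode the executor properties simply do not apply, which is exactly the kind of difference worth spelling out:

```python
from pyspark.sql import SparkSession

def build_local_session():
    # Local test environment: Spark runs in-process, so cluster-only
    # properties such as spark.executor.instances have no effect here.
    return (
        SparkSession.builder
        .master("local[4]")                   # run on 4 local cores
        .appName("dataverse-local-test")      # hypothetical app name
        .getOrCreate()
    )

def build_cloud_session():
    # Cloud / cluster environment: executor count and memory dominate
    # cost and runtime. The values are placeholders, not recommendations.
    return (
        SparkSession.builder
        .appName("dataverse-cloud-job")       # hypothetical app name
        .config("spark.executor.instances", "20")
        .config("spark.executor.memory", "16g")
        .config("spark.executor.cores", "4")
        .getOrCreate()
    )
```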