datastacktv / data-engineer-roadmap

Roadmap to becoming a data engineer in 2021
https://datastack.tv

Concurrency models are missing #49

Open Vlad-Radz opened 3 years ago

Vlad-Radz commented 3 years ago

For a modern data engineer, knowledge of concurrency models is important.

  1. A data engineer should know the difference between concurrency and parallelism.
  2. A data engineer should know the difference between task parallelism and data parallelism.
  3. Threads vs. processes. Example in Python: the threading and multiprocessing libraries, how they differ, and what problems Python has with threading (the global interpreter lock, GIL).
  4. A typical scenario in modern data integration: call n APIs every x seconds / minutes / hours. How can that be done with good performance? One way is to use asynchronous programming.
  5. The actor model might be good to know as well.
  6. DAGs (example: Apache Airflow) vs. state machines (example: AWS Step Functions) vs. ... . This is actually covered by 'Data structures and algorithms', but it might be good to mention it as an example of how that knowledge helps a data engineer.
  7. Parallel programming on GPUs using platforms like CUDA.
  8. Functional programming is also 'nice to have' (but not obligatory).
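
To make point 4 concrete, here is a minimal sketch of polling several APIs concurrently with Python's asyncio. It is only an illustration under assumptions: call_api, poll_once, and poll are hypothetical names, the API names and latencies are made up, and asyncio.sleep stands in for a real HTTP request so the example runs without a network.

```python
import asyncio
import time


async def call_api(name: str, latency: float) -> str:
    """Hypothetical stand-in for an HTTP request: sleeps instead of doing network I/O."""
    await asyncio.sleep(latency)
    return f"{name}: ok"


async def poll_once(apis: dict[str, float]) -> list[str]:
    """Call every API concurrently; results come back in the same order as the input."""
    calls = [call_api(name, latency) for name, latency in apis.items()]
    return await asyncio.gather(*calls)


async def poll(apis: dict[str, float], interval: float, rounds: int) -> list[list[str]]:
    """Repeat poll_once a fixed number of times, starting a new round every `interval` seconds."""
    results = []
    for i in range(rounds):
        started = time.monotonic()
        results.append(await poll_once(apis))
        if i < rounds - 1:
            # Sleep only for the remainder of the interval so slow calls do not shift the schedule.
            elapsed = time.monotonic() - started
            await asyncio.sleep(max(0.0, interval - elapsed))
    return results


if __name__ == "__main__":
    # Assumed example data: three fake APIs with different simulated latencies (seconds).
    fake_apis = {"orders": 0.10, "customers": 0.05, "inventory": 0.08}
    for round_result in asyncio.run(poll(fake_apis, interval=0.2, rounds=2)):
        print(round_result)
```

Because the calls overlap while they wait, one round takes roughly as long as the slowest call rather than the sum of all calls. For CPU-bound work the picture is different, which is where point 3 (threads vs. processes) comes in.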

If you agree with at least some of the points, I can prepare the text.

alexandraabbas commented 3 years ago

Hey, these are really good points! I'll def consider adding these to the image when I update it next time. Feel free to create a PR and add it to the markdown version. Thanks a lot for the contribution!

Vlad-Radz commented 3 years ago

Hey, thanks for the feedback! I will create the markdown version, sure!