NLeSC / guide

Software Development Guide
https://guide.esciencecenter.nl
Creative Commons Attribution 4.0 International

Dataset chapter #328

Closed by suvayu 2 months ago

suvayu commented 6 months ago

Below, describe what this Pull Request adds:

This PR removes the database section from the Python guide (as discussed in #316) and introduces a new chapter on handling datasets. The chapter discusses using local databases and other data processing libraries, along with their respective trade-offs.

maltelueken commented 5 months ago

Nice! I have also used DuckDB in combination with dplyr in R, so I might add something about using databases in R to the R language guide.

suvayu commented 5 months ago

Hi @maltelueken, that would be amazing! This also addresses the last point in the DuckDB part about combining with other tools. We were also lacking R experience, so we couldn't comment on R libraries.

bouweandela commented 4 months ago

@Morrizzzzz Would you be interested and have time to review this?

recap commented 2 months ago

The chapter could be more about data engineering, i.e., how to use these tools, or best practices for ETL pipelines.

egpbos commented 2 months ago

@recap do you have some resources to link to on data engineering and/or ETL pipelines? Sounds like a nice addition (for a new PR). We should try to restrict it to techniques/concepts we actually (can) use in projects. I think you have done some of that, no?

egpbos commented 2 months ago

Also, @recap, your suggested additions sound good, but did you also review what was already in the PR and whether it makes sense? If so, we can merge this PR as it is now and do your additions in a follow-up PR (or quickly add them to this PR if you want; I think @suvayu is on holiday anyway).

egpbos commented 2 months ago

Thank you so much @suvayu & @f-hafner for taking this initiative and @recap for the great review and additions.

... One final thing before merging is to add it to the sidebar menu, though :) I'll do that right now...