earth-env-data-science / earth-env-data-science-book

Source repo for Earth and Environmental Data Science book
https://earth-env-data-science.github.io

Refactor data handling #32

Open · rabernat opened this issue 3 years ago

rabernat commented 3 years ago

We might want to use a utility like Pooch or a catalog tool like Intake to simplify how we handle data. It could make things a lot easier on the students. On the other hand, perhaps understanding how to deal with real data (URLs, broken links, etc.) is a valuable experience?
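For reference, the Pooch version of this would be just a couple of lines per dataset. A rough sketch (the URL below is a placeholder, not a real course dataset):

```python
# Sketch of using Pooch to fetch and cache a dataset; the URL is a placeholder.
import pooch

fname = pooch.retrieve(
    url="https://example.org/data/sst_monthly.nc",  # placeholder dataset URL
    known_hash=None,  # None skips verification; pin a sha256 here once the file is fixed
)
# fname is a local cached path, so repeated runs don't re-download the file.
```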

tjcrone commented 3 years ago

The easiest way to handle data files is to include them, wherever possible, in the repo. I agree there is great value in having students download data from primary sources, but URLs and sources change often, and whether or not we use Pooch or Intake, the notebooks are likely to break every year, and the professor will need to fix the link/source before the Tuesday lecture. Students will rarely deal with broken links themselves, and only when a fix is not made before class.

Maybe a mixed approach is best: include a few of the most often used, smaller datasets in the repo, and use links to primary sources for the other datasets, especially ones where it would be nice to have the latest data. I am not opposed to using Pooch or Intake, but the real problem is that primary-source URLs/APIs/formats change, and I don't think these tools can fix that. An effort to find data sources and URLs/permalinks that have been consistent over the years would also be worthwhile.
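To illustrate the trade-off: with Intake, a single catalog file would be the only thing we edit when a primary-source URL moves, although the catalog itself still has to be kept current. A rough sketch, assuming a hypothetical catalog.yml hosted alongside the book:

```python
# Sketch of the Intake approach; the catalog URL and entry name are hypothetical.
import intake

# One catalog maps friendly names to URLs and drivers, so a moved dataset means
# editing catalog.yml once instead of fixing every notebook.
cat = intake.open_catalog("https://earth-env-data-science.github.io/catalog.yml")
df = cat["station_temperatures"].read()  # e.g. a csv source read into a DataFrame
```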

rabernat commented 3 years ago

> The easiest way to handle data files is to include them, wherever possible, in the repo.

I like to avoid data files in the repo if possible, because they bloat the repo.

Perhaps there is an intermediate solution: we use a Zenodo record to store all the data for the course. That way we can be sure it will be immutable.
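Something along these lines is what I have in mind; the record ID and filenames below are invented for illustration:

```python
# Sketch of managing course data hosted on a Zenodo record with Pooch;
# the record ID, filename, and cache name are placeholders.
import pooch

COURSE_DATA = pooch.create(
    path=pooch.os_cache("eeds-book"),                    # local cache directory
    base_url="https://zenodo.org/record/1234567/files/", # hypothetical Zenodo record
    registry={
        "sst_monthly.nc": None,  # replace None with the sha256 once the record is published
    },
)

fname = COURSE_DATA.fetch("sst_monthly.nc")  # downloads once, then reuses the cache
```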

tjcrone commented 3 years ago

I agree that data in the repo is not ideal. The notebooks will never break, but there are downsides, including the bloat you note. In some places (e.g. https://earth-env-data-science.github.io/assignments/numpy_matplotlib.html) we have data files stored in the ~rpa directory of the LDEO web server. That is also not ideal, but those notebooks never broke, and I really appreciated that.

I like the idea of a repo for data that we control. It could be a separate GitHub repo holding the class data: it might be somewhat large, but students would not clone it, and it would stay more static than the textbook repo. Or it could be a Zenodo record, which I know less about but looks great. It would also be good to keep a few instances where students get data from primary sources, for the pedagogical upside. We could make a good effort to find permalinks/sources that are on the more stable side.
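And if the data lived in its own GitHub repo, notebooks could pull individual files over HTTPS without any extra tooling. A minimal sketch; the repo name and file path are invented for illustration:

```python
# Minimal sketch: download one file from a hypothetical course-data repo via its raw URL.
from urllib.request import urlretrieve

url = ("https://raw.githubusercontent.com/earth-env-data-science/"
       "course-data/main/data/sst_monthly.nc")  # hypothetical repo and file
urlretrieve(url, "sst_monthly.nc")  # saves the file next to the notebook
```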

tjcrone commented 3 years ago

Eventually, our Open Storage Network pod might be a good place. Poking around, I found Dryad (https://datadryad.org/stash/our_membership) which is also interesting. Lots of others on this list from Nature: https://www.nature.com/sdata/policies/repositories.

PedroVelez commented 3 years ago

Hi,

First, thanks a lot for making An Introduction to Earth and Environmental Data Science such a nice and useful tool. I learned Python by following it, and it inspired me to create https://euroargodev.github.io/argoonlineschool.

We are still working on the Argo Online School, and I am trying to overcome the problem of having large files in the repository. So far I have them hosted on a web page so that users of the Argo Online School can download them. I prefer to store the original files rather than have users download the latest version from the Argo repositories, since those change and the Jupyter notebooks may then not work.

I have followed this issue and I wonder whether you have come up with a solution, or whether you have tried Git Large File Storage. Thanks! Pedro

tjcrone commented 3 years ago

Thanks for reaching out @PedroVelez. Your book looks awesome! I'm glad this book was a help to you. I think we are leaning toward Zenodo and definitely away from large files in the repo, but I don't think we have settled on a solution. Git LFS looks very cool and might be a reasonable way forward. We like Zenodo because it provides DOIs, and we also like to have students get data directly from primary sources so they can learn to deal with the difficulties that sometimes come up, which is instructive. But we haven't settled on anything yet, and I think Git LFS should be part of the discussion. It would be great to hear how you solve this problem as you move forward.

PedroVelez commented 2 years ago

Hi, I explored Git LFS, and the user has to install it; I think it is aimed at more expert developers.

To keep it simple and suited to the target audience of the Argo Online School, I think the best option is to use Google Drive to store the contents of the ./Data folder. That way users can see the structure of the folder and decide whether to download just one file or all of them. This is the example we are using; any comments are appreciated. Pedro
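For completeness, this is roughly how a user can grab a single file from the shared Drive folder programmatically; gdown is one option, and the file id below is a placeholder:

```python
# Sketch of downloading one Drive-hosted file with gdown; the file id is a placeholder.
import gdown

url = "https://drive.google.com/uc?id=<FILE_ID>"  # placeholder Google Drive file id
gdown.download(url, "argo_profile.nc", quiet=False)
```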