datasets / awesome-data

Curated list of quality open datasets
https://datahub.io/collections

Impossible to clone all datasets #229

Closed amirouche closed 6 years ago

amirouche commented 6 years ago

After cloning the repository with git and running npm install, I get an error about a missing datahub-client. After manually installing it, I get another error:

Error: Cannot find module 'datahub-cli/lib/utils/error'

After installing datahub-cli with npm, it still fails with the above error.

$ node --version
v9.4.0
$ npm --version
5.6.0

rufuspollock commented 6 years ago

What are you trying to do exactly?

The main location for the registry for bulk access is http://datahub.io/core/registry

amirouche commented 6 years ago

> What are you trying to do exactly?

I wanted to see how big the git repositories were. For full disclosure, I am looking for a better solution than git. At $WORK we use the same workflow, but it fails with big datasets. We are thinking of moving to a custom git backend (see 1). That said, I would prefer a solution like rawbase.

rufuspollock commented 6 years ago

@amirouche these git repos aren't that big (most are under 100 MB) -- intentionally.

In general, git does have issues with largish datasets (depending on how the diffs work, the problems start somewhere between the 100s of MB and the GB range). There are loads of potential solutions, but all involve moving to specialized tooling (the simplest is just to store complete files in e.g. S3 with versioning turned on!). What's crucial is to get clear on your use cases 😉 -- and to start as simple as possible (it's always tempting to start building your own "castle in the sky").

If you want to chat more, our chat channel is http://gitter.im/datahubio/chat

Finally, to answer your question: to find all the core datasets look at https://github.com/datasets/registry/blob/master/core-list.csv -- and then script yourself cloning them if you want.
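A minimal sketch of such a cloning script, assuming you have downloaded core-list.csv locally and that it has a header row with one column holding each repository's clone URL. The column name `github_url`, the `datasets` output directory, and the helper names `repo_urls`, `repo_name`, and `clone_all` are assumptions made for illustration; check the actual header of the file and adjust.

```python
import csv
import io
import subprocess
from pathlib import Path


def repo_urls(csv_text, column="github_url"):
    """Return the non-empty values of `column` from CSV text with a header row.

    The default column name is an assumption; pass the real one.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header {reader.fieldnames}")
    urls = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            urls.append(value)
    return urls


def repo_name(url):
    """Derive a local directory name from the last path segment of a clone URL."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return name


def clone_all(urls, dest="datasets", depth=1):
    """Shallow-clone each URL into `dest`, skipping directories that already exist.

    Returns the list of URLs whose clone failed.
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in urls:
        target = dest_dir / repo_name(url)
        if target.exists():
            continue  # already cloned on an earlier run
        result = subprocess.run(
            ["git", "clone", "--depth", str(depth), url, str(target)]
        )
        if result.returncode != 0:
            failed.append(url)
    return failed


# Example usage (requires git on PATH and a local copy of the CSV):
#   text = Path("core-list.csv").read_text(encoding="utf-8")
#   failed = clone_all(repo_urls(text, column="github_url"))
#   print("failed:", failed)
```

The shallow clone (`--depth 1`) keeps the download small if you only want the current data; drop it if you need each repository's full history, for example to measure how large the history itself is.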

amirouche commented 6 years ago

Thanks.

I am very tempted to build my own "castle in the sky".

rufuspollock commented 6 years ago

@amirouche people often are 😉 -- the problem is that most of them remain unfinished. If you want to help out with an existing effort, you can join us at https://datahub.io/ via http://gitter.im/datahubio/chat