dat-ecosystem-archive / datproject-discussions

a repo for discussions and other non-code organizing stuff [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

dat-index #4

Open max-mapper opened 10 years ago

max-mapper commented 10 years ago

given that so much scientific data lives in open ftp/http directories, it would be nice if we had a tool that could take either a list of files or a root URL (doing the spidering/traversal itself) and index them in dat, so that the blobs appear to exist in dat but are actually stored only in the original location

pros:

cons:

when blobs are indexed, dat basically acts as a proxy. when you index a file, dat should probably hash the file and store the hash in its metadata. when you replicate indexed data, there should be an option to replicate with blobs or without blobs
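To make the hash-in-metadata idea concrete, here is a minimal Python sketch. It is not dat's actual implementation: the record layout, the choice of SHA-256, and the names `index_entry`, `replicate`, and `fetch` are all assumptions made for illustration.

```python
import hashlib
from typing import BinaryIO, Callable, Dict, List


def index_entry(url: str, stream: BinaryIO, chunk_size: int = 65536) -> Dict[str, object]:
    """Hash a remote file's bytes and keep only metadata, not the bytes.

    The record fields (url, sha256, size) are a hypothetical layout.
    """
    digest = hashlib.sha256()
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        size += len(chunk)
    return {"url": url, "sha256": digest.hexdigest(), "size": size}


def replicate(
    index: List[Dict[str, object]],
    fetch: Callable[[str], bytes],
    with_blobs: bool = False,
) -> List[Dict[str, object]]:
    """Copy index records; optionally pull each blob from its original location.

    `fetch` is a caller-supplied function (hypothetical) that returns the
    bytes behind a URL. Fetched blobs are checked against the stored hash.
    """
    out = []
    for entry in index:
        record = dict(entry)
        if with_blobs:
            data = fetch(str(entry["url"]))
            if hashlib.sha256(data).hexdigest() != entry["sha256"]:
                raise ValueError(f"hash mismatch for {entry['url']}")
            record["blob"] = data
        out.append(record)
    return out
```

Replicating without blobs copies only the small metadata records, while replicating with blobs also detects files that changed at the source after they were indexed.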

this can definitely be written as a standalone CLI tool outside of dat for experimentation purposes

we can do cool stuff, e.g. dat-index --watch, which would update the index metadata of a folder whenever its files change
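A rough Python sketch of what a watch mode might do for a local folder, using simple polling. The `dat-index --watch` flag is only proposed in this issue; the `snapshot` and `diff` helpers below are hypothetical names, and comparing size plus modification time is an assumed stand-in for real change detection.

```python
import os
from typing import Dict, List, Tuple

# relative path -> (size in bytes, modification time in nanoseconds)
Snapshot = Dict[str, Tuple[int, int]]


def snapshot(root: str) -> Snapshot:
    """Record size and mtime for every file under root."""
    snap: Snapshot = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            snap[os.path.relpath(path, root)] = (st.st_size, st.st_mtime_ns)
    return snap


def diff(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """List the paths whose index metadata would need updating."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in new.keys() & old.keys() if new[p] != old[p]),
    }
```

A watcher would call `snapshot` on an interval, pass the previous and current results to `diff`, and re-hash only the added and changed paths.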

transcranial commented 8 years ago

@maxogden Hi Max, just wondering -- did this idea ever get further developed at all? Currently we're doing a lot of fetching/syncing with ftp://ftp.ncbi.nlm.nih.gov, and something like this would be incredibly useful.

max-mapper commented 8 years ago

@transcranial Hi! I've been making slow progress. The design of Dat itself has evolved a lot since I opened this issue, but I think the idea behind it is still very accurate and would actually be a lot easier to build today than with the Dat of 1.5 years ago.

I've been working on a crawler: https://github.com/maxogden/electron-microscope

Also I should mention @bmpvieira has some NCBI specific tools under the https://github.com/bionode project.

Right now Dat can do a static snapshot of a version of a set of files, but for file sets that change we are still working on a "dynamic" mode for Dat where you get a single Dat link but can subscribe to data changes. Currently Dat links only describe the exact files at the time you create the link.

You should definitely hang out in our Gitter room, or the Code for Science room as well. There have been some good discussions related to this topic happening in there lately.

https://gitter.im/codeforscience/community
https://gitter.im/datproject/discussions