NeuroTechX / moabb

Mother of All BCI Benchmarks
https://neurotechx.github.io/moabb/
BSD 3-Clause "New" or "Revised" License

Uniform decentralized data storage #161

Open v-goncharenko opened 3 years ago

v-goncharenko commented 3 years ago

Some datasets' download links tend to go stale, and data could be lost. As we started discussing in office hours, it would be nice to have our own copy of all the data and to be able to store it in several sources at the same time for better reliability.

A solution could be to add DVC as a data management system.

But the main concern with this solution is licensing. Are we allowed to collect all the data in a single place (we could add all the license notices needed)? I'm not sure =(
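To make the redundancy idea concrete, here is a minimal sketch of a mirror-fallback download helper. Everything in it is hypothetical (the function names, URLs, and fetcher are illustrations, not MOABB's actual API): the point is only that a dataset tracked in several storage back-ends can survive one link going stale.

```python
from typing import Callable, Sequence

# Hypothetical helper, not MOABB's real API. A real `fetch` could wrap
# urllib, pooch, or a DVC pull; here it is injected so the logic is testable.
def fetch_with_fallback(mirrors: Sequence[str],
                        fetch: Callable[[str], bytes]) -> bytes:
    """Return the dataset bytes from the first mirror that works."""
    errors = {}
    for url in mirrors:
        try:
            return fetch(url)
        except Exception as exc:  # a real version would narrow this
            errors[url] = exc
    raise RuntimeError(f"all mirrors failed: {errors}")

# Fake fetcher simulating a stale primary link (the URLs are made up):
def fake_fetch(url: str) -> bytes:
    if "stale" in url:
        raise IOError("404 Not Found")
    return b"eeg-data"

data = fetch_with_fallback(
    ["https://stale.example.org/ds.zip", "https://mirror.example.org/ds.zip"],
    fake_fetch,
)
```

The first mirror fails, the second succeeds, and the caller never notices; DVC's remotes give this behaviour out of the box.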

sylvchev commented 3 years ago

This is an important question that is often discussed during office hours. To summarize the previous discussions: we also considered Datalad, but it seems less appropriate for MOABB. There is also some opportunity to associate with OpenNeuro or NeuroHub. That being said, I like the DVC approach.

Regarding licenses, we should check each dataset and see if the data are released under a specific license. For self-hosted datasets, we need to contact the authors. It could be a good opportunity to populate the wiki page; it will help to set up a nice table in the documentation.

Div12345 commented 2 years ago

Renewing this discussion as per a conversation with @sylvchev: we could email the corresponding labs and create a copy of the datasets in central storage from which access is easier and faster. Options like Zenodo and OSF could also be considered, while maintaining the FAIR principles for better metadata management.

v-goncharenko commented 2 years ago

> That being said, I like the DVC approach.

Currently I'm trying some approaches to this DVC route, but it requires modifying the Dataset interface. I think we will do that for our purposes anyway and then discuss a PR to the main project (I expect to visit you after New Year).

> It could be a good opportunity to populate the wiki page; it will help to set up a nice table in the documentation.

By the way, why do you store this information in the wiki when we have documentation? Right now it lives in two different places, and having this table rendered on the site would be nice, especially for newcomers.

v-goncharenko commented 2 years ago

Regarding OpenNeuro, Zenodo, and the others: these are only platforms for storing data, so in that case we would need to create and maintain the full code infrastructure for downloading, error handling, progress bars, and so on.

With DVC this is already done for us, and done well. That's why I like this approach.

sylvchev commented 2 years ago

> Currently I'm trying some approaches to this DVC route, but it requires modifying the Dataset interface. I think we will do that for our purposes anyway and then discuss a PR to the main project (I expect to visit you after New Year).

Interesting. Do you think you could come up with a dataset interface that could handle both? If you could visit, this could be a nice occasion for a sprint ;)
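One way an interface could "handle both" is to make the Dataset class delegate file resolution to a pluggable storage back-end. The sketch below is purely hypothetical (none of these class names, the `moabb-data` repository, or the path layout exist in MOABB); a real DVC back-end would call DVC's API where the comment indicates, but the structural idea is independent of that.

```python
from abc import ABC, abstractmethod

# Hypothetical sketch only; MOABB's real Dataset class is not reproduced here.
class StorageBackend(ABC):
    """One way of materialising a dataset file locally."""
    @abstractmethod
    def data_path(self, subject: int) -> str: ...

class DirectURLBackend(StorageBackend):
    """Classic behaviour: fetch from the dataset's original download URL."""
    def __init__(self, base_url: str):
        self.base_url = base_url
    def data_path(self, subject: int) -> str:
        # a real version would download and cache the file here
        return f"{self.base_url}/subject_{subject:02d}.mat"

class DVCBackend(StorageBackend):
    """Alternative behaviour: resolve the file through a DVC repository."""
    def __init__(self, repo: str):
        self.repo = repo
    def data_path(self, subject: int) -> str:
        # a real version would pull the tracked file via DVC here
        return f"{self.repo}#data/subject_{subject:02d}.mat"

class Dataset:
    """The dataset itself stays agnostic to where its files actually live."""
    def __init__(self, backend: StorageBackend):
        self.backend = backend
    def get_data(self, subject: int) -> str:
        return self.backend.data_path(subject)

legacy = Dataset(DirectURLBackend("https://example.org/bnci"))
mirrored = Dataset(DVCBackend("https://github.com/example/moabb-data"))
```

With this shape, switching a dataset between its original links and a DVC mirror is a one-line change at construction time, which is roughly what a joint PR would need to negotiate.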

> By the way, why do you store this information in the wiki when we have documentation? Right now it lives in two different places, and having this table rendered on the site would be nice, especially for newcomers.

I agree, it is redundant to have information stored in two different places. This is a transitional state. The objective is to have automatically generated pages that expose this information. One positive outcome of the wiki is that it gives us a way to verify that the information generated in the documentation is correct.