gdcc / pyDataverse

Python module for Dataverse Software (dataverse.org).
http://pydataverse.readthedocs.io/
MIT License

Synchronize local directory with remote folders #51

Open hannesdatta opened 4 years ago

hannesdatta commented 4 years ago

Purpose

Synchronize a local directory with a remote folder within a dataset at Dataverse.

User story

As a user of Dataverse, I would like to be able to continuously (e.g., daily, weekly) "mirror" ongoing data collections (e.g., by means of web scraping) with a (draft) version of my dataset at Dataverse. Currently, only one-time transfers are convenient to manage using PyDataverse.

Functionality

  1. Obtain remote file metadata via get_datafiles(), taking as argument a particular folder within the remote dataset (or, by default, the entire dataset)
  2. Obtain comparable metadata for the local folder that needs to be synchronized
  3. Compare the files in (1) with those in (2), using filenames and file hashes
  4. Generate a list of actions to bring the directories in sync: (a) copy from (1) to (2), (b) copy from (2) to (1), (c) delete in (1), (d) delete in (2)
  5. Wrap the functionality in a new sync_folder() function with arguments: local_folder (default: .), remote_folder (default: .), direction (one of: mirror local to remote without deleting anything on remote; mirror remote to local without deleting anything locally; or synchronize both directories, deleting files where needed), and comparison (match on filenames only, or also on file hashes; default: hash+filename)

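A minimal, offline sketch of steps 2–4 (the names `local_index` and `plan_sync`, and the plain name→MD5 dicts, are illustrative assumptions; in a real sync_folder() the remote index would be built from the dataset's file metadata, which reports MD5 checksums):

```python
import hashlib
from pathlib import Path

def md5_of(path):
    """MD5 of a local file, read in chunks (Dataverse reports MD5 checksums by default)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def local_index(folder):
    """Step 2: map relative filename -> MD5 for every file under `folder`."""
    root = Path(folder)
    return {str(p.relative_to(root)): md5_of(p) for p in root.rglob("*") if p.is_file()}

def plan_sync(local, remote, comparison="hash+filename"):
    """Steps 3-4: compare two {filename: md5} indexes and return proposed actions.

    Note: without a change history, a file present on one side only could be
    either newly added there or deleted on the other side, so this sketch only
    proposes copies and flags conflicts, never deletions.
    """
    upload, download, conflict = [], [], []
    for name in sorted(set(local) | set(remote)):
        if name not in remote:
            upload.append(name)        # only local: candidate for upload
        elif name not in local:
            download.append(name)      # only remote: candidate for download
        elif comparison == "hash+filename" and local[name] != remote[name]:
            conflict.append(name)      # same name, different content
    return {"upload": upload, "download": download, "conflict": conflict}
```

With comparison="filename", files with matching names are treated as identical; the hash+filename default also detects changed content at the cost of hashing every local file.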
skasberger commented 4 years ago

Great issue.

I already have part of the code running for myself, because there was a project where we needed to mirror a Dataverse instance via an API. But it's very alpha, and not good in terms of completeness and maturity.

Here are some thoughts about this. Before the synchronization can be done:

  1. The from_json() and to_json() functions must be extended with the dataverse_download format (see https://github.com/AUSSDA/pyDataverse/issues/16 ), so the API response can be imported into a pyDataverse object (maybe directly, see https://github.com/AUSSDA/pyDataverse/issues/9 ) and handled further on.
  2. As I experienced some issues with downloading draft dataset metadata, this must be checked and fixed first (https://github.com/AUSSDA/pyDataverse/issues/42).
  3. Then a standard for files and folders must be established (which is partly done already, named OAISTree, see https://github.com/AUSSDA/pyDataverse/issues/5 ). Another way would be to import/export BagIts (https://github.com/AUSSDA/pyDataverse/issues/46).
  4. Finally, the diff/compare function needs to be written.

To handle all this, I used a history functionality to keep the actual state and its history up to date (see https://github.com/AUSSDA/pyDataverse/issues/43 ).

So, there are some blocking issues, but once the to_json() and from_json() functions are adapted, the rest does not seem that tricky at first thought.

@hannesdatta Is this an urgent need for you? I am working on a major release for September, where the feature freeze has already happened, so this is not going to be tackled before then. (PS: I recommend already using pyDataverse from the develop branch.)

hannesdatta commented 4 years ago

Awesome, thanks for putting this on the road map. It's not super urgent (I can move files to Dataverse manually for now). So after September is totally fine!

I would love pyDataverse to become part of a broader, more general workflow to be used at my school and beyond to (a) host Git repositories for data documentation, (b) use Dataverse to store "confidential" versions of the data (i.e., only readable to the researcher, for example, because the files are not (yet) GDPR-compliant), and (c) use source code hosted on (a) to create and share derivative datasets publicly on Dataverse. I'm working on a project that collects research templates and workflows; the site is due to be updated, but this is what we have so far: http://tilburgsciencehub.com. One of my data sharing templates uses the Java tool for Dataverse, but that one has only very basic functionality: https://github.com/hannesdatta/data-spotify-releases.

I went through your repositories and noticed our ideas about sharing data overlap very well (e.g., https://github.com/OKFNat/armsScraper, which corresponds to (a) mentioned above). As far as I've seen, though, the files are stored in Git, so that's not feasible for large-scale data projects, and Dataverse may be the way to go.

Let me know how we/I can contribute. Happy to test-drive the developer version on our workflows. Let's have a general template that others can use then!

poikilotherm commented 3 years ago

Maybe this is something for GDCC/dvcli?

I am about to ramp up development of this tool for CI usage, i.e., deploying research software to Dataverse.

skasberger commented 3 years ago

@hannesdatta Sounds interesting. So let me share my first thoughts on that.

There are a lot of ideas here, so maybe you can give me a bit more detail on what's next or where to start.

(a) host Git repositories for data documentation,

Store the Git repos on Dataverse, or somewhere else?

(b) use dataverse to store "confidential" versions of the data (i.e., only readable to the researcher, for example, because the files are (not yet) GDPR-compliant),

This can be done by restricting and/or giving permissions to datasets to certain users or groups.
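As a rough illustration of (b), the Dataverse native API exposes endpoints for restricting files and assigning roles on a dataset. The helper names below, the placeholder IDs, and the request-builder shape are assumptions for illustration; they only construct the requests, they do not send them:

```python
def restrict_file_request(base_url, file_id):
    """Build the request that marks a single file as restricted
    (PUT /api/files/{id}/restrict with body `true` in the native API)."""
    return ("PUT", f"{base_url}/api/files/{file_id}/restrict", "true")

def grant_downloader_request(base_url, dataset_id, username):
    """Build the request granting `username` the fileDownloader role on a
    dataset (POST /api/datasets/{id}/assignments in the native API)."""
    return (
        "POST",
        f"{base_url}/api/datasets/{dataset_id}/assignments",
        {"assignee": f"@{username}", "role": "fileDownloader"},
    )
```

In practice these would be sent with an X-Dataverse-key API-token header; the net effect is that the files stay invisible to the public while named users keep download access.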

(c) use source code hosted on git to create and share derivative datasets publicly on Dataverse.

What exactly do you mean by that? Generic tools, like pandas, or specific ones like my armsScraper?

Let's have a general template that others can use then!

Can you explain a bit more what you mean by templates in this context?

Let me know how we/I can contribute. Happy to test-drive the developer version on our workflows.

Testing is always great. In general, any contribution is great: discussing here, testing and sharing the tool, or contributing to the code. Maybe you can share a bit about your use case for pyDataverse, your experiences, and what you expect from it in the future.

pdurbin commented 4 months ago

As discussed during the 2024-02-14 meeting of the pyDataverse working group, we are closing old milestones in favor of a new project board at https://github.com/orgs/gdcc/projects/1 and removing issues (like this one) from those old milestones. Please feel free to join the working group! You can find us at https://py.gdcc.io and https://dataverse.zulipchat.com/#narrow/stream/377090-python

pdurbin commented 4 months ago

@hannesdatta you might want to give DVUploader a try: https://github.com/GlobalDataverseCommunityConsortium/dataverse-uploader