versioning system for datasets

jraffa commented 4 years ago

This might be more of a tech support issue, and we can talk offline if needed.

I am setting up the GOSSIS-1 project on PhysioNet. The project will have multiple versions based on which datasets (countries) are included. I'm wondering how to manage these. I could set up a new project for each version, but that seems excessive.

I was going to have multiple versions and do something like:

GOSSIS-1_AU_NZ_USAv1.0.

Which tells you it was GOSSIS-1 dataset, the countries, and what iteration of that dataset it is. Unfortunately versions can only have numbers.

To complicate things, I was hoping to have credentialing per-version. There should also be a correspondence with the code, but I suppose that can be done in the release notes citing a specific commit in git.

Any suggestions?

tompollard commented 4 years ago

> Unfortunately versions can only have numbers.

Or fortunately, depending on your perspective! There used to be more flexibility, but as you say we now encourage/enforce software-style semantic versioning for projects.

> I was going to have multiple versions and do something like: GOSSIS-1_AU_NZ_USAv1.0.

I think it's a little confusing to include country names in the version identifier. If an additional country was added, would the number be kept static or would it be advanced? (i.e. would it be GOSSIS-1_AU_NZ_USA_INDIAv1.0 or GOSSIS-1_AU_NZ_USA_INDIAv1.1?)

My suggestion would be to release the initial dataset as GOSSIS v1.0. For minor updates (e.g. bug fixes), you can advance the 2nd number (e.g. GOSSIS v1.0 becomes GOSSIS v1.1). For major updates you could advance the first number (e.g. GOSSIS v1.0 becomes GOSSIS v2.0).
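As a rough illustration of the numbering scheme above, here is a minimal sketch in Python. The `bump` helper is hypothetical and is not part of PhysioNet; it assumes the two-part `major.minor` form used in the examples in this thread.

```python
def bump(version: str, part: str) -> str:
    """Advance a 'major.minor' version string.

    Hypothetical helper for illustration only; assumes exactly two
    numeric components, as in the 'v1.0' examples in this thread.
    """
    major, minor = (int(piece) for piece in version.split("."))
    if part == "major":
        # A major update resets the minor number to zero.
        return f"{major + 1}.0"
    if part == "minor":
        # A minor update (e.g. a bug fix) advances only the second number.
        return f"{major}.{minor + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("1.0", "minor"))  # minor update to v1.0
print(bump("1.0", "major"))  # major update to v1.0
```

The country list stays out of the identifier entirely and would be described in the release notes instead.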

I think it's a personal decision whether adding a new country is treated as a major update or whether this becomes an entirely new dataset (e.g. GOSSIS-II v1.0). My preference is the former, because I think it helps to build the community around a single resource. There is a "Release Notes" section in the metadata where you can describe the update.

> To complicate things, I was hoping to have credentialing per-version.

Credentialing currently only happens once for a user, but I think that users are required to re-sign the DUA for new versions.

I'm not 100% sure that I'm remembering correctly and would need to double check, but I think that the behavior for self-managed projects is that users do need to request access for each new version.

jraffa commented 4 years ago

Thanks Tom.

I think managing access is going to be tricky then. I didn't do a good job explaining: the list of countries is not about adding a new country to the dataset as a new version, but rather about data contributors having a say in whether they want to grant access to an individual user/project. For example, India might say it doesn't want to grant access to a particular project, or Japan might not be legally allowed to export data to a particular country. I was intending to manage this process off-PhysioNet, where the user would simply either have access to a specific version of the dataset or not.

I'll talk with you offline about it, but it sounds like I might have to create a separate project for each country, and then the user would have to download the CSV files separately and stack the rows themselves.
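For what it's worth, the stacking step on the user's side could look something like the sketch below. This assumes every country's CSV shares the same header (the normalized 'GOSSIS format'); the function name and the file names in the usage comment are hypothetical.

```python
import csv
from typing import Iterable


def stack_csvs(paths: Iterable[str], out_path: str) -> int:
    """Concatenate CSV files that share a header into one file.

    Returns the number of data rows written. Raises ValueError if a
    file's header differs from the header of the first file.
    """
    header = None
    n_rows = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip completely empty files
                if header is None:
                    # Write the header once, taken from the first file.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{path}: header does not match the first file")
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows


# Hypothetical usage, one file per country project:
# stack_csvs(["gossis_au_nz.csv", "gossis_usa.csv"], "gossis_combined.csv")
```

Checking the header before appending guards against silently mixing files that were normalized differently.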

tompollard commented 4 years ago

Thanks Jesse, if I understand correctly you would like to be able to manage permissions at a file level, which isn't something that we can do right now. I think the only solution would be to have a separate project for each country, as you suggest.

There might be benefit in having a "core" GOSSIS project, along with country modules. This would be a similar approach to MIMIC-IV. A parent relationship can be added to the project metadata (e.g. see: https://physionet.org/content/eicu-crd-demo/).

tompollard commented 4 years ago

Just to add a quick note on this, I think there are some benefits to having countries submit data as separate projects. First, it means that contributors can make updates to their data as more becomes available.

More importantly, it provides a straightforward mechanism for encouraging contributions, i.e. we say to potential contributors something like this:

"If you would like to join the GOSSIS consortium, please submit your data to the PhysioNet platform for review. You should select "GOSSIS" as the parent project when submitting."

We could think about improving our ability to display collections of projects (which would be useful for MIMIC too). For example, potentially we could have "Collections" pages, that allow people to browse collections of related resources.

jraffa commented 4 years ago

This is likely a separate conversation, because there's a process to normalize the data to the 'GOSSIS format'. If we start relying on the contributors to normalize, then this would make sense, but our group has been doing it so far. Sites can certainly contribute their 'raw data' on PhysioNet. Right now, I think there's a desire to have a standardized format and a centralized process for obtaining this data, and this was the main purpose for having these datasets on PhysioNet under the GOSSIS umbrella.

tompollard commented 4 years ago

> This is likely a separate conversation, because there's a process to normalize the data to the 'GOSSIS format'.

It might be possible to handle the process of normalizing data through the submission system: (1) someone submits, (2) we give feedback via the system explaining what changes need to be made, or we restructure directly if needed, and (3) the project gets published as a GOSSIS module.

MIT-LCP / physionet-build

versioning system for datasets #1084