airr-community / common-repo-wg

AIRR Community Common Repository Working Group
Apache License 2.0

"changelog" for repositories... #26

Closed bcorrie closed 6 months ago

bcorrie commented 5 years ago

Now that we have had a number of repositories up and running for a while, we have realized that the provenance of the data is quite important. For example, consider the following use case:

User 1 scratches their head and has no idea what happened... From a science reproducibility perspective, this is bad. If there is a "changelog" on each repository, then it would be possible to determine if it was a change in the repository that caused this issue.

There is a relatively simple solution to this issue. Each repository could optionally (not sure we can make it mandatory) maintain a "changelog" for their repository. One easy way for this to work would be to assume that a "changelog" exists on a web page somewhere. It could be maintained manually on a static site, it could exist on the repository site, and it could even be generated by the repository itself.

From a CRWG API perspective, this would be easy to implement through the /info entry point for the API, simply adding a new field to the /info response so it would look something like:

{ "name": "iReceptor Public Archive (IPA)", "version": "v1.0", "changelog": "http://www.ireceptor.org/repositories/IPA/changelog" }
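As a rough illustration of how a client might consume this field, here is a minimal Python sketch. It assumes the /info response is exactly the JSON shown above. The function name `get_changelog_url` is hypothetical, and the response is parsed from a literal string rather than fetched over the network.

```python
import json
from typing import Optional

# Example /info response body, copied from the proposal above.
INFO_RESPONSE = """
{
  "name": "iReceptor Public Archive (IPA)",
  "version": "v1.0",
  "changelog": "http://www.ireceptor.org/repositories/IPA/changelog"
}
"""


def get_changelog_url(info_body: str) -> Optional[str]:
    """Return the changelog URL from an /info response, or None if absent."""
    info = json.loads(info_body)
    # The field is optional in this proposal, so a missing key is not an error.
    return info.get("changelog")


if __name__ == "__main__":
    print(get_changelog_url(INFO_RESPONSE))
```

A client that finds no `changelog` key would simply have nothing to show the user, which is the behaviour you would expect while the field is optional.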

This makes the changelog completely independent of the actual repository (the DB and the service don't have to do anything). The nice thing is that if the repository actually manages its own changelog (updating it when new data is added) and generates the page automatically, you could still point to the repository's generated page:

https://ipa.ireceptor.org/airr/v1/info

could generate:

{ "name": "iReceptor Public Archive (IPA)", "version": "v1.0", "changelog": "http://ipa.ireceptor.org/airr/v1/changelog" }

The changelog interface isn't part of the AIRR API, but it doesn't stop the repository and service from providing one...
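On the repository side, both cases (an external, hand-maintained page and a page generated by the service itself) could be covered by something like the sketch below. This is only an assumption about how a service might assemble its /info response: `build_info` is a hypothetical helper, and the `/changelog` path under the service's base URL is taken from the example above, not from the AIRR API.

```python
from typing import Optional


def build_info(name: str, version: str, base_url: str,
               changelog_url: Optional[str] = None) -> dict:
    """Assemble an /info response that always carries a changelog URL.

    If no external changelog page is configured, fall back to a page
    served by the repository itself under its own base URL.
    """
    if changelog_url is None:
        # Hypothetical convention: the service hosts its own changelog page.
        changelog_url = base_url.rstrip("/") + "/changelog"
    return {"name": name, "version": version, "changelog": changelog_url}


if __name__ == "__main__":
    # Self-hosted changelog page.
    print(build_info("iReceptor Public Archive (IPA)", "v1.0",
                     "https://ipa.ireceptor.org/airr/v1"))
    # Manually maintained page on a separate static site.
    print(build_info("iReceptor Public Archive (IPA)", "v1.0",
                     "https://ipa.ireceptor.org/airr/v1",
                     "http://www.ireceptor.org/repositories/IPA/changelog"))
```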

We could go further in solving this issue by adding a /changelog entry point into the API, but that feels like overkill to me...

Thoughts?

This seems like it is a pretty important concept that we should be considering...

Brian

lgcowell commented 5 years ago

I agree this is important. What is the argument against making it mandatory?

schristley commented 5 years ago

We discussed this in the WG a while ago under the topic of versioning data, though I can see a changelog as being different: only a single version of the data is maintained, and the changelog only explains what has changed. What would be nice is if that changelog could be formal enough that you could regenerate the previous version of the data if needed.
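To make "formal enough" a bit more concrete, here is a minimal sketch of a changelog whose entries carry enough information to be undone. Everything here is an assumption for illustration: the `Change` structure, the field names, and the idea of keying records by an ID are hypothetical and not part of any AIRR specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Change:
    """One changelog entry for a single record (hypothetical structure)."""
    record_id: str
    old: Optional[dict]  # None means the record did not exist before
    new: Optional[dict]  # None means the record was deleted


def apply_change(data: Dict[str, dict], change: Change) -> None:
    """Move the data forward by one change."""
    if change.new is None:
        data.pop(change.record_id, None)
    else:
        data[change.record_id] = change.new


def revert_change(data: Dict[str, dict], change: Change) -> None:
    """Undo one change by restoring the value recorded in `old`."""
    if change.old is None:
        data.pop(change.record_id, None)
    else:
        data[change.record_id] = change.old


def rollback(data: Dict[str, dict], log: List[Change], keep: int) -> Dict[str, dict]:
    """Return a copy of `data` with all but the first `keep` changes undone."""
    previous = dict(data)
    # Undo the most recent changes first.
    for change in reversed(log[keep:]):
        revert_change(previous, change)
    return previous


if __name__ == "__main__":
    log = [
        Change("r1", None, {"n": 1}),      # record added
        Change("r1", {"n": 1}, {"n": 2}),  # record updated
        Change("r2", None, {"n": 5}),      # second record added
    ]
    current: Dict[str, dict] = {}
    for c in log:
        apply_change(current, c)
    print(rollback(current, log, keep=1))
```

Storing the full old value in every entry is what makes the rollback possible, and it is also why this approach gets expensive as the data grows.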

For the study metadata (repertoires), the data is small enough that we could do actual versioning of the data versus just a changelog. The challenge is with the rearrangement data.

bcorrie commented 5 years ago

In answer to @lgcowell, it would be easy to make it mandatory if we kept it simple as per my description. Thou shalt have a changelog and provide a URL to that changelog. A changelog is low-hanging fruit in that it is easy to mandate in the API and easy to implement (if done in a way similar to what I describe above).
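If the field were mandatory, a conformance check could be as small as the sketch below. This assumes the only requirement is "a `changelog` field holding an http(s) URL". The function name `check_info` and the wording of the messages are hypothetical, and the check does not try to fetch the page.

```python
from typing import List
from urllib.parse import urlparse


def check_info(info: dict) -> List[str]:
    """Return a list of problems with the changelog field (empty if OK)."""
    problems: List[str] = []
    url = info.get("changelog")
    if not isinstance(url, str) or not url:
        problems.append("missing required field: changelog")
        return problems
    parsed = urlparse(url)
    # Only the shape of the URL is checked; the content is not mandated.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("changelog is not an http(s) URL: " + url)
    return problems


if __name__ == "__main__":
    print(check_info({
        "name": "iReceptor Public Archive (IPA)",
        "version": "v1.0",
        "changelog": "http://www.ireceptor.org/repositories/IPA/changelog",
    }))
    print(check_info({"name": "iReceptor Public Archive (IPA)", "version": "v1.0"}))
```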

I agree with Scott, the changelog is really the first step... The real value comes from versioning and being able to reproduce the data. But I think that is very difficult to specify, difficult to implement, and would be a huge burden on a repository implementer.

We could take the following approach if we wanted to move in this direction. 1) Make a changelog mandatory in the /info entry point, but keep it very simple, and don't mandate the content of the changelog. 2) A second layer might be to add some rigor and definition to the changelog, so that changelogs are comparable across repositories. This is metadata after all... 3) Finally, we could dive into tackling the data versioning aspects.
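For step 2), one possible shape for a more rigorous changelog is a list of entries with a small, fixed set of fields. The sketch below is purely illustrative: the field names (`date`, `action`, `study_id`, `description`), the allowed actions, and the sample entries are all assumptions, not an agreed format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

# Hypothetical controlled vocabulary for what happened to the data.
ALLOWED_ACTIONS = ("added", "updated", "removed")


@dataclass
class ChangelogEntry:
    date: str         # ISO 8601 date, e.g. "2018-11-28"
    action: str       # one of ALLOWED_ACTIONS
    study_id: str     # identifier of the affected study (placeholder values below)
    description: str  # free-text explanation for human readers

    def __post_init__(self) -> None:
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {self.action}")


def render_changelog(entries: List[ChangelogEntry]) -> str:
    """Serialize entries as JSON, newest first."""
    # ISO 8601 dates sort correctly as plain strings.
    ordered = sorted(entries, key=lambda e: e.date, reverse=True)
    return json.dumps([asdict(e) for e in ordered], indent=2)


if __name__ == "__main__":
    entries = [
        ChangelogEntry("2018-10-01", "added", "study-A", "Initial load."),
        ChangelogEntry("2018-11-15", "updated", "study-A", "Re-annotated rearrangements."),
    ]
    print(render_changelog(entries))
```

Because every repository would emit the same fields, a client could merge or compare changelogs from several repositories without repository-specific parsing.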

I think 2 and 3 are probably outside the scope of what we are talking about for the immediate future, but perhaps they should be put on the agenda for a future issue that needs to be addressed.

bcorrie commented 5 years ago

Hmm, in thinking about this a bit more, we could choose to tackle 2) above if we wanted to, but I think it would slow things down in getting an API defined and implemented... On the flip side, starting 2) would possibly set the stage for 3), and not starting to think about 2) now might make it more difficult to move to 3) smoothly???

lgcowell commented 5 years ago

That makes sense. Thanks!

lgcowell commented 5 years ago

In my opinion, the API is defined and has been implemented by one repository (VDJServer). Granted, needed revisions may become apparent as additional repositories implement the API, but until then, it seems widening our scope to include 2) won’t slow us down. That, coupled with your second comment “not starting to think about 2) now might make it more difficult to move to 3) smoothly”, makes me think we do want to go ahead and begin thinking about 2). What do the others think?

schristley commented 5 years ago

I agree. I think what we've done so far is good enough for a V1 of the API that we can publish. We can then think about what we want to work on for V2 while keeping backward compatibility with V1.

lgcowell commented 5 years ago

?

On Nov 28, 2018, at 10:02 AM, Scott Christley wrote:

I agree. I think what we've done so far is good enough for a V1 of the API that we can publish. We can then think about what we want to work on for V2 while keeping backward compatibility with V1.

bcorrie commented 6 months ago

Closing this as an old issue... Cleaning up various artifacts in the CRWG repository.