airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

coordination and curation of studies in the ADC #431

Closed schristley closed 8 months ago

schristley commented 4 years ago

This has come up in a number of contexts. With ADC V1 out, there is more activity to curate historical studies and put them into data repositories. It would be good to coordinate so there isn't too much duplication of effort. Also, it might be helpful to have a common resource about curation questions, how to code things properly into MiAIRR, and so forth. Some things we might want to consider:

other ideas?

bcorrie commented 4 years ago
  • Related to the last point, curating a study has multiple steps, creating the AIRR study metadata is one, but the sequences also need to be processed and loaded, which can be more time-consuming. Do we consider it ok for repositories to put up repertoires even though the rearrangements are not available yet? Do we want to require that a study must appear "all at once" in a repository so partial data isn't queried? Introduce a flag to warn the user that a study is "in process of loading?" I can see a certain benefit for repertoire metadata to be available immediately.

We try to do everything at once (not have metadata loaded without rearrangements) as an iReceptor repository policy. Personally, I think it is confusing to have repertoire metadata without rearrangements, in particular from the iReceptor Gateway perspective. Our gateway searches AIRR-seq data, and it ain't AIRR-seq data without rearrangements 8-) I think it is potentially even more confusing to the user/consumer when data "trickles in" because it is loaded slowly for a study. From a data provenance perspective that seems bad to me... It seems most consistent for a study and its data to "come on line" all at once.

Now whether we want to make that a policy or not, I am not sure, as it seems a bit draconian to me...

For example, we recently had our latest COVID-19 paper added to the AIRR COVID-19 repository. The paper was released as a pre-print and it stated the data was on line in iReceptor, but we didn't have the data loaded yet. So we broke our own policy, as we didn't want people to go to the iReceptor Gateway and not find the study (even though there were no rearrangements). I think something like that is probably OK, as long as it doesn't sit that way for a long period of time. We were in the process of loading the rearrangements and we enabled the study with the rearrangements a couple of days later...

I think having repertoire metadata with no rearrangements for a long period of time is probably bad. How we create a recommendation and policy around that I am not too sure...

bcorrie commented 4 years ago
  • How to signal that a repository is in the process of curating a study? There's the informal method of using the lists up on the b-t.cr forums, but one thing I've noticed is there is a size limit to posts on b-t.cr so these lists invariably will need to span multiple posts, and they will need to be re-adjusted all the time as new papers are added.

Yes, this is cumbersome, and we spend a fair bit of time keeping this up to date. It would be nice if we had a better mechanism...

schristley commented 4 years ago

Yes, this is cumbersome, and we spend a fair bit of time keeping this up to date. It would be nice if we had a better mechanism...

Yeah, me too, and have a dozen studies in various phases of being processed...

schristley commented 4 years ago

So we broke our own policy as we didn't want people to go to the iReceptor Gateway and not find the study

Yep, that's exactly the type of benefit I was thinking of...

schristley commented 4 years ago

Personally, I think it is confusing to have repertoire metadata without rearrangements - in particular from the iReceptor Gateway perspective.

Right, but it seems you don't have control over that, as repositories will be doing whatever. Then your only choice from a Gateway perspective is either turn off the whole repository, or live with the "partial-ness".

even more confusing to the user/consumer when data "trickles in" if data is slowly loaded for a study.

I agree, and I feel that could have serious ramifications: somebody downloads partial data, does an analysis, publishes, etc., without ever realizing they didn't have the full data. In this case, I feel it's important from a scientific perspective to avoid this. We may not be able to enforce it, but it's worth putting in as a recommendation so that data repositories understand why it is a "bad thing".

How we create a recommendation and policy around that I am not too sure...

The easiest solution to me seems to be adding a flag into the repertoire. This would allow clients like the Gateway to include or exclude those repertoires, have a setting the user can toggle to include or not, have a warning message "study in process of being loaded", or something like that. It won't handle all scenarios but can cover a lot.
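As a rough illustration of how a client might act on such a flag, here is a minimal Python sketch. The field name `loading_in_progress` is hypothetical (no such field has been defined in this discussion), and repertoires are treated as plain dictionaries for simplicity.

```python
from typing import Iterable, List, Tuple


def split_by_loading_flag(
    repertoires: Iterable[dict],
    flag: str = "loading_in_progress",  # hypothetical field name, not in the AIRR schema
) -> Tuple[List[dict], List[dict]]:
    """Separate repertoires into (complete, still loading) based on a boolean flag.

    A repertoire without the flag is treated as complete.
    """
    complete: List[dict] = []
    in_progress: List[dict] = []
    for rep in repertoires:
        if rep.get(flag, False):
            in_progress.append(rep)
        else:
            complete.append(rep)
    return complete, in_progress


repertoires = [
    {"repertoire_id": "r1"},
    {"repertoire_id": "r2", "loading_in_progress": True},
]
complete, in_progress = split_by_loading_flag(repertoires)
```

A client could then search only `complete` by default and show a "study in process of being loaded" warning for anything in `in_progress`.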

bcorrie commented 4 years ago

Personally, I think it is confusing to have repertoire metadata without rearrangements - in particular from the iReceptor Gateway perspective.

Right, but it seems you don't have control over that, as repositories will be doing whatever. Then your only choice from a Gateway perspective is either turn off the whole repository, or live with the "partial-ness".

even more confusing to the user/consumer when data "trickles in" if data is slowly loaded for a study.

I agree, and I feel that could have serious ramifications: somebody downloads partial data, does an analysis, publishes, etc., without ever realizing they didn't have the full data. In this case, I feel it's important from a scientific perspective to avoid this. We may not be able to enforce it, but it's worth putting in as a recommendation so that data repositories understand why it is a "bad thing".

Yes, I agree, a strong recommendation perhaps, stressing scientific reproducibility and data provenance as the drivers for this recommendation.

bcorrie commented 4 years ago

The easiest solution to me seems to be adding a flag into the repertoire. This would allow clients like the Gateway to include or exclude those repertoires, have a setting the user can toggle to include or not, have a warning message "study in process of being loaded", or something like that. It won't handle all scenarios but can cover a lot.

Not sure about this one. It relies on adherence to the recommendation, just like not having partial data does 8-)

And it might not help that much, because you simply raise the level at which there is confusion. You might have a study that is incomplete with only 1 out of 10 repertoires loaded, but there is no way of knowing that either. So you have the same problem.

I would lean toward recommending that an entire study be made "available" at once in its entirety. If a repository doesn't follow the recommendation then one can't do much about it.

bcorrie commented 2 years ago

@schristley this doesn't seem to be required for a v1.4 release, and I won't have time to get to it in the next few days, do you? If not, should we remove it from the v1.4 milestone?

bcorrie commented 8 months ago

I would lean toward recommending that an entire study be made "available" at once in its entirety. If a repository doesn't follow the recommendation then one can't do much about it.

@schristley I suggest adding this to the documentation for v2.0 as a recommendation for data stewards/data curators - although it can't really be enforced.

schristley commented 8 months ago

I would lean toward recommending that an entire study be made "available" at once in its entirety. If a repository doesn't follow the recommendation then one can't do much about it.

@schristley I suggest adding this to the documentation for v2.0 as a recommendation for data stewards/data curators - although it can't really be enforced.

I think we are starting to get to a solution that will allow some level of partialness. When this issue was opened it was mainly Repertoire and Rearrangement, but now the notion of "entire study" has broken down with the new objects and API endpoints like Clone, etc.

We have added study keywords like contains_schema_rearrangement which can be used as a flag like I suggested above. The question is how to interpret those keywords. If you take them to mean "the authors generated rearrangement data as part of the study", then that has a different meaning from "the ADC repository has rearrangement data loaded for this study". If you think the former interpretation is what is meant, then my suggestion is we extend the Study object in the API with a field like adc_keywords (maybe you have a better suggestion?) to represent the latter interpretation.

Now that we have agreed upon an extension capability for the API, let's use it to formalize some behaviors we'd like.
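To make the two interpretations concrete, here is a minimal Python sketch. It assumes the study keywords are carried in a `keywords_study` list on the Repertoire's `study` object, and `adc_keywords` is only the hypothetical extension field floated above, not an existing API field. The fallback behavior is likewise an assumption, chosen for illustration.

```python
KEYWORD = "contains_schema_rearrangement"


def repository_has_rearrangements(repertoire: dict) -> bool:
    """Guess whether the repository has Rearrangement data loaded for a repertoire.

    Prefers the hypothetical `adc_keywords` field (what the repository has loaded).
    If that field is absent, falls back to reading `keywords_study` as a statement
    about loaded data, which is only one of the two possible interpretations.
    """
    study = repertoire.get("study") or {}
    adc_keywords = study.get("adc_keywords")  # hypothetical extension field
    if adc_keywords is not None:
        return KEYWORD in adc_keywords
    return KEYWORD in (study.get("keywords_study") or [])


# Authors generated rearrangements, but the repository has not loaded them yet.
not_loaded = {"study": {"keywords_study": [KEYWORD], "adc_keywords": []}}
# No extension field present: fall back to the study keywords.
legacy = {"study": {"keywords_study": [KEYWORD]}}
```

With a separate field, `not_loaded` can say "this study produced rearrangements" and "this repository does not have them yet" at the same time, which a single keyword list cannot express.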

bcorrie commented 8 months ago

@schristley I added some basic "Running a repository" docs here:

https://github.com/airr-community/airr-standards/pull/752/commits/ab2b8ce470fcfe72840236ab58f35dd746058a68

bcorrie commented 8 months ago

I think we are starting to get to a solution that will allow some level of partialness. When this issue was opened it was mainly Repertoire and Rearrangement, but now the notion of "entire study" has broken down with the new objects and API endpoints like Clone, etc.

I think in a way, the docs I just added kind of address this. Maybe we can make it a bit more explicit.

I agree, it makes sense for a data steward to load "part of a study" if part of a study means

Later they might add Clone data. Or maybe Cell + Expression. We do this all the time.

I think what we want to suggest is that it is not a good idea to be loading the data for one of the Schema objects (Rearrangement, Clone, Cell, Expression etc) a bit at a time. Basically, if you are loading Rearrangements for a Study then only make that data available in the ADC once all of the Rearrangements are loaded.

bcorrie commented 8 months ago

I think we should come up with some docs now for the v2.0 release - maybe what I have now is sufficient? Maybe a bit more detail.

We can then consider the extensions that you are talking about, which I think are beyond v2.0?

bcorrie commented 8 months ago

We have added study keywords like contains_schema_rearrangement which can be used as a flag like I suggested above. The question is how to interpret those keywords. If you take them to mean "the authors generated rearrangement data as part of the study", then that has a different meaning from "the ADC repository has rearrangement data loaded for this study". If you think the former interpretation is what is meant, then my suggestion is we extend the Study object in the API with a field like adc_keywords (maybe you have a better suggestion?) to represent the latter interpretation.

We currently use the study keywords to indicate that there is data in the repository for that schema object for that study. So if a Repertoire in our repository has contains_schema_rearrangement set, then there will be Rearrangement data in the repository.

This is an interesting subtle difference you bring up, which makes me think the more detailed discussion is definitely v2.1.

bcorrie commented 8 months ago

Made changes to the docs to clarify loading schema objects in their entirety (e.g. Rearrangements) as opposed to loading a study in its entirety (which may include Rearrangement, Clone, Cell, CellExpression, etc).

Basically we are saying it is OK to load Rearrangements, put a repository into production, then load Cells later. It is frowned upon to partially load the data for a single schema object (e.g. Rearrangement) from a study on a production repository (e.g. having a study with only half the Rearrangements loaded in production).
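One way a data steward could follow this recommendation is to stage records for a schema object and only expose them once the full set is present. The Python sketch below is purely illustrative: the class `StagedSchemaLoader` and its methods are hypothetical and do not correspond to any existing repository code, and it assumes the expected record count is known in advance.

```python
from typing import List


class StagedSchemaLoader:
    """Hold records for one schema object (e.g. Rearrangement) until all are staged."""

    def __init__(self, expected_count: int) -> None:
        self.expected_count = expected_count
        self._staged: List[dict] = []
        self.published: List[dict] = []  # what a production query would see

    def stage(self, records: List[dict]) -> None:
        """Add a batch of records without making them visible."""
        self._staged.extend(records)

    def publish(self) -> bool:
        """Expose the staged records only if the full set has been staged."""
        if len(self._staged) != self.expected_count:
            return False
        self.published = list(self._staged)
        return True


loader = StagedSchemaLoader(expected_count=3)
loader.stage([{"sequence_id": "s1"}, {"sequence_id": "s2"}])
first_attempt = loader.publish()   # refused: only 2 of 3 records staged
loader.stage([{"sequence_id": "s3"}])
second_attempt = loader.publish()  # succeeds: all 3 records staged
```

Under this scheme, loading Cells later would simply use a second loader, so a study can still grow one schema object at a time without any single object being partially visible.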

bcorrie commented 8 months ago

Created #756 to capture new issue, closing this issue.