airr-community / common-repo-wg

AIRR Community Common Repository Working Group
Apache License 2.0

Recommendation 4: too specific (SRA/Genbank)? #10

Closed by bcorrie 6 years ago

bcorrie commented 6 years ago

Hi All, In reading the recommendations and seeing how they apply to iReceptor, it occurred to me that Recommendation 4 might be too specific, in that it explicitly names SRA and GenBank and only those. Although I am not an expert on other repositories (nor on these), this seems very narrow and somewhat North America specific. Would it make more sense to have something like this:

Recommendation 4: For long-term storage, data and metadata should be deposited in one of the International Nucleotide Sequence Database Collaboration (INSDC) archives such as SRA, GenBank, and ENA, per the recommendations established by the AIRR Minimal Standards Working Group. The AIRR Working Groups should work with the INSDC archives to coordinate the accurate gathering and storage of metadata for AIRR data.

In this way, we are recommending that data be published in one of the recognized national/international repositories but not telling people "exactly" what to do. If INSDC gains another collaborator soon, then that should be a reasonable option too. As long as the second sentence is there, and the AIRR Community works with the repositories to ensure there are easy mechanisms to store data (as has been done with SRA and GenBank), then this should be fine.

lgcowell commented 6 years ago

Hi Brian,

Traveling so responding on my phone, so brief for now. Recommendation 4 came specifically from the minimal standards group. That is their recommendation which we incorporated for consistency. So this may be an issue you want to raise with that group. Thanks!

bussec commented 6 years ago

Hi Brian & Lindsay

Although the reference implementation the MiniStd WG is working on is based on SRA & GenBank, I do not see any general reason to object to Brian's changes. The main points (i.e. free and open deposition of the sequence data in a public DB that has long-term maintenance) will be served by any of the INSDC databases. In addition, when thinking about data sets that require controlled access, for EU-based depositors it will be simpler to go for EGA than for dbGaP.

The devil is - as usual - in the details, and in this case it is the metadata mapping, which is not uniform across INSDC once you go beyond the "flat file". Thus ENA's data scheme differs slightly from NCBI's. I asked the ENA helpdesk about this at the end of May:

[We have] completed the mapping of the [MiniStd items] to NCBI's BioProject/BioSample system. However, ENA's metadata structure (studies, experiment, sample, run) seems to be a bit different. Therefore I wanted to ask whether there is already any existing scheme for mapping metadata between the two databases.

To which their answer was:

It turns out there is no easy way of doing this. However, every one of the ENA SRA studies/samples has a BioProject/BioSample equivalent in NCBI, so de facto you could extract mapping rules from public metadata XMLs.

We have not yet found the time to come up with a mapping and it is not our top priority right now.
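To make the helpdesk's suggestion a bit more concrete, here is a minimal sketch of what pulling study/sample-to-BioSample links out of a metadata XML might look like. Everything specific in it is an assumption for illustration only: the XML snippet is made up, the element and attribute names (`SAMPLE`, `IDENTIFIERS`, `PRIMARY_ID`, `EXTERNAL_ID`, `namespace`) and the accessions are placeholders that would need checking against the real public XMLs, and `extract_biosample_links` is a hypothetical helper, not part of any AIRR or ENA tooling.

```python
import xml.etree.ElementTree as ET

# Made-up example document; the element names and accessions are assumptions,
# not copied from a real ENA record.
SAMPLE_XML = """
<SAMPLE_SET>
  <SAMPLE accession="ERS0000001">
    <IDENTIFIERS>
      <PRIMARY_ID>ERS0000001</PRIMARY_ID>
      <EXTERNAL_ID namespace="BioSample">SAMEA0000001</EXTERNAL_ID>
    </IDENTIFIERS>
  </SAMPLE>
  <SAMPLE accession="ERS0000002">
    <IDENTIFIERS>
      <PRIMARY_ID>ERS0000002</PRIMARY_ID>
    </IDENTIFIERS>
  </SAMPLE>
</SAMPLE_SET>
"""


def extract_biosample_links(xml_text):
    """Map each sample's primary ID to its BioSample ID (None if absent)."""
    root = ET.fromstring(xml_text.strip())
    links = {}
    for sample in root.iter("SAMPLE"):
        primary_id = sample.findtext("IDENTIFIERS/PRIMARY_ID")
        biosample_id = None
        # Keep only the external ID that is declared as a BioSample accession.
        for ext in sample.findall("IDENTIFIERS/EXTERNAL_ID"):
            if ext.get("namespace") == "BioSample":
                biosample_id = ext.text
        links[primary_id] = biosample_id
    return links


if __name__ == "__main__":
    for primary_id, biosample_id in extract_biosample_links(SAMPLE_XML).items():
        print(primary_id, "->", biosample_id)
```

This only recovers the accession-level links; deriving actual field-by-field mapping rules for the MiniStd items would still require comparing the attributes recorded on both sides.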

So in summary, yes, we should broaden Recommendation 4 to all INSDC DBs, but keep in mind that the current implementation only supports SRA/GenBank.

bcorrie commented 6 years ago

I think that makes sense, recognizing that there is the "principle" of having the data in the INSDC DB and the implementation, which is having a mechanism/process to upload data to a specific one of those DBs that meets AIRR minimal standards. The implementation will almost always lag behind the principle, and I think that is OK...

If we agree that this makes sense, we are agreeing that the data can reside in any of the INSDC repositories and that the AIRR community will work with them, over time, to come up with processes for those repositories to enable uploading data easily.

The current status of our implementation of such processes is: SRA/GenBank templates done, other templates are on the roadmap - but as Christian says, not a high priority right now.

I think agreeing with this means that we are adding scope to the Minimal Standards Working Group, in that we are saying that the community, managed through the MSWG, should come up with a mechanism to make it easy to load AIRR data into ENA, etc.

As Lindsay says, this does need to get tabled for discussion at the AIRR MSWG.

lgcowell commented 6 years ago

Thanks Christian. I agree, but I think MS would have to broaden theirs and then we would modify to be consistent.

bcorrie commented 6 years ago

I have created an issue with the Minimal Standards WG in this regard:

https://github.com/airr-community/airr-standards/issues/45

bussec commented 6 years ago

Please see commit c8e751a5490ae3e7ed8147cdf7c95279c75d7e50 for altered wording.