Sequences less than 200bp not accepted by GenBank for AIRR submission

initoby commented 6 years ago

The notes below are from Lori Black at NCBI GenBank, sent in an email response about some of the sequences in one of the FASTA files I submitted for the AIRR standards.

[2] Many of the sequence(s) in your file(s) are less than 200 bp.

Unfortunately, we must inform you that we have a policy not to accept sequences shorter than 200 bp. We realize that this has short-term consequences for submitters, but feel that the long-term improvements in the database will be helpful for all database users.

If you resubmit your sequence submission(s) with additional sequence, we may then be able to accept your sequence(s). Alternatively, if you would like us to delete the sequence(s) that are under 200 bp and proceed with the rest, please inform us.
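For anyone who wants to check a file before submitting, here is a minimal sketch of how the under-200 bp records could be separated out locally. It assumes plain FASTA input and uses only the Python standard library; the names `read_fasta`, `split_by_length`, and `MIN_LENGTH_BP` are hypothetical and are not part of any GenBank or AIRR tooling. Only the 200 bp threshold comes from the response quoted above.

```python
from typing import Iterable, Iterator, List, Tuple

# Threshold taken from the GenBank response quoted above.
MIN_LENGTH_BP = 200

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(
    records: Iterable[Record], min_length: int = MIN_LENGTH_BP
) -> Tuple[List[Record], List[Record]]:
    """Partition records into (accepted, too_short).

    A sequence of exactly min_length is accepted, since the quoted
    policy rejects sequences *shorter than* 200 bp.
    """
    accepted: List[Record] = []
    too_short: List[Record] = []
    for header, seq in records:
        if len(seq) >= min_length:
            accepted.append((header, seq))
        else:
            too_short.append((header, seq))
    return accepted, too_short


if __name__ == "__main__":
    # Made-up example input: one 200 bp read and one 120 bp read.
    example = [">read_a", "ACGT" * 50, ">read_b", "ACGT" * 30]
    kept, dropped = split_by_length(read_fasta(example))
    print("kept:", [h for h, _ in kept])
    print("under 200 bp:", [h for h, _ in dropped])
```

To run it on a real file, pass an open file handle to `read_fasta` instead of the example list. The sketch only counts characters per record; it does not validate the nucleotide alphabet or any other submission requirement.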

lgcowell commented 6 years ago

Thanks Ini. All – This means no data from Adaptive Biotech can be submitted. Perhaps that’s fine since they have their own repository, but then I guess that would need to be mentioned in the standard.

frubelt commented 6 years ago

Thank you. Yes, we should mention that this could be a problem. We just have to decide whether the manuscript should say specifically that it can't be used for Adaptive reads, or stay more neutral and say it works only for sequences of 200 bp or longer, leaving the specifics to the detailed documentation.

Florian

Florian Rubelt, Dr. Mark M. Davis Laboratory Howard Hughes Medical Institute Stanford University School of Medicine

bussec commented 6 years ago

Hi Ini. Thanks, that's clearly a limitation we were not aware of. What kind of reads did you try to submit?

bcorrie commented 6 years ago

Although I can't speak to how important this limitation of GenBank is, it does speak to the importance of ensuring that we don't "bias" the terminology in the standard toward the NCBI repositories. I know that there has been a lot of work around mapping the standard to NCBI, but we should be careful with the wording in MiAIRR so that we do not make it too NCBI-focused...

We have had a couple of internal discussions around our curation process, and one of the comments was that the descriptions of some of the MiAIRR fields explicitly say they require an NCBI identifier, even though not all studies end up in NCBI.

For example: "1 / study Study String Alphanumeric UID assigned by NCBI"

This should probably be changed to something like:

"Unique alphanumeric ID for the study, typically the UID assigned by an international repository such as NCBI"

lgcowell commented 6 years ago

excellent point.

bussec commented 6 years ago

The latter issue has been changed to "Unique ID assigned by study registry" in 8f63c1e8dde61ba771bc062f1fa74f74c061ac25. This commit also includes other changes that should make the content definitions independent of NCBI.

airr-community / airr-standards

Sequences less than 200bp not accepted by GenBank for AIRR submission #26