PatrickRWright / CopraRNA

Target prediction for prokaryotic trans-acting small RNAs

MIT License

is regex-based NCBI ID check needed? #27

Closed martin-raden closed 5 years ago

martin-raden commented 5 years ago

https://github.com/PatrickRWright/CopraRNA/blob/68703440972ecfd4a9ab44fed6e488bc2456f1ac/CopraRNA2.pl#L322

This regex check makes it impossible to add NCBI IDs that do not start with NC or NZ.

Since we subsequently check whether the ID is in the supported list, I have to ask whether this regex check is needed or whether it only complicates input checking.

Or does the prefix encode a specific subset of organisms/entries in NCBI?

Thanks for the clarification, Martin

PS: The issue popped up for entry CP000407.
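
To make the question concrete, here is a minimal sketch of the two-stage input check described above, written in Python rather than the Perl of CopraRNA2.pl. Everything in it is an assumption for illustration: the pattern `^(NC|NZ)_` only approximates "starts with NC or NZ" and may not match the real regex, and `SUPPORTED_IDS` is a hypothetical stand-in for the supported list, not its actual contents.

```python
import re

# Hypothetical stand-in for the prefix check under discussion; the actual
# pattern in CopraRNA2.pl may differ.
REFSEQ_PREFIX = re.compile(r"^(NC|NZ)_")

# Hypothetical excerpt of the supported-ID list (illustrative values only;
# whether CP000407 is really in the supported list is an assumption).
SUPPORTED_IDS = {"NC_000913", "CP000407"}


def passes_prefix_check(ncbi_id):
    """Return True if the ID starts with an NC_ or NZ_ prefix."""
    return REFSEQ_PREFIX.match(ncbi_id) is not None


def passes_list_check(ncbi_id, supported=SUPPORTED_IDS):
    """Return True if the ID appears in the supported list."""
    return ncbi_id in supported


def validate(ncbi_id, require_refseq_prefix=True):
    """Run the optional prefix check first, then the list lookup."""
    if require_refseq_prefix and not passes_prefix_check(ncbi_id):
        return False
    return passes_list_check(ncbi_id)
```

Under these assumptions, `validate("CP000407")` is rejected by the prefix stage even though the ID is in the list, while `validate("CP000407", require_refseq_prefix=False)` accepts it. An ID with a valid prefix that is missing from the list is rejected either way, which is the sense in which the list lookup already covers the regex for listed IDs.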

martin-raden commented 5 years ago

ping @JensGeorg

JensGeorg commented 5 years ago

Maybe a question for Patrick. I don't see a requirement for a RefSeq check besides the comparison with the kegg2refseq.. file.

PatrickRWright commented 5 years ago

As far as I remember, I originally decided to use only RefSeq records since these have a sensible degree of consistency. Over the past years this has worked quite well, and only a few RefSeqs triggered unforeseen exceptions. Briefly scanning the file you referred to makes me think that it should technically work, but I'm not sure how many exceptions this might trigger. Potentially you are opening Pandora's box with regard to errors on CopraRNA runs.

martin-raden commented 5 years ago

Hi both, thanks for your input. After digging around and thinking more about it, I agree with Patrick that we should stick with the current check for now. Best, Martin