Speed comparison with GBParsy

althonos / gb-io.py

A Python interface to gb-io, a fast GenBank parser written in Rust.

MIT License


Speed comparison with GBParsy #39

Open xapple opened 1 year ago

xapple commented 1 year ago

I was looking for a fast way of processing large amounts of GenBank entries and found your library. It definitely offers an improvement over Biopython, but I'm wondering why you did not include GBParsy in the speed comparison. It is a parser written in pure C, and likely even faster than gb-io.

Lee TH, Kim YK, Nahm BH. GBParsy: a GenBank flatfile parser library with high speed. BMC Bioinformatics. 2008 Jul 25;9:321. doi: 10.1186/1471-2105-9-321. PMID: 18652706; PMCID: PMC2516526.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2516526/

https://github.com/thlee/gbfp/

althonos commented 1 year ago

Hi @xapple,

I did not include GBParsy because I was not aware of this project, and since it's not on PyPI it's not exactly the most convenient, tools-included GenBank parser out there. Additionally, I tried to build it from source using the GitHub repository you linked, but the code seems quite outdated (it still uses the PyString_FromStringAndSize C API function, which was removed in Python 3)...

xapple commented 1 year ago

Yes, you are right: the code was written in 2008, which is sixteen years ago, and is probably not compatible with the current Python C API. Also, it has not been uploaded to PyPI or conda-forge.

Digging a bit deeper, I realized that the code in the GitHub repository is an export of the old Google Code repository and doesn't represent the latest version. The repository has v0.5.0, while the supplementary file of the publication itself includes v0.6.0 (2008-07-10) at:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2516526/bin/1471-2105-9-321-S1.tgz
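
If someone does get a newer GBParsy build working, a comparison could be run with a small timing harness along these lines. This is only a sketch using the standard library: `count_records`, `best_time`, and `compare` are hypothetical names, and `count_records` is a stand-in that merely counts `//` record terminators rather than parsing anything, so it says nothing about how gb-io, Biopython, or GBParsy actually perform.

```python
import time
from typing import Callable, Dict


def count_records(path: str) -> int:
    """Stand-in 'parser' (an assumption for illustration): count GenBank
    records by their '//' terminator lines, without parsing the contents."""
    with open(path, "r") as handle:
        return sum(1 for line in handle if line.rstrip() == "//")


def best_time(parse: Callable[[str], object], path: str, repeats: int = 5) -> float:
    """Run `parse(path)` `repeats` times and return the fastest wall-clock
    time in seconds; taking the minimum reduces noise from other processes."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        parse(path)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(parsers: Dict[str, Callable[[str], object]], path: str,
            repeats: int = 5) -> Dict[str, float]:
    """Time every named parser on the same file and return name -> seconds."""
    return {name: best_time(parse, path, repeats) for name, parse in parsers.items()}
```

To compare real parsers, each one would be wrapped in a function that takes a file path and fully consumes the records, then passed to `compare` in place of `count_records`. Running every parser against the same file in the same process keeps disk caching roughly comparable across them.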