issues parsing genbank files produced by Pharokka

vmkhot commented 7 months ago

Hello,

While trying to parse gbk files produced by Pharokka using Biopython, I ran into this error: #4694

Essentially, some of the qualifier keys in the genbank records are too long and wrap onto the next line, and Biopython has no way to handle this.

I raised it with the Biopython developers and ended up editing the file parser in the Bio package itself (scanner.py) as a workaround.

Their suggestion was for me to also reach out to you with the error, so that perhaps you can fix these wrap-around keys.

They also added a warning when writing genbank files with sketchy qualifier keys: #4703

Thanks! and thanks for your tool too :)

Varada

gbouras13 commented 7 months ago

Hi @vmkhot ,

Thanks for this - 2021-2 me who wrote pharokka was pretty average at coding. A little bit improved now I hope! Pharokka needs a complete refactor to be honest at some point when I get some time.

The issue stems from this part of pharokka https://github.com/gbouras13/pharokka/blob/40c78f65a19dd33e0b2a30b1a8ca434f3c8a2792/bin/processes.py#L760

My question here is, are the ID and locus tag lines wrapping an issue? Or only the other qualifiers (VFDB, CARD etc) - I do also think that 'function' might be an issue with DNA, RNA metabolism. If you came across more please let me know.

If the ID and locus tags are too long, I am not sure what fix I can put in unless I truncate them (which will almost certainly cause issues inside Pharokka). If I do this, they may no longer be unique between CDS. I would say that for these, the user really needs to rename their contig IDs and/or use shorter locus tags.

If it is only the VFDB/CARD qualifiers, well, I agree that I should remove all spaces/quotes/brackets, and that should be an easy fix - seems pretty stupid by me to have them like this in hindsight.
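
For illustration, the kind of clean-up described above (removing spaces, quotes and brackets from VFDB/CARD-derived keys) could look roughly like this. It is a minimal sketch, not Pharokka's actual code: the function name `sanitize_qualifier_key` is hypothetical, and so is the choice to replace each run of disallowed characters with an underscore.

```python
import re


def sanitize_qualifier_key(raw_key: str) -> str:
    """Return a key made only of letters, digits, '_' and '-'.

    Hypothetical helper: every run of other characters (spaces, quotes,
    brackets, ...) is collapsed into a single underscore.
    """
    key = re.sub(r"[^A-Za-z0-9_-]+", "_", raw_key)
    # Drop underscores left over at either end, e.g. from a trailing ")".
    return key.strip("_")


print(sanitize_qualifier_key(
    "resistance-nodulation-cell division (RND) antibiotic efflux pump"
))
# prints: resistance-nodulation-cell_division_RND_antibiotic_efflux_pump
```

Note that this sketch does not shorten the key. A key this long, written together with the 21-column indent and its value, still exceeds the 80-column GenBank line width, so a real fix might prefer to move such text into a qualifier value instead.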

George

vmkhot commented 7 months ago

Hi George,

Thanks for your reply! I didn't have any issues with the IDs or locus tags, and the ones in the gbk files I parsed were quite long (like below). The only errors I got were with one of the qualifier keys. Below is an example of a problematic feature. In particular, when we were debugging, we found that it was /resistance-nodulation-cell division (RND) antibiotic efflux pump="true" that was breaking the Biopython parser.

     CDS             568..951
                     /ID="ERZ1035062_ERZ1035062.110-NODE-110-length-34883-cov-5.
                     965832_CDS_0002"
                     /transl_table=11
                     /phrog="30333"
                     /top_hit="p76161 VI_12486"
                     /locus_tag="ERZ1035062_ERZ1035062.110-NODE-110-length-34883
                     -cov-5.965832_CDS_0002"
                     /function="unknown function"
                     /product="hypothetical protein"
                     /CARD_short_name="marA"
                     /AMR_Gene_Family="General Bacterial Porin with reduced
                     permeability to beta-lactams"
                     /resistance-nodulation-cell division (RND) antibiotic
                     efflux pump="true"
                     /CARD_species="Escherichia coli str. K-12 substr. W3110"
                     /source="Pyrodigal-gv_0.2.0"
                     /score="40.0"
                     /phase="0"
                     /translation="MSRRNTDAITIHSILDWIEDNLESPLSLEKVSERSGYSKWHLQRM
                     FKKETGHSLGQYIRSRKMTEIAQKLKESNEPILYLAERYGFESQQTLTRTFKNYFDVPP
                     HKYRMTNMQGESRFLHPLNHYNS*"

There might be more instances that are problematic, but our workaround skipped over these, so this is the only one I know of.
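
One way to look for further instances without touching the parser would be to scan the raw gbk text for qualifier lines whose key does not look like a plain `/key=value` or `/key`. The sketch below is a heuristic only. It assumes the standard 21-space qualifier indent seen in the example above, the name `find_suspect_qualifiers` is made up for illustration, and a wrapped value line that happens to start with "/" would be flagged as a false positive.

```python
import re

INDENT = " " * 21  # qualifier lines in the feature table start at column 22

# Well-formed qualifier line: indent, "/", a key of letters/digits/"_"/"-",
# then either "=" plus a value or nothing at all (flag-style qualifiers).
WELL_FORMED = re.compile(r"^ {21}/[A-Za-z0-9_-]+(=.*)?$")


def find_suspect_qualifiers(gbk_text):
    """Return (line_number, stripped_line) for qualifier lines with odd keys."""
    suspects = []
    for number, line in enumerate(gbk_text.splitlines(), start=1):
        line = line.rstrip()
        if line.startswith(INDENT + "/") and not WELL_FORMED.match(line):
            suspects.append((number, line.strip()))
    return suspects


sample = "\n".join([
    "     CDS             568..951",
    INDENT + "/transl_table=11",
    INDENT + '/CARD_short_name="marA"',
    INDENT + "/resistance-nodulation-cell division (RND) antibiotic",
    INDENT + 'efflux pump="true"',
])

print(find_suspect_qualifiers(sample))
# prints: [(4, '/resistance-nodulation-cell division (RND) antibiotic')]
```

To check a real file, the same function could be called on the result of `open(path).read()`, where `path` points at the gbk file in question.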

In terms of renaming or truncating IDs and locus tags - I strongly prefer it when programs don't auto-rename data as I often use that information downstream to map results back and forth. I agree that the contig names are way too long in my dataset. Typically, my workflows include renaming my bins and contigs to meaningful headers before using programs like Pharokka, but the gbk files were not generated by me so just trying to make the most of what's available :)

gbouras13 commented 7 months ago

Thanks @vmkhot, I'm glad the locus tags and IDs are ok.

I should be able to fix this in the next update of pharokka by cleaning up the format-breaking metadata from CARD and VFDB (not that it will be helpful for you necessarily but still it will be helpful downstream!) - thanks for alerting me.

George

gbouras13 commented 4 months ago

Hi @vmkhot ,

I've put in a fix to solve this issue (I hope) and it will be available in v1.7.3 soon.

Regarding your data, I see you're in Jena with Bas - I think I was probably involved in generating it :) and have some improvements to make with https://github.com/gbouras13/phold coming soon. Best to move chat over email if you'd like, george.bouras@adelaide.edu.au

George

gbouras13 / pharokka

issues parsing genbank files produced by Pharokka #339