MrOlm / inStrain

Bioinformatics program inStrain
MIT License
137 stars 33 forks source link

Using non-prodigal gene calls #83

Open jdwinkler-lanzatech opened 2 years ago

jdwinkler-lanzatech commented 2 years ago

Hi,

I have a set of already called genes (not from prodigal) that I would like to use as input to inStrain. Is there any easy way to do this? Based on a previous issue (#56 specifically), it appears that inStrain depends on prodigal input specifically.

MrOlm commented 2 years ago

Hello,

Yes, at the moment this is a pretty glaring limitation of inStrain. Fixing this has been near the top of my ToDo list for a while.

Can I ask what format your genes are in? There is experimental support for gene input in the form of genbank files, but that feature is not very well tested.

Best, MO

jdwinkler-lanzatech commented 2 years ago

Nothing fancy, it's basically just a fasta file as follows:

>locus tag 1
ATG...
...
TAG

I'm not sure what information inStrain is extracting from the prodigal-do you need coordinates? Maybe the file + GFF annotations might work better than gene calls?

MrOlm commented 2 years ago

Here is an example prodigal header:

>NC_000913_4 # 3734 # 5020 # 1 # ID=1_4;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.528

The things inStrain picks up from that are:

1) The scaffold, which is everything before the last _ (in this case the scaffold is NC_000913)

2) The start location, which is after the first # (in this case 3734; note that prodigal uses 1-based indexing)

3) The end location, which is after the second # (in this case 5020)

4) The strand, which is after the thing # (which in this case is 1. The options are 1 or -1

If you add all that to your gene headers it should work while I try and add a better solution.

Best, MO

Also just FYI, the specific function that does this parsing is parse_prodigal_genes in the file GeneProfile.py (https://github.com/MrOlm/inStrain/blob/master/inStrain/GeneProfile.py)

jdwinkler-lanzatech commented 2 years ago

Okay, I can whip that up easily enough!

jdwinkler-lanzatech commented 2 years ago

Unfortunately I get the same coverage KeyError described in #56. I'll see if I can figure out why.

MrOlm commented 2 years ago

Let me know if this ends up remaining a problem. My guess is that the problem is that the coordinates of the genes are slightly different from how prodigal reports them. InStrain translates the gene into amino acid space based on the gene coordinates, so if they're off-by-one and it ends up with an amount of nucleotides that's not divisible by 3, that'll throw an error.

Let me know if this continues to be an issue, and at the very least I can update inStrain to throw the exception logs into the log file instead of STDOUT so we can at least know what the error code is.

-Matt

jdwinkler-lanzatech commented 2 years ago

That gives me an idea, maybe I'm outputting the coordinates incorrectly. I'll check that and get back to you!

nschan commented 2 years ago

Hi, similar to @jdwinkler-lanzatech I would be interested in being able to use non-prodigal gene calls or annotations. Following the discussion here, I took a look at GeneProfile.py and I noticed that there appears to be a function for parsing gbk files (https://github.com/MrOlm/inStrain/blob/master/inStrain/GeneProfile.py#L811), which is documented as "EXPERIMENTAL; the name of the gene must be in the gene qualifier". I am currently trying to use gbk files from bakta, which to me seem like standard gbk files, here is a sample:

     gene            779..1435
                     /locus_tag="FCBPMO_00010"
     CDS             779..1435
                     /db_xref="RefSeq:WP_056275225.1"
                     /db_xref="SO:0001217"
                     /db_xref="UniParc:UPI0006F201EE"
                     /db_xref="UniRef:UniRef100_A0A0Q9DN46"
                     /db_xref="UniRef:UniRef50_A0A0M2H4C6"
                     /db_xref="UniRef:UniRef90_A0A0Q9DN46"
                     /product="hypothetical protein"
                     /locus_tag="FCBPMO_00010"
                     /protein_id="gnl|Bakta|FCBPMO_00010"
                     /translation="MTEHLQTTAPSPTPIARILTFLHRAWVAELRVYSSIARAIARRPA
                     VPKGGTGIGYHRPVLTVLFIFIGLSAVEIPILDLIVHQWPVVRITLLVLGIWGVTWMVG
                     LLCAMLMRPHAVGPDGVRVRSGLELDVPIGWADIASIAISRRVDEPKQPRVIGSEYAER
                     MQDETNIEIELERPVAIRLPGLAPKGGTHQVDRIRLWADDPRAFLDAARPFLVAA"
                     /codon_start=1
                     /transl_table=11
                     /inference="ab initio prediction:Prodigal:2.6"
                     /inference="similar to AA sequence:RefSeq:WP_056275225.1"

But instrain profile returns the error

File "/usr/local/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 821, in parse_genbank_genes
      gene = feature.qualifiers[gene_name][0]
  KeyError: 'gene'

I am not completely sure what to make of it, should i replace /locus_tag with /gene?

Thank you


Amendment: Replacing /locus_tag with /gene resolves the error above. However the issue described in #56 occurs now. I am using the instrain container (quay.io/biocontainers/instrain:1.6.1--pyhdfd78af_0).

MrOlm commented 2 years ago

Hi @nschan -

Thanks for reaching out and sorry for this mess. The current .gbk parsing is really weird, and really only works for specific weirdly-formatted .gbk files. I have re-written a good part of this code in a developmental version of inStrain that I'm working on but it's not quite ready yet.

I really hope to have this working and pushed to the development branch of inStrain soon; until then all I can say is that prodigal-formatted genes are really the only supported option.

Apologies again, Matt

nschan commented 1 year ago

Hi Matt,

Thanks for getting back to me. Using prodigal for inStrain works very well. Currently I am adding in the annotations later on during analysis, which works fine for me. From my limited experience gbk files tend to be kind of poorly standardized and reliably parsing them can be a bit of a pain.. Maybe more of a feature suggestion, but do you by any chance have plans to expand the snv region information "intergenic" to mark snvs located in non-cds annotations (rRNA, tRNA, etc)?

Kind regards Niklas

MrOlm commented 1 year ago

Hi @nschan -

I will add this to my "To Do" list, but it likely won't be ready for a while. Right now a big reason I do the intragenic gene calling is to call non-synonymous / synonymous mutations, which isn't possible for non-cds annotations, but I do see the value in this.

Best, Matt