Open jdwinkler-lanzatech opened 2 years ago
Hello,
Yes, at the moment this is a pretty glaring limitation of inStrain. Fixing this has been near the top of my ToDo list for a while.
Can I ask what format your genes are in? There is experimental support for gene input in the form of genbank files, but that feature is not very well tested.
Best, MO
Nothing fancy, it's basically just a fasta file as follows:
>locus tag 1
ATG...
...
TAG
I'm not sure what information inStrain is extracting from the prodigal-do you need coordinates? Maybe the file + GFF annotations might work better than gene calls?
Here is an example prodigal header:
>NC_000913_4 # 3734 # 5020 # 1 # ID=1_4;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.528
The things inStrain picks up from that are:
1) The scaffold, which is everything before the last _
(in this case the scaffold is NC_000913
)
2) The start location, which is after the first #
(in this case 3734; note that prodigal uses 1-based indexing)
3) The end location, which is after the second #
(in this case 5020)
4) The strand, which is after the thing #
(which in this case is 1
. The options are 1
or -1
If you add all that to your gene headers it should work while I try and add a better solution.
Best, MO
Also just FYI, the specific function that does this parsing is parse_prodigal_genes
in the file GeneProfile.py (https://github.com/MrOlm/inStrain/blob/master/inStrain/GeneProfile.py)
Okay, I can whip that up easily enough!
Unfortunately I get the same coverage KeyError described in #56. I'll see if I can figure out why.
Let me know if this ends up remaining a problem. My guess is that the problem is that the coordinates of the genes are slightly different from how prodigal reports them. InStrain translates the gene into amino acid space based on the gene coordinates, so if they're off-by-one and it ends up with an amount of nucleotides that's not divisible by 3, that'll throw an error.
Let me know if this continues to be an issue, and at the very least I can update inStrain to throw the exception logs into the log file instead of STDOUT so we can at least know what the error code is.
-Matt
That gives me an idea, maybe I'm outputting the coordinates incorrectly. I'll check that and get back to you!
Hi, similar to @jdwinkler-lanzatech I would be interested in being able to use non-prodigal gene calls or annotations. Following the discussion here, I took a look at GeneProfile.py and I noticed that there appears to be a function for parsing gbk files (https://github.com/MrOlm/inStrain/blob/master/inStrain/GeneProfile.py#L811), which is documented as "EXPERIMENTAL; the name of the gene must be in the gene qualifier". I am currently trying to use gbk files from bakta, which to me seem like standard gbk files, here is a sample:
gene 779..1435
/locus_tag="FCBPMO_00010"
CDS 779..1435
/db_xref="RefSeq:WP_056275225.1"
/db_xref="SO:0001217"
/db_xref="UniParc:UPI0006F201EE"
/db_xref="UniRef:UniRef100_A0A0Q9DN46"
/db_xref="UniRef:UniRef50_A0A0M2H4C6"
/db_xref="UniRef:UniRef90_A0A0Q9DN46"
/product="hypothetical protein"
/locus_tag="FCBPMO_00010"
/protein_id="gnl|Bakta|FCBPMO_00010"
/translation="MTEHLQTTAPSPTPIARILTFLHRAWVAELRVYSSIARAIARRPA
VPKGGTGIGYHRPVLTVLFIFIGLSAVEIPILDLIVHQWPVVRITLLVLGIWGVTWMVG
LLCAMLMRPHAVGPDGVRVRSGLELDVPIGWADIASIAISRRVDEPKQPRVIGSEYAER
MQDETNIEIELERPVAIRLPGLAPKGGTHQVDRIRLWADDPRAFLDAARPFLVAA"
/codon_start=1
/transl_table=11
/inference="ab initio prediction:Prodigal:2.6"
/inference="similar to AA sequence:RefSeq:WP_056275225.1"
But instrain profile returns the error
File "/usr/local/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 821, in parse_genbank_genes
gene = feature.qualifiers[gene_name][0]
KeyError: 'gene'
I am not completely sure what to make of it, should i replace /locus_tag with /gene?
Thank you
Amendment: Replacing /locus_tag with /gene resolves the error above. However the issue described in #56 occurs now. I am using the instrain container (quay.io/biocontainers/instrain:1.6.1--pyhdfd78af_0).
Hi @nschan -
Thanks for reaching out and sorry for this mess. The current .gbk parsing is really weird, and really only works for specific weirdly-formatted .gbk files. I have re-written a good part of this code in a developmental version of inStrain that I'm working on but it's not quite ready yet.
I really hope to have this working and pushed to the development branch of inStrain soon; until then all I can say is that prodigal-formatted genes are really the only supported option.
Apologies again, Matt
Hi Matt,
Thanks for getting back to me. Using prodigal for inStrain works very well. Currently I am adding in the annotations later on during analysis, which works fine for me. From my limited experience gbk files tend to be kind of poorly standardized and reliably parsing them can be a bit of a pain.. Maybe more of a feature suggestion, but do you by any chance have plans to expand the snv region information "intergenic" to mark snvs located in non-cds annotations (rRNA, tRNA, etc)?
Kind regards Niklas
Hi @nschan -
I will add this to my "To Do" list, but it likely won't be ready for a while. Right now a big reason I do the intragenic gene calling is to call non-synonymous / synonymous mutations, which isn't possible for non-cds annotations, but I do see the value in this.
Best, Matt
Hi,
I have a set of already called genes (not from prodigal) that I would like to use as input to inStrain. Is there any easy way to do this? Based on a previous issue (#56 specifically), it appears that inStrain depends on prodigal input specifically.