MrOlm / inStrain

Bioinformatics program inStrain
MIT License

Indexing error in IS_gene_info.tsv #144

Closed Winshipe closed 1 year ago

Winshipe commented 1 year ago

Hi Matt,

I'm getting an indexing error in the IS_gene_info.tsv file. It seems that after certain genes are removed from the analysis, there can be a mismatch between the gene name and its coordinates. In the example below we see that gene 40 in the gene_info file assumes the coordinates of gene 39 in the FNA after genes 36-38 in the FNA are excluded. I've seen this issue repeated in different contigs and in different gene_info files.

What conditions lead to genes being excluded from the file? Some genes in the test data are excluded (e.g. N5_271_010G1_scaffold_2_26), but they don't seem to suffer from this indexing issue.

Here in the gene info file we have (some numbers truncated for readability):

contig_10143 contig_10143_35 678.0 1.6 0.7 0.0 24453 25130 1 False 0.0 0.0 0.0 0.0 0.0 0.0 0.0
contig_10143 contig_10143_39 393.0 1.0 0.1 0.1 0.0 27656 28048 1 False 4.0 0.0 0.0 0.0 0.0 0.0 4.0
contig_10143 contig_10143_40 525.0 6.0 0.9 0.7 0.0 27982 28506 -1 False 0.2451237263464334 0.1002778880508137 18.0 11.0 3.0 5.0 3.0 2.0 23.0

whereas the FNA file looks like this:

>contig_10143_35 # 24454 # 25131 # 1 # ID=2_35;partial=00;start_type=ATG;rbs_motif=GGGGG;rbs_spacer=4bp;gc_cont=0.674
>contig_10143_36 # 25403 # 26980 # 1 # ID=2_36;partial=00;start_type=GTG;rbs_motif=None;rbs_spacer=None;gc_cont=0.696
>contig_10143_37 # 26996 # 27328 # 1 # ID=2_37;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.658
>contig_10143_38 # 27325 # 27621 # 1 # ID=2_38;partial=00;start_type=GTG;rbs_motif=GGAG;rbs_spacer=6bp;gc_cont=0.704
>contig_10143_39 # 27983 # 28507 # -1 # ID=2_39;partial=00;start_type=GTG;rbs_motif=None;rbs_spacer=None;gc_cont=0.697
>contig_10143_40 # 28501 # 30591 # -1 # ID=2_40;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.716
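For anyone who wants to check their own files for the same mismatch, here is a rough sketch of how the comparison could be automated. It is not part of inStrain. The function names are hypothetical, the header pattern assumes Prodigal-style FNA headers like the ones above, and the `offset=1` default is an assumption drawn only from gene 35 in this example, where the gene_info coordinates are one less than the FNA coordinates. The FNA headers in the demo are shortened after `partial=00` for brevity.

```python
import re

# Prodigal-style header: ">name # start # end # strand # attributes"
HEADER_RE = re.compile(r"^>(\S+) # (\d+) # (\d+) # (-?1) #")


def parse_fna_headers(lines):
    """Return {gene: (start, end)} from Prodigal-style FASTA header lines."""
    coords = {}
    for line in lines:
        m = HEADER_RE.match(line)
        if m:
            coords[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return coords


def find_mismatches(gene_info, fna_coords, offset=1):
    """List genes whose gene_info (start, end) + offset differ from the FNA.

    offset=1 is an assumption based on gene 35 in the example above,
    where gene_info coordinates are one less than the FNA coordinates.
    Genes missing from the FNA are also reported.
    """
    bad = []
    for gene, (start, end) in gene_info.items():
        if fna_coords.get(gene) != (start + offset, end + offset):
            bad.append(gene)
    return bad


# Coordinates copied from the example in this issue.
fna = parse_fna_headers([
    ">contig_10143_35 # 24454 # 25131 # 1 # ID=2_35;partial=00",
    ">contig_10143_39 # 27983 # 28507 # -1 # ID=2_39;partial=00",
    ">contig_10143_40 # 28501 # 30591 # -1 # ID=2_40;partial=00",
])
info = {
    "contig_10143_35": (24453, 25130),
    "contig_10143_39": (27656, 28048),
    "contig_10143_40": (27982, 28506),
}
mismatched = find_mismatches(info, fna)
print(mismatched)
```

On the rows shown here, gene 35 passes and genes 39 and 40 are flagged. In practice the `info` dictionary would be built from the gene name, start, and end columns of the gene_info file.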

Thanks for your help, Eamon

MrOlm commented 1 year ago

Hi Eamon,

Thanks for posting this interesting problem. So that I can troubleshoot, would you mind 1) letting me know the command that you ran to generate this file, 2) letting me know the version of inStrain that you're running, and 3) attaching the input genes files, the output gene_info file, and the log file?

Thanks again, Matt

Winshipe commented 1 year ago

Hi Matt,

I inherited this project from someone else and unfortunately it seems like the original log files have been lost. I haven't seen this issue in the re-run analysis (still using IS v1.5.7), and I haven't been able to reproduce it with the test data either (using IS v1.7.1). My pet theory is that it could be due to improper filtering of the BAM files?

Thanks and sorry for the bother, Eamon

MrOlm commented 1 year ago

OK interesting. I'm glad it's not being too much of a problem anymore, but please reach out if you hit the problem again.

Best, Matt