MrOlm / inStrain

Bioinformatics program inStrain
MIT License

inStrain compare: {genome} is in input {input} but not the provided stb file #178

Closed. Sanrrone closed this issue 2 months ago

Sanrrone commented 3 months ago

Hello, I got the following error related to the .stb file:

#running command
inStrain compare -o HeP-1057_compare -p 4 -s /scratch/project_2007362/software/HumGutDB/hg.tsv -i HeP-1057-10 HeP-1057-11 HeP-1057-12 HeP-1057-13 HeP-1057-14 HeP-1057-15 HeP-1057-16 HeP-1057-17 HeP-1057-6 HeP-1057-7 HeP-1057-8 HeP-1057-9 --database_mode -d
#error
Scaffold to bin was made using .stb file
22903 is in input HeP-1057-10 but not the provided stb file!
Traceback (most recent call last):
  File "/scratch/project_2007362/software/mambaforge/envs/instrain/bin/inStrain", line 31, in <module>
    inStrain.controller.Controller().main(args)
  File "/scratch/project_2007362/software/mambaforge/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 57, in main
    self.compare_operation(args)
  File "/scratch/project_2007362/software/mambaforge/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 89, in compare_operation
    inStrain.compare_controller.CompareController(args).main()
  File "/scratch/project_2007362/software/mambaforge/envs/instrain/lib/python3.8/site-packages/inStrain/compare_controller.py", line 46, in main
    self.parse_arguments()
  File "/scratch/project_2007362/software/mambaforge/envs/instrain/lib/python3.8/site-packages/inStrain/compare_controller.py", line 133, in parse_arguments
    scaffolds = inStrain.compare_utils.find_relevant_scaffolds(input, bin2scaffolds, self.kwargs)
  File "/scratch/project_2007362/software/mambaforge/envs/instrain/lib/python3.8/site-packages/inStrain/compare_utils.py", line 143, in find_relevant_scaffolds
    raise Exception(f'{genome} is in input {input} but not the provided stb file!')
Exception: 22903 is in input HeP-1057-10 but not the provided stb file!

My .stb file (-s /scratch/project_2007362/software/HumGutDB/hg.tsv) is a two-column, tab-separated file in the format scaffold\tbin, and it is the same file used in the profile step.

#head /scratch/project_2007362/software/HumGutDB/hg.tsv
kraken:taxid|3017012|HumGut_17012_1 1
kraken:taxid|3017907|HumGut_17907_1 2
kraken:taxid|3022015|HumGut_22015_1 3
kraken:taxid|3027808|HumGut_27808_1 4
kraken:taxid|3013491|HumGut_13491_1 5
kraken:taxid|3013343|HumGut_13343_1 6
kraken:taxid|3022605|HumGut_22605_1 7
kraken:taxid|3021670|HumGut_21670_1 8
kraken:taxid|3015643|HumGut_15643_1 9
kraken:taxid|3018628|HumGut_18628_1 10
...

I tried reducing the number of samples, but I still get the same error (with a different bin). What could I do?

Thanks in advance, Sandro

MrOlm commented 3 months ago

Hello,

This problem is a result of the .stb file used in the profile step being different from the one used in the compare step. Specifically, if you look at the genome_info.tsv file of your sample "HeP-1057-10" you will find the genome "22903", but this genome is not in the provided .stb file.
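One way to run this check by hand is to compare the two sets of names as plain text. The sketch below is only an illustration and not part of inStrain: the function names are made up, it assumes genome_info.tsv is tab-separated with a header column named `genome`, and it assumes the .stb file has two tab-separated columns (scaffold, bin). Both paths are supplied by the caller.

```python
import csv

def read_stb_bins(stb_path):
    """Return the set of bin names (second column) found in a scaffold-to-bin file."""
    bins = set()
    with open(stb_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:  # skip blank or malformed lines
                bins.add(row[1].strip())
    return bins

def read_profile_genomes(genome_info_path, column="genome"):
    """Return the set of genome names in a profile's genome_info.tsv.

    The column name "genome" is an assumption about the file layout.
    """
    with open(genome_info_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[column].strip() for row in reader}

def genomes_missing_from_stb(genome_info_path, stb_path):
    """List genomes present in the profile but absent from the .stb file.

    Everything is compared as text, so "22903" in one file matches "22903" in the other.
    """
    return sorted(read_profile_genomes(genome_info_path) - read_stb_bins(stb_path))
```

An empty list means the names agree as text in both files.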

Best, Matt

Sanrrone commented 3 months ago

Indeed, '22903' is in the genome_info file. However, the .stb file (hg.tsv in my case) is the same in both runs (profile and compare).

$ grep -w 22903 /scratch/project_2007362/software/HumGutDB/hg.tsv
kraken:taxid|3020030|HumGut_20030_1 22903

So I do not understand why the error occurs. Is it because I am using numbers instead of strings in the bin column?

Just in case, my parameters are:

#profile
inStrain profile --use_full_fasta_header -p $c -c 7 --min_scaffold_reads 7 -s $new/software/HumGutDB/hg.tsv --skip_plot_generation -o ${sname} $bam $new/software/HumGutDB/hg.fasta --database_mode

#compare
inStrain compare -o ${hep}_compare --skip_plot_generation -p $c -s $new/software/HumGutDB/hg.tsv -i $samples --database_mode -d

MrOlm commented 3 months ago

Hello,

Urg, I do worry it might be due to the numbers instead of strings for bin names. I thought I fixed this a few years ago, but it's possible that I only fixed it for profile and not compare.
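To illustrate how numeric names can break a lookup, here is a minimal sketch. It assumes, without confirmation from the inStrain source, that one side keeps bin names as text while the other ends up holding integers (for example through type inference when a table is loaded). The variable names are hypothetical.

```python
# Hypothetical illustration only; this is not inStrain's code.
# Bin names read from the .stb file and kept as text:
bin2scaffolds = {"22903": ["kraken:taxid|3020030|HumGut_20030_1"]}

genome_as_int = 22903    # the same name after a parser inferred an integer type
genome_as_str = "22903"  # the same name kept as text

found_as_int = genome_as_int in bin2scaffolds          # int key never equals a str key
found_as_str = genome_as_str in bin2scaffolds          # text matches text
found_after_cast = str(genome_as_int) in bin2scaffolds # casting back to text restores the match
```

Under that assumption the integer lookup fails even though the name is visibly present in the file, which is consistent with the `grep` result above.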

If you could please confirm that you're running the most recent version of inStrain, that would be ideal. If so, this is likely a number / string problem that I need to fix. As a workaround, adding a letter to your bin names (even just an "a" in front of all of them) should fix the issue.
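The renaming workaround can be sketched as a small script. This is an illustration under stated assumptions, not an inStrain utility: `prefix_bin_names` is a made-up helper, the .stb file is assumed to have two tab-separated columns (scaffold, bin), and the default prefix "a" simply follows the suggestion above.

```python
def prefix_bin_names(stb_in, stb_out, prefix="a"):
    """Copy a two-column .stb file, prepending `prefix` to every bin name.

    Hypothetical helper: scaffold names are written back unchanged, and
    blank or malformed lines are skipped.
    """
    with open(stb_in) as src, open(stb_out, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # not a scaffold<TAB>bin line
            scaffold, bin_name = fields[0], fields[1]
            dst.write(f"{scaffold}\t{prefix}{bin_name}\n")
```

Because the genome names stored in an existing profile come from the .stb file used at profile time, the profiles presumably need to be regenerated with the renamed file too. That is an assumption, not something confirmed in this thread.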

Apologies, Matt

Sanrrone commented 3 months ago

I installed it via conda

inStrain -h

                ...::: inStrain v1.8.0 :::...

  Matt Olm and Alex Crits-Christoph. MIT License. Banfield Lab, UC Berkeley.

  Choose one of the operations below for more detailed help. See https://instrain.readthedocs.io for documentation.
  Example: inStrain profile -h

  Main operations:
    profile           -> Create an inStrain profile (microdiversity analysis) from a mapping file
    compare            -> Compare multiple inStrain profiles (popANI, coverage_overlap, etc.)

  Auxiliary operations:
    check_deps        -> Print a list of dependencies, versions, and whether they're working
    parse_annotations -> Run a number of outputs based a table of gene annotations 
    quick_profile     -> Quickly calculate coverage and breadth of a mapping using coverM
    filter_reads      -> Commands related to filtering reads from .bam files
    plot              -> Make figures from the results of "profile" or "compare"
    other             -> Other miscellaneous operations

Best, Sandro

Sanrrone commented 2 months ago

Just to close out the issue: it was solved by giving the bins non-numeric names, as you suggested.

Thank you!