MrOlm / inStrain

Bioinformatics program inStrain
MIT License
134 stars 33 forks source link

Troubleshooting inStrain profile Step 2. Profile scaffolds #141

Closed jmartin77777 closed 1 year ago

jmartin77777 commented 1 year ago

I'm running inStrain profile on a set of samples that have been mapped to the Unified Human Gut Genomes (UHGG) v2.0.1, and all the samples are failing in 'Step 2. Profile scaffolds'. Immediately prior to this execution I had run inStrain profile on the same samples mapped against UHGG v1.0 successfully, so its likely the problem is with something I'm doing but I can't figure it out. For UHGG v1.0 you had prepared all the needed assets and made them available on your website, so for that version I was using your files. For the problematic jobs I'm using assets I created following the instructions from the user manual. Its possible that the error lies in one of those files. The error output I'm seeing is this:

Scaffold to bin was made using .stb file
4610 of 4744 genomes have less than 1x estimated coverage
685016 of the original 697011 scaffolds are removed (685016 have a low coverage genome; 0 have no 
genome)
3.2% of reads were removed during filtering
14,739,084 read pairs remain (4.194 Gbp)
***************************************************
.:: inStrain profile Step 2. Profile scaffolds ::..
***************************************************

Traceback (most recent call last):
  File "/opt/conda/bin/inStrain", line 31, in <module>
    inStrain.controller.Controller().main(args)
  File "/opt/conda/lib/python3.7/site-packages/inStrain/controller.py", line 53, in main
    self.profile_operation(args)
  File "/opt/conda/lib/python3.7/site-packages/inStrain/controller.py", line 82, in profile_operat
ion
    ProfileController(args).main()
  File "/opt/conda/lib/python3.7/site-packages/inStrain/controller.py", line 147, in main
    self.run_profile()
  File "/opt/conda/lib/python3.7/site-packages/inStrain/controller.py", line 315, in run_profile
    **self.kwargs)
  File "/opt/conda/lib/python3.7/site-packages/inStrain/profile/__init__.py", line 18, in profile_
bam
    bam, fasta_db, sR2M, ISP_loc, **kwargs).main()
  File "/opt/conda/lib/python3.7/site-packages/inStrain/profile/profile_controller.py", line 48, i
n main
    self.gen_prof_args()
  File "/opt/conda/lib/python3.7/site-packages/inStrain/profile/profile_controller.py", line 78, i
n gen_prof_args
    self.gen_genes_args()
  File "/opt/conda/lib/python3.7/site-packages/inStrain/profile/profile_controller.py", line 89, i
n gen_genes_args
    scaff2geneinfo, scaff2gene2sequence = inStrain.GeneProfile.parse_genes(gene_file, **self.kwarg
s)
  File "/opt/conda/lib/python3.7/site-packages/inStrain/GeneProfile.py", line 754, in parse_genes
    return parse_prodigal_genes(gene_file_loc)
  File "/opt/conda/lib/python3.7/site-packages/inStrain/GeneProfile.py", line 781, in parse_prodig
al_genes
    table['direction'].append(record.description.split("#")[3].strip())
IndexError: list index out of range

I've been over the UHGG v2.0.1 files I made and they seem OK, and I noticed in the error output it mentions the Prodigal program. My first thought was that its somehow getting the wrong gene annotations, but then I realized that I provided no gene annotation. Prodigal is making the gene calls, so if its an issue with the tool not accepting the output of Prodigal that may actually be something internal. Does anyone have any advice on what this error may be?

I'm using an old version of inStrain (v1.5.3) to make the outputs comparable with some older work I've done, but if necessary I could try re-doing everything using the latest version. I'm just hoping maybe the error message might suggest something obvious before I start redoing everything

MrOlm commented 1 year ago

Hi @jmartin77777 - indeed this is almost definitely a problem with prodigal. My hunch is that the gene input file being provided to inStrain is incorrect; prodigal makes a bunch of different genes files and the .fna one is the one you'll need to provide to inStrain.

Could you send me a head of the genes files you're providing to inStrain?

Best, Matt

jmartin77777 commented 1 year ago

I think I've found my problem, I pulled the gene fasta directly from the UHGG ftp site and concatenated them to build my 'UHGG_reps.genes.fna' gene file. But their annotations were not built by prodigal, or maybe they stripped the Prodigal annotation from the headers when they did functional annotation. I was comparing the UHGG v1.0 assets that you built to what I had built, and your headers include the prodigal info, eg.:

GUT_GENOME000001_1_1 # 1 # 81 # 1 # ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.210

The genes fasta I built for v2.0.1 had headers like this:

MGYG000002944_00001 Carbamoyl-phosphate synthase small chain

and I noticed in the error out you are doing a split on '#', and at that point everything clicked. I am running prodigal now and will re-run inStrain profile on my mapped data again using a genes fasta built as your tutorial suggests. I'm pretty sure this was my issue, entirely my fault. Sorry for the trouble

MrOlm commented 1 year ago

Great! Thank you for the update. Please let me know if this doesn't fix the problem.

Best, Matt