MrOlm / inStrain

Bioinformatics program inStrain
MIT License
134 stars 33 forks source link

inStrain profile not profiling all genes #167

Closed rebeccasophiasalcedo closed 4 months ago

rebeccasophiasalcedo commented 4 months ago

Hi!

I'm running inStrain on a bunch of whole metagenome assemblies to calculate gene by gene coverage in each sample. It works like a dream on all of my samples except for one, where it's not profiling the majority of my genes or scaffolds.

Here's my command for that sample specifically: inStrain profile /scratch/users/rsalcedo/4500m1000m_site/bowtie/bams/4500m1000m_reads_to_contigs_sorted.bam /scratch/users/rsalcedo/OC17wc_sequences/assemblies/4500m1000m_contigs_min1000.fa -o /scratch/users/rsalcedo/4500m1000m_site/inStrain/4500m1000m_inStrain/ -g /scratch/users/rsalcedo/OC17wc_sequences/prodigal_annotations/fna/4500m1000m_contigs_min1000_prodigalORFs.fna -p 32 --skip_mm_profiling --skip_plot_generation -c 1

I ran inStrain version 1.7.1 and my bam was generated with bowtie2.

Right now the gene_info.tsv only has 911 entries and the scaffold_info.tsv entry has 1105 entries. My whole assembly has 983997 genes and 385088 scaffolds. I dug into the raw_data directory and saw that the genes_table.csv has the correct number of entries, but the genes_coverage.csv (2692) and scaffold_list.txt (1105) are both off.

I checked the mapping_info.tsv and it has all 385088 scaffolds listed but looks like most of my scaffolds didn't have any mapped reads. I went back and checked my bam file and it looks okay. I also ran coverM on my assembly and all contigs recruited reads, with ~82% recruiting at 5x coverage or more, so there's a discrepancy between inStrain that says my scaffolds have no recruited reads vs coverM/my bam file that says that there is recruitment.

I think it has something to do with how inStrain is reading in the coverage info for this particular bam file? I'm not sure though. If you have any advice it'd be much appreciated! Here is my log.log file, let me know if I can get you any more information to help problem solve.

Thank you!

MrOlm commented 4 months ago

Hi @rebeccasophiasalcedo -

Thank you for the detailed bug report! Based on what you've described, it does seem like there's a problem with inStrain reading the reads mapped to some of your scaffolds. This could be due to 1) a mismatch between the scaffold names in the .fasta file and the scaffold names in the .bam file, 2) a weird character in the names of the .fasta file or .bam file, or 3) some other bug with pysam (what inStrain uses to load the reads). A couple of questions-

About how many other samples did you run that worked? A few, or many?

Did the other samples use the same .fasta file? I there anything different about how this sample was processed as compared to the other files?

Thanks, Matt

MrOlm commented 4 months ago

Also in looking at the log file, inStrain is reporting that it's removing ~50% of reads during filtering. Could you provide the mapping_info.tsv file as well?

Thanks, Matt

rebeccasophiasalcedo commented 4 months ago

I have a total of 28 samples and had no problems with the other 27! And they all used their respective assembly fasta, so the .fa file used in this command is only ever used here for this analysis. Though I generated MAGs using all assemblies (incl. this one) and cross-mapped for metaBAT and ran into no issues there.

I just checked my sam and fasta and the headers look alright. I didn't catch any whitespace characters either.

head 4500m1000m_reads_to_contigs.sam @HD VN:1.5 SO:unsorted GO:query @SQ SN:OC1703_4500m_1000m_sens_contig_1483280 LN:1041 @SQ SN:OC1703_4500m_1000m_sens_contig_3485708 LN:1284

head 4500m1000m_contigs_min1000.fa

OC1703_4500m_1000m_sens_contig_4153184 OC1703_4500m_1000m_sens_contig_593313

Here's the mapping_info.tsv file!

rebeccasophiasalcedo commented 4 months ago

oh and when I ran coverM I required 95% ANI too so the recruitment of coverM and what inStrain's default requirements are for mapping are comparable!

MrOlm commented 4 months ago

OK thanks for this. In looking at your mapping_info.tsv file, it looks to be like inStrain thinks that all the reads mapping to lots of the contigs are singletons (unpaired reads). This could be because they are, or because of some sort of bug in inStrain / pysam that's make it think they are.

You could confirm one way or the other by trying to filter the .bam file to remove unpaired reads and then running coverM again (or something like that).

Or, if you could just add --pairing_filter all_reads to just keep the singleton reads and not filter them out.

-Matt

rebeccasophiasalcedo commented 4 months ago

okay I'll give both of those a shot and report back, thank you!

rebeccasophiasalcedo commented 4 months ago

running it with --pairing_filter all_reads worked! I think it's because my reads actually are all singletons and probably not a bug based on some samtools digging, thank you so much for helping!