MrOlm / inStrain

Bioinformatics program inStrain
MIT License
137 stars 33 forks source link

Coverage discrepancy with inStrain profile #99

Open cswift3 opened 2 years ago

cswift3 commented 2 years ago

Hello, I am using inStrain profile to calculate SNVs for long-read (nanopore) sequencing data. However, I am concerned that inStrain is discounting a large proportion of my input reads. The BAM file that I am using as input to inStrain profile has ~400x coverage. I found that after running inStrain profile, the coverage is between 100 and 200x in most positions. I have checked the ReadFiltering plot (attached to this post below) as well as the mapping_info.tsv and 91% (25,203 out of 27,747) of the reads are passing filter, so I am confused as to why the reported coverage is so much less. The command I used is as follows: inStrain profile -c 10 --pairing_filter all_reads --min_read_ani 0.92 /path/sorted.bam /path/reference.fasta -o /path/output/directory

Please advise ASAP as I am currently in the review process for manuscript. If I cannot get this resolved quickly, unfortunately I will have to use another tool for variant calling.

Thank you, Candice

inStrain_rerun_barcode85_ReadFiltering_plot.pdf

MrOlm commented 2 years ago

Hi Candice,

Hmm interesting. A couple of questions-

1) What program to calculate coverage gave you the count of ~400x?

2) Could you attached the file mapping_info.tsv from the inStrain output file?

Thanks, Matt

cswift3 commented 2 years ago

Hi Matt, Thanks so much for the quick response! I visualized the BAM file in Tablet and IGV. I also used another program that is part of the ARTIC field bioinformatics pipeline (specifically artic minion) to call the variants and it gave me ~400x at the position of interest (contrast to ~50x from instrain profile for this SNS). Exact values as follows: Tablet average coverage depth (for the entire BAM)=317.6x

Tablet coverage at position 23,403=400x IGV coverage at position 23,403=347x ARTIC minion total reads at position 23,403=395x The disagreement between IGV and Tablet at this position is because reads with deletions/insertions are not included in the total reads. When these are added back in, IGV agrees with Tablet and ARTIC.

So right now I have 3 different methods in agreement for the coverage and inStrain is the outlier.

Attached is the mapping_info.tsv. I also tried attaching a zipped BAM and BAI file, but the file was too big to go through email.

Thanks again, Candice

On Sat, Dec 18, 2021 at 2:23 PM Matt Olm @.***> wrote:

Hi Candice,

Hmm interesting. A couple of questions-

1.

What program to calculate coverage gave you the count of ~400x? 2.

Could you attached the file mapping_info.tsv from the inStrain output file?

Thanks, Matt

— Reply to this email directly, view it on GitHub https://github.com/MrOlm/inStrain/issues/99#issuecomment-997272953, or unsubscribe https://github.com/notifications/unsubscribe-auth/AK6JF7GHIA5KQIPYDI4CFF3URTNR3ANCNFSM5KKS7V5Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

MrOlm commented 2 years ago

Hi Candice,

Thanks for all of this- it certainly seems like inStrain is struggling with the long-reads. I'll note that I've never run it with long reads before, but I'd really like it to work with long reads, so I appreciate your help troubleshooting this.

The mapping_info.tsv file you attached didn't work; I believe you have to attach it on GitHub rather than email it. It would also be great if you could attach the SNVs.tsv file as well.

I'll look at the file once I get it, but in thinking through potential problems, do you happen to know if you have multiple reads with the same name? inStrain treats reads with the same name as being the "same", so that could mess with the count?

Best, Matt

cswift3 commented 2 years ago

Hi Matt, Thanks for looking into this. I really appreciate it. I am attaching the two files via GitHub. I had to change the file extensions to TXT, but they are the same as the original TSV output from inStrain. The fastq headers are unique.

Thanks, Candice

inStrain_rerun_barcode85_mapping_info_tsv.txt .

MrOlm commented 2 years ago

OK- I think I got to the root of the problem. Thanks for your help.

When calling SNVs, inStrain uses an empirically generated distribution of "expected" variants based on Q30 bases. So when calling SNVs (and calculating coverage, it seems), it only looks at reads with q30. I assume that because you have long reads, many don't reach this q30 threshold, and only the ones that do are "counted" in the SNV table.

Adjusting the minimum threshold to be something other than Q30 is trivial (it's line 154 in the file "profile_utilities.py"), but adjusting the SNV-calling assumptions and distributions is complicated.

This is something that I will fix, and I'll do it ASAP, but I'm not sure how long it will take. I'll of course ping you as soon as it's finished. But if you need an immediate solution, I'd say your options are to use another program, or to state that only reads with a quality sore of at least Q30 were used when calling SNVs.

Sorry again for the confusion and thanks for your help tracking this down and helping me improve inStrain!

-Matt

cswift3 commented 2 years ago

Hi Matt, Great! I'm glad you got to the bottom of it! Definitely let me know when you have a fix and we'll see where I'm at in responding to other reviewer requests. I really appreciate your help.

Thanks again, Candice