Closed jonperdomo closed 2 years ago
Good reference for Python CI workflow: Github Docs
LongReadSum, NanoPlot, and my own MATLAB implementation yield identical results for N50 using sample input FASTA files (individual and combined). These will be used for testing the FASTA output: | File | N50 |
---|---|---|
all_chr.hap1.cns.fa | 23758872 | |
all_chr.hap2.cns.fa | 21856627 | |
Both files | 23544988 |
Results from our 50-read sample dataset (Mine = My own MATLAB N50 implementation)
Tool | File | N50 |
---|---|---|
LongReadSum | FASTQ | 8731 |
NanoPlot | FASTQ | 8733 |
Mine | FASTQ | 8733 |
LongReadSum | BAM | 7415 |
NanoPlot | BAM | 6398 |
Mine | BAM | 6398 |
PycoQC | BAM + sequencing_summary.txt | Basecalls = 8730 Alignments = 3500 |
MinIONQC | sequencing_summary.txt | 8733 |
Also check number of alignment (mapped and unmapped), total bases, total mapped bases. N50 is a value that is not too informative by itself; it is important in genome assembly, but not in characterizing sequenced reads. You can extract fastq/fasta file from BAM file as well using bedtools, which should help diagnose the discrepancies.
On Tue, Jun 14, 2022 at 4:41 PM Jonathan Perdomo @.***> wrote:
Results from our 50-read sample dataset (Mine = My own MATLAB N50 implementation) Tool File N50 LongReadSum FASTQ 8731 NanoPlot FASTQ 8733 Mine FASTQ 8733 LongReadSum BAM 7415 NanoPlot BAM 6398 Mine BAM 6398 PycoQC BAM + sequencing_summary.txt Basecalls = 8730 Alignments = 3500 MinIONQC sequencing_summary.txt 8733
— Reply to this email directly, view it on GitHub https://github.com/WGLab/LongReadSum/issues/9#issuecomment-1155691310, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNG3OBV3WI4VXD6W373MCTVPDU5XANCNFSM5XEWL7DA . You are receiving this because you are subscribed to this thread.Message ID: @.***>
Results from our 50-read sample dataset (Mine = My own MATLAB N50 implementation)
Tool | File | Read count | Total bases | Total mapped bases | Number of aligned reads | Number of unaligned reads | N50 |
---|---|---|---|---|---|---|---|
LongReadSum | FASTQ | 50 | 340189 | NA | NA | NA | 8731 |
NanoPlot | FASTQ | 50 | 340189 | NA | NA | NA | 8733 |
Mine | FASTQ | 50 | 340189 | NA | NA | NA | 8733 |
LongReadSum | BAM | 50 | 340189 | 304323 | 50 | 0 | 7415 |
NanoPlot | BAM | 106 |
493817 | 314962 | NA | NA | 6398 |
Mine | BAM | 50 primary, 106 primary + secondary,supplementary |
340189 primary, 493817 primary + secondary,supplementary |
161334 primary, 314962 primary + secondary,supplementary (via samtools) |
50 | 0 | 8733 primary, 6398 primary + secondary,supplementary |
PycoQC | BAM + sequencing_summary.txt | 50 | 340189 | 161334 | 50 | 0 | Basecalls = 8730 Alignments = 3500 |
MinIONQC | sequencing_summary.txt | 50 | 340200 | NA | NA | NA | 8733 |
LongReadSum | FASTQ from BAM (via bedtools) |
106 | 493817 | NA | NA | NA | 6397 |
NanoPlot | FASTQ from BAM (via bedtools) |
106 | 493817 | NA | NA | NA | 6398 |
Are the BAM results for NanoPlot directly reported by this software? What is the raw output? (does it offer any clues why read count is 106?)
On Sat, Jun 18, 2022 at 1:54 PM Jonathan Perdomo @.***> wrote:
Results from our 50-read sample dataset (Mine = My own MATLAB N50 implementation) Tool File Read count Total bases Total mapped bases Number of aligned reads Number of unaligned reads N50 LongReadSum FASTQ 50 340189 NA NA NA 8731 NanoPlot FASTQ 50 340189 NA NA NA 8733 Mine FASTQ 8733 LongReadSum BAM 50 340189 304323 50 0 7415 NanoPlot BAM 106 (=primary + supplementary alignments?) 493817 314962 NA NA 6398 Mine BAM 6398 PycoQC BAM + sequencing_summary.txt Basecalls = 8730 Alignments = 3500 MinIONQC sequencing_summary.txt 8733 LongReadSum FASTQ from BAM (via bedtools) NanoPlot FASTQ from BAM (via bedtools) Mine FASTQ from BAM (via bedtools)
— Reply to this email directly, view it on GitHub https://github.com/WGLab/LongReadSum/issues/9#issuecomment-1159524146, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNG3OEIBBFDMEDXPKDDHO3VPYENTANCNFSM5XEWL7DA . You are receiving this because you commented.Message ID: @.***>
No, I run NanoPlot separately to compare results. This is the raw NanoPlot output for the BAM file:
General summary:
Average percent identity: 87.4 Fraction of bases aligned: 0.6 Mean read length: 4,658.7 Mean read quality: 9.7 Median percent identity: 88.3 Median read length: 3,579.5 Median read quality: 9.2 Number of reads: 106.0 Read length N50: 6,398.0 STDEV read length: 3,829.5 Total bases: 493,817.0 Total bases aligned: 314,962.0 Number, percentage and megabases of reads above quality cutoffsQ5: 106 (100.0%) 0.5Mb Q7: 99 (93.4%) 0.5Mb Q10: 32 (30.2%) 0.2Mb Q12: 15 (14.2%) 0.1Mb Q15: 5 (4.7%) 0.0Mb Top 5 highest mean basecall quality scores and their read lengths 1: 17.9 (1239; bf3ca0c2-c10a-40f5-bf35-4b6d2ba8479a) 2: 17.4 (3242; ff7bd978-ce99-42fb-a332-9fdf78f90701) 3: 16.9 (3103; e7a151f2-7620-4d0e-9d84-41e8dc4cab5f) 4: 16.5 (6557; e7a151f2-7620-4d0e-9d84-41e8dc4cab5f) 5: 16.4 (7050; ff7bd978-ce99-42fb-a332-9fdf78f90701) Top 5 longest reads and their mean basecall quality score 1: 19696 (7.5; a3e904d3-7f4d-4ba8-bb61-77505f026a1f) 2: 17825 (11.1; cf8eab27-821c-44bd-bfac-023299c37aed) 3: 16511 (9.7; 8631ced5-cfa0-4ff1-af7b-5abba3f96fa9) 4: 15263 (9.2; 2960e123-09ed-4596-ae15-10027f4b375f) 5: 14473 (8.5; ecee2aa6-765e-4ba1-a67f-7130f6e87517)
LongReadSum has more information on read counts for alignments, which indicate that possibly the supplementary aligments are being included in the total read count:
I tested this by filtering out unmapped, secondary, and supplementary alignments via the SAMTools view function. When filtered out, the number of reads output by NanoPlot is correct (=50):
General summary:
Average percent identity: 91.7 Fraction of bases aligned: 0.5 Mean read length: 6,803.8 Mean read quality: 10.1 Median percent identity: 93.2 Median read length: 5,194.0 Median read quality: 9.6 Number of reads: 50.0 Read length N50: 8,733.0 STDEV read length: 4,256.2 Total bases: 340,189.0 Total bases aligned: 161,334.0 Number, percentage and megabases of reads above quality cutoffsQ5: 50 (100.0%) 0.3Mb Q7: 50 (100.0%) 0.3Mb Q10: 18 (36.0%) 0.1Mb Q12: 7 (14.0%) 0.0Mb Q15: 2 (4.0%) 0.0Mb Top 5 highest mean basecall quality scores and their read lengths 1: 16.5 (6557; e7a151f2-7620-4d0e-9d84-41e8dc4cab5f) 2: 16.4 (7050; ff7bd978-ce99-42fb-a332-9fdf78f90701) 3: 14.3 (6925; e621bda1-0721-402e-9ca4-49cd42ff41cc) 4: 13.9 (5314; bad54463-4c8b-4007-8375-2df0687508f8) 5: 13.6 (3505; 00a29863-c29f-461c-bd56-1943fd2711a8) Top 5 longest reads and their mean basecall quality score 1: 19696 (7.5; a3e904d3-7f4d-4ba8-bb61-77505f026a1f) 2: 17825 (11.1; cf8eab27-821c-44bd-bfac-023299c37aed) 3: 16511 (9.7; 8631ced5-cfa0-4ff1-af7b-5abba3f96fa9) 4: 15263 (9.2; 2960e123-09ed-4596-ae15-10027f4b375f) 5: 14473 (8.5; ecee2aa6-765e-4ba1-a67f-7130f6e87517)
I used LongReadSum to verify that this filtered BAM file only contains the primary alignments:
okay I see, thank you.
@jonperdomo "I used LongReadSum to verify that this filtered BAM file only contains the primary alignments:" LongReadSum has a filter where you can set other alignments to be false and only the alignments with true will be considered. You can change the program to allow users to specify those filter setting so that you can focus on the primary alignments only.
From this updated table it is clear that most differences in basic output statistics are due to using primary alignments only vs. primary + secondary, supplementary alignments. This should probably be specified in the output. Some remaining differences:
Tool | File | Read count | Total bases | Total mapped bases | Number of aligned reads | Number of unaligned reads | N50 |
---|---|---|---|---|---|---|---|
LongReadSum | FASTQ | 50 | 340189 | NA | NA | NA | 8731 |
NanoPlot | FASTQ | 50 | 340189 | NA | NA | NA | 8733 |
Mine | FASTQ | 50 | 340189 | NA | NA | NA | 8733 |
LongReadSum | BAM | 50 | 340189 | 304323 | 50 | 0 | 7415 |
NanoPlot | BAM | 106 |
493817 | 314962 | NA | NA | 6398 |
Mine | BAM | 50 primary, 106 primary + secondary,supplementary |
340189 primary, 493817 primary + secondary,supplementary |
161334 primary, 314962 primary + secondary,supplementary (via samtools) |
50 | 0 | 8733 primary, 6398 primary + secondary,supplementary |
PycoQC | BAM + sequencing_summary.txt | 50 | 340189 | 161334 | 50 | 0 | Basecalls = 8730 Alignments = 3500 |
MinIONQC | sequencing_summary.txt | 50 | 340200 | NA | NA | NA | 8733 |
Implement an action that will test basic statistics: Base count, read count, and N50.