WGLab / LongReadSum

MIT License
13 stars 2 forks source link

Implement basic statistics testing action #9

Closed jonperdomo closed 2 years ago

jonperdomo commented 2 years ago

Implement an action that will test basic statistics: Base count, read count, and N50.

jonperdomo commented 2 years ago

Good reference for Python CI workflow: Github Docs

jonperdomo commented 2 years ago
LongReadSum, NanoPlot, and my own MATLAB implementation yield identical results for N50 using sample input FASTA files (individual and combined). These will be used for testing the FASTA output: File N50
all_chr.hap1.cns.fa 23758872
all_chr.hap2.cns.fa 21856627
Both files 23544988
jonperdomo commented 2 years ago

Results from our 50-read sample dataset (Mine = My own MATLAB N50 implementation)

Tool File N50
LongReadSum FASTQ 8731
NanoPlot FASTQ 8733
Mine FASTQ 8733
LongReadSum BAM 7415
NanoPlot BAM 6398
Mine BAM 6398
PycoQC BAM + sequencing_summary.txt Basecalls = 8730
Alignments = 3500
MinIONQC sequencing_summary.txt 8733
kaichop commented 2 years ago

Also check number of alignment (mapped and unmapped), total bases, total mapped bases. N50 is a value that is not too informative by itself; it is important in genome assembly, but not in characterizing sequenced reads. You can extract fastq/fasta file from BAM file as well using bedtools, which should help diagnose the discrepancies.

On Tue, Jun 14, 2022 at 4:41 PM Jonathan Perdomo @.***> wrote:

Results from our 50-read sample dataset (Mine = My own MATLAB N50 implementation) Tool File N50 LongReadSum FASTQ 8731 NanoPlot FASTQ 8733 Mine FASTQ 8733 LongReadSum BAM 7415 NanoPlot BAM 6398 Mine BAM 6398 PycoQC BAM + sequencing_summary.txt Basecalls = 8730 Alignments = 3500 MinIONQC sequencing_summary.txt 8733

— Reply to this email directly, view it on GitHub https://github.com/WGLab/LongReadSum/issues/9#issuecomment-1155691310, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNG3OBV3WI4VXD6W373MCTVPDU5XANCNFSM5XEWL7DA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

jonperdomo commented 2 years ago

Results from our 50-read sample dataset (Mine = My own MATLAB N50 implementation)

Tool File Read count Total bases Total mapped bases Number of aligned reads Number of unaligned reads N50
LongReadSum FASTQ 50 340189 NA NA NA 8731
NanoPlot FASTQ 50 340189 NA NA NA 8733
Mine FASTQ 50 340189 NA NA NA 8733
LongReadSum BAM 50 340189 304323 50 0 7415
NanoPlot BAM 106
493817 314962 NA NA 6398
Mine BAM 50 primary,
106 primary + secondary,supplementary
340189 primary,
493817 primary + secondary,supplementary
161334 primary,
314962 primary + secondary,supplementary
(via samtools)
50 0 8733 primary,
6398 primary + secondary,supplementary
PycoQC BAM + sequencing_summary.txt 50 340189 161334 50 0 Basecalls = 8730
Alignments = 3500
MinIONQC sequencing_summary.txt 50 340200 NA NA NA 8733
LongReadSum FASTQ from BAM
(via bedtools)
106 493817 NA NA NA 6397
NanoPlot FASTQ from BAM
(via bedtools)
106 493817 NA NA NA 6398
kaichop commented 2 years ago

Are the BAM results for NanoPlot directly reported by this software? What is the raw output? (does it offer any clues why read count is 106?)

On Sat, Jun 18, 2022 at 1:54 PM Jonathan Perdomo @.***> wrote:

Results from our 50-read sample dataset (Mine = My own MATLAB N50 implementation) Tool File Read count Total bases Total mapped bases Number of aligned reads Number of unaligned reads N50 LongReadSum FASTQ 50 340189 NA NA NA 8731 NanoPlot FASTQ 50 340189 NA NA NA 8733 Mine FASTQ 8733 LongReadSum BAM 50 340189 304323 50 0 7415 NanoPlot BAM 106 (=primary + supplementary alignments?) 493817 314962 NA NA 6398 Mine BAM 6398 PycoQC BAM + sequencing_summary.txt Basecalls = 8730 Alignments = 3500 MinIONQC sequencing_summary.txt 8733 LongReadSum FASTQ from BAM (via bedtools) NanoPlot FASTQ from BAM (via bedtools) Mine FASTQ from BAM (via bedtools)

— Reply to this email directly, view it on GitHub https://github.com/WGLab/LongReadSum/issues/9#issuecomment-1159524146, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNG3OEIBBFDMEDXPKDDHO3VPYENTANCNFSM5XEWL7DA . You are receiving this because you commented.Message ID: @.***>

jonperdomo commented 2 years ago

No, I run NanoPlot separately to compare results. This is the raw NanoPlot output for the BAM file:

General summary:
Average percent identity: 87.4 Fraction of bases aligned: 0.6 Mean read length: 4,658.7 Mean read quality: 9.7 Median percent identity: 88.3 Median read length: 3,579.5 Median read quality: 9.2 Number of reads: 106.0 Read length N50: 6,398.0 STDEV read length: 3,829.5 Total bases: 493,817.0 Total bases aligned: 314,962.0 Number, percentage and megabases of reads above quality cutoffs

Q5: 106 (100.0%) 0.5Mb Q7: 99 (93.4%) 0.5Mb Q10: 32 (30.2%) 0.2Mb Q12: 15 (14.2%) 0.1Mb Q15: 5 (4.7%) 0.0Mb Top 5 highest mean basecall quality scores and their read lengths 1: 17.9 (1239; bf3ca0c2-c10a-40f5-bf35-4b6d2ba8479a) 2: 17.4 (3242; ff7bd978-ce99-42fb-a332-9fdf78f90701) 3: 16.9 (3103; e7a151f2-7620-4d0e-9d84-41e8dc4cab5f) 4: 16.5 (6557; e7a151f2-7620-4d0e-9d84-41e8dc4cab5f) 5: 16.4 (7050; ff7bd978-ce99-42fb-a332-9fdf78f90701) Top 5 longest reads and their mean basecall quality score 1: 19696 (7.5; a3e904d3-7f4d-4ba8-bb61-77505f026a1f) 2: 17825 (11.1; cf8eab27-821c-44bd-bfac-023299c37aed) 3: 16511 (9.7; 8631ced5-cfa0-4ff1-af7b-5abba3f96fa9) 4: 15263 (9.2; 2960e123-09ed-4596-ae15-10027f4b375f) 5: 14473 (8.5; ecee2aa6-765e-4ba1-a67f-7130f6e87517)

LongReadSum has more information on read counts for alignments, which indicate that possibly the supplementary aligments are being included in the total read count: image

jonperdomo commented 2 years ago

I tested this by filtering out unmapped, secondary, and supplementary alignments via the SAMTools view function. When filtered out, the number of reads output by NanoPlot is correct (=50):

General summary:
Average percent identity: 91.7 Fraction of bases aligned: 0.5 Mean read length: 6,803.8 Mean read quality: 10.1 Median percent identity: 93.2 Median read length: 5,194.0 Median read quality: 9.6 Number of reads: 50.0 Read length N50: 8,733.0 STDEV read length: 4,256.2 Total bases: 340,189.0 Total bases aligned: 161,334.0 Number, percentage and megabases of reads above quality cutoffs

Q5: 50 (100.0%) 0.3Mb Q7: 50 (100.0%) 0.3Mb Q10: 18 (36.0%) 0.1Mb Q12: 7 (14.0%) 0.0Mb Q15: 2 (4.0%) 0.0Mb Top 5 highest mean basecall quality scores and their read lengths 1: 16.5 (6557; e7a151f2-7620-4d0e-9d84-41e8dc4cab5f) 2: 16.4 (7050; ff7bd978-ce99-42fb-a332-9fdf78f90701) 3: 14.3 (6925; e621bda1-0721-402e-9ca4-49cd42ff41cc) 4: 13.9 (5314; bad54463-4c8b-4007-8375-2df0687508f8) 5: 13.6 (3505; 00a29863-c29f-461c-bd56-1943fd2711a8) Top 5 longest reads and their mean basecall quality score 1: 19696 (7.5; a3e904d3-7f4d-4ba8-bb61-77505f026a1f) 2: 17825 (11.1; cf8eab27-821c-44bd-bfac-023299c37aed) 3: 16511 (9.7; 8631ced5-cfa0-4ff1-af7b-5abba3f96fa9) 4: 15263 (9.2; 2960e123-09ed-4596-ae15-10027f4b375f) 5: 14473 (8.5; ecee2aa6-765e-4ba1-a67f-7130f6e87517)

I used LongReadSum to verify that this filtered BAM file only contains the primary alignments: image

kaichop commented 2 years ago

okay I see, thank you.

liuqianhn commented 2 years ago

@jonperdomo "I used LongReadSum to verify that this filtered BAM file only contains the primary alignments:" LongReadSum has a filter where you can set other alignments to be false and only the alignments with true will be considered. You can change the program to allow users to specify those filter setting so that you can focus on the primary alignments only.

jonperdomo commented 2 years ago

From this updated table it is clear that most differences in basic output statistics are due to using primary alignments only vs. primary + secondary, supplementary alignments. This should probably be specified in the output. Some remaining differences:

Tool File Read count Total bases Total mapped bases Number of aligned reads Number of unaligned reads N50
LongReadSum FASTQ 50 340189 NA NA NA 8731
NanoPlot FASTQ 50 340189 NA NA NA 8733
Mine FASTQ 50 340189 NA NA NA 8733
LongReadSum BAM 50 340189 304323 50 0 7415
NanoPlot BAM 106
493817 314962 NA NA 6398
Mine BAM 50 primary,
106 primary + secondary,supplementary
340189 primary,
493817 primary + secondary,supplementary
161334 primary,
314962 primary + secondary,supplementary
(via samtools)
50 0 8733 primary,
6398 primary + secondary,supplementary
PycoQC BAM + sequencing_summary.txt 50 340189 161334 50 0 Basecalls = 8730
Alignments = 3500
MinIONQC sequencing_summary.txt 50 340200 NA NA NA 8733