Implement basic statistics testing action

jonperdomo commented 2 years ago

Implement an action that will test basic statistics: Base count, read count, and N50.

jonperdomo commented 2 years ago

Good reference for Python CI workflow: Github Docs

jonperdomo commented 2 years ago

LongReadSum, NanoPlot, and my own MATLAB implementation yield identical results for N50 using sample input FASTA files (individual and combined). These will be used for testing the FASTA output:	File	N50
all_chr.hap1.cns.fa	23758872
all_chr.hap2.cns.fa	21856627
Both files	23544988

jonperdomo commented 2 years ago

Results from our 50-read sample dataset (Mine = My own MATLAB N50 implementation)

Tool	File	N50
LongReadSum	FASTQ	8731
NanoPlot	FASTQ	8733
Mine	FASTQ	8733
LongReadSum	BAM	7415
NanoPlot	BAM	6398
Mine	BAM	6398
PycoQC	BAM + sequencing_summary.txt	Basecalls = 8730 Alignments = 3500
MinIONQC	sequencing_summary.txt	8733

kaichop commented 2 years ago

Also check number of alignment (mapped and unmapped), total bases, total mapped bases. N50 is a value that is not too informative by itself; it is important in genome assembly, but not in characterizing sequenced reads. You can extract fastq/fasta file from BAM file as well using bedtools, which should help diagnose the discrepancies.

On Tue, Jun 14, 2022 at 4:41 PM Jonathan Perdomo @.***> wrote:

Results from our 50-read sample dataset (Mine = My own MATLAB N50 implementation) Tool File N50 LongReadSum FASTQ 8731 NanoPlot FASTQ 8733 Mine FASTQ 8733 LongReadSum BAM 7415 NanoPlot BAM 6398 Mine BAM 6398 PycoQC BAM + sequencing_summary.txt Basecalls = 8730 Alignments = 3500 MinIONQC sequencing_summary.txt 8733

— Reply to this email directly, view it on GitHub https://github.com/WGLab/LongReadSum/issues/9#issuecomment-1155691310, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNG3OBV3WI4VXD6W373MCTVPDU5XANCNFSM5XEWL7DA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

jonperdomo commented 2 years ago

Results from our 50-read sample dataset (Mine = My own MATLAB N50 implementation)

Tool	File	Read count	Total bases	Total mapped bases	Number of aligned reads	Number of unaligned reads	N50
LongReadSum	FASTQ	50	340189	NA	NA	NA	*8731*
NanoPlot	FASTQ	50	340189	NA	NA	NA	8733
Mine	FASTQ	50	340189	NA	NA	NA	8733
LongReadSum	BAM	50	340189	304323	50	0	7415
NanoPlot	BAM	106	493817	314962	NA	NA	6398
Mine	BAM	50 primary, 106 primary + secondary,supplementary	340189 primary, 493817 primary + secondary,supplementary	161334 primary, 314962 primary + secondary,supplementary (via samtools)	50	0	8733 primary, 6398 primary + secondary,supplementary
PycoQC	BAM + sequencing_summary.txt	50	340189	161334	50	0	Basecalls = 8730 Alignments = 3500
MinIONQC	sequencing_summary.txt	50	340200	NA	NA	NA	8733
LongReadSum	FASTQ from BAM (via bedtools)	106	493817	NA	NA	NA	6397
NanoPlot	FASTQ from BAM (via bedtools)	106	493817	NA	NA	NA	6398

kaichop commented 2 years ago

Are the BAM results for NanoPlot directly reported by this software? What is the raw output? (does it offer any clues why read count is 106?)

On Sat, Jun 18, 2022 at 1:54 PM Jonathan Perdomo @.***> wrote:

Results from our 50-read sample dataset (Mine = My own MATLAB N50 implementation) Tool File Read count Total bases Total mapped bases Number of aligned reads Number of unaligned reads N50 LongReadSum FASTQ 50 340189 NA NA NA 8731 NanoPlot FASTQ 50 340189 NA NA NA 8733 Mine FASTQ 8733 LongReadSum BAM 50 340189 304323 50 0 7415 NanoPlot BAM 106 (=primary + supplementary alignments?) 493817 314962 NA NA 6398 Mine BAM 6398 PycoQC BAM + sequencing_summary.txt Basecalls = 8730 Alignments = 3500 MinIONQC sequencing_summary.txt 8733 LongReadSum FASTQ from BAM (via bedtools) NanoPlot FASTQ from BAM (via bedtools) Mine FASTQ from BAM (via bedtools)

— Reply to this email directly, view it on GitHub https://github.com/WGLab/LongReadSum/issues/9#issuecomment-1159524146, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNG3OEIBBFDMEDXPKDDHO3VPYENTANCNFSM5XEWL7DA . You are receiving this because you commented.Message ID: @.***>

jonperdomo commented 2 years ago

No, I run NanoPlot separately to compare results. This is the raw NanoPlot output for the BAM file:

General summary:
Average percent identity: 87.4 Fraction of bases aligned: 0.6 Mean read length: 4,658.7 Mean read quality: 9.7 Median percent identity: 88.3 Median read length: 3,579.5 Median read quality: 9.2 Number of reads: 106.0 Read length N50: 6,398.0 STDEV read length: 3,829.5 Total bases: 493,817.0 Total bases aligned: 314,962.0 Number, percentage and megabases of reads above quality cutoffs

Q5: 106 (100.0%) 0.5Mb Q7: 99 (93.4%) 0.5Mb Q10: 32 (30.2%) 0.2Mb Q12: 15 (14.2%) 0.1Mb Q15: 5 (4.7%) 0.0Mb Top 5 highest mean basecall quality scores and their read lengths 1: 17.9 (1239; bf3ca0c2-c10a-40f5-bf35-4b6d2ba8479a) 2: 17.4 (3242; ff7bd978-ce99-42fb-a332-9fdf78f90701) 3: 16.9 (3103; e7a151f2-7620-4d0e-9d84-41e8dc4cab5f) 4: 16.5 (6557; e7a151f2-7620-4d0e-9d84-41e8dc4cab5f) 5: 16.4 (7050; ff7bd978-ce99-42fb-a332-9fdf78f90701) Top 5 longest reads and their mean basecall quality score 1: 19696 (7.5; a3e904d3-7f4d-4ba8-bb61-77505f026a1f) 2: 17825 (11.1; cf8eab27-821c-44bd-bfac-023299c37aed) 3: 16511 (9.7; 8631ced5-cfa0-4ff1-af7b-5abba3f96fa9) 4: 15263 (9.2; 2960e123-09ed-4596-ae15-10027f4b375f) 5: 14473 (8.5; ecee2aa6-765e-4ba1-a67f-7130f6e87517)

LongReadSum has more information on read counts for alignments, which indicate that possibly the supplementary aligments are being included in the total read count:

jonperdomo commented 2 years ago

I tested this by filtering out unmapped, secondary, and supplementary alignments via the SAMTools view function. When filtered out, the number of reads output by NanoPlot is correct (=50):

General summary:
Average percent identity: 91.7 Fraction of bases aligned: 0.5 Mean read length: 6,803.8 Mean read quality: 10.1 Median percent identity: 93.2 Median read length: 5,194.0 Median read quality: 9.6 Number of reads: 50.0 Read length N50: 8,733.0 STDEV read length: 4,256.2 Total bases: 340,189.0 Total bases aligned: 161,334.0 Number, percentage and megabases of reads above quality cutoffs

Q5: 50 (100.0%) 0.3Mb Q7: 50 (100.0%) 0.3Mb Q10: 18 (36.0%) 0.1Mb Q12: 7 (14.0%) 0.0Mb Q15: 2 (4.0%) 0.0Mb Top 5 highest mean basecall quality scores and their read lengths 1: 16.5 (6557; e7a151f2-7620-4d0e-9d84-41e8dc4cab5f) 2: 16.4 (7050; ff7bd978-ce99-42fb-a332-9fdf78f90701) 3: 14.3 (6925; e621bda1-0721-402e-9ca4-49cd42ff41cc) 4: 13.9 (5314; bad54463-4c8b-4007-8375-2df0687508f8) 5: 13.6 (3505; 00a29863-c29f-461c-bd56-1943fd2711a8) Top 5 longest reads and their mean basecall quality score 1: 19696 (7.5; a3e904d3-7f4d-4ba8-bb61-77505f026a1f) 2: 17825 (11.1; cf8eab27-821c-44bd-bfac-023299c37aed) 3: 16511 (9.7; 8631ced5-cfa0-4ff1-af7b-5abba3f96fa9) 4: 15263 (9.2; 2960e123-09ed-4596-ae15-10027f4b375f) 5: 14473 (8.5; ecee2aa6-765e-4ba1-a67f-7130f6e87517)

I used LongReadSum to verify that this filtered BAM file only contains the primary alignments:

kaichop commented 2 years ago

okay I see, thank you.

liuqianhn commented 2 years ago

@jonperdomo "I used LongReadSum to verify that this filtered BAM file only contains the primary alignments:" LongReadSum has a filter where you can set other alignments to be false and only the alignments with true will be considered. You can change the program to allow users to specify those filter setting so that you can focus on the primary alignments only.

jonperdomo commented 2 years ago

From this updated table it is clear that most differences in basic output statistics are due to using primary alignments only vs. primary + secondary, supplementary alignments. This should probably be specified in the output. Some remaining differences:

PycoQC seems to also include N50 for only the mapped bases in the reads (=3500), while other tools (including ours) calculate N50 for the whole read (=8730).

LongReadSum has different N50 outputs for BAM (8731) and FASTQ (7415), and number of mapped bases for BAM (304323)

Tool	File	Read count	Total bases	Total mapped bases	Number of aligned reads	Number of unaligned reads	N50
LongReadSum	FASTQ	50	340189	NA	NA	NA	*8731*
NanoPlot	FASTQ	50	340189	NA	NA	NA	8733
Mine	FASTQ	50	340189	NA	NA	NA	8733
LongReadSum	BAM	50	340189	*304323*	50	0	*7415*
NanoPlot	BAM	106	493817	314962	NA	NA	6398
Mine	BAM	50 primary, 106 primary + secondary,supplementary	340189 primary, 493817 primary + secondary,supplementary	161334 primary, 314962 primary + secondary,supplementary (via samtools)	50	0	8733 primary, 6398 primary + secondary,supplementary
PycoQC	BAM + sequencing_summary.txt	50	340189	161334	50	0	Basecalls = 8730 *Alignments = 3500*
MinIONQC	sequencing_summary.txt	50	340200	NA	NA	NA	8733

WGLab / LongReadSum

Implement basic statistics testing action #9