Open markb729 opened 3 years ago
This error does not look like an issue with the number of reads, but it is curious that it only happens at >50k reads. It looks to me like the script is having trouble finding quality values for some record(s). Does your input bam have quality values? Is it made up of multiple concated bam files, some made from fasta inputs -- this might cause apparent size issues if all of the records up to some point do have quality values, and then there is one or more without beyond that.
if this is a not-found quality issue, then I can fix that relatively easily.
I too agree that the read number should not be the issue. Quality values are present in the input bam of ccs reads (ascii_base=33 and rq tag). This is what I've been doing using demuxed ccs read output:
pbmm2 align --sort --preset CCS genome_reference.fa m64120_210106_232659.ccs.demux.bcAd1118T--bcAd1118T.bam aligned.bam
example of alignment:
m64120_210106232659/133109/ccs 0 genomicTarget 1 60 2447S644=16X1101=1X425=2X92=1X171= * 0 0
\<GATC>.snip \.snip_ bx:B:i,17,17 ec:f:17.1517 np:i:15 rq:f:0.999792 sn:B:f,10.5961,15.0609,4.20448,7.52336 we:i:5392514 ws:i:0 zm:i:133109 qs:i:17 qe:i:4917bc:B:S,71,71 bq:i:100 cx:i:12 bl:Z:AGATACACATGATACTT bt:Z:AAGTATCATGTGTATCT ql:Z:~} ~~~qt:Z:~~~~~ RG:Z:f1e8afc2/71--71 mc:f:99.1847 rm:i:1 SA:Z:genomicTarget ,2,+,2447M2453S,60,21;>
The expected quality (and other) tags are there, including the rq tag (bolded above), for all input reads. After filtering for primary alignments only (I considered that secondary and/or supplementary alignments may be causing problems), rq tags match the expected number,
samtools view aligned.bam | grep -o -P 'rq:f:.{0,8}' | wc -l
Even so, ClusterAmplicons.py fails using 537,771 (primary) alignments but when subsampled to 10 percent alignments
samtools view -s 0.10 -o aligned_0.01.bam aligned.bam
it works perfectly well. Subsampling to 20 percent alignments results in failure, throwing the statistics.StatisticsError: mean requires at least one data point.
I checked your example file bc1019--bc1019.mapped.bam and found many fewer Pacbio-specific tags - could this be a parsing issue? Admittedly, probably not because it works with smaller numbers of alignments.
I appreciate your attention to this issue.
Could it be a read with a deletion spanning your region?
It seems to be something quite similar to a deletion - alignments that are shorter than the defined region, when used, including the 100bp flanking sequences, create an unhandled exception.
samtools view aligned.bam | awk '{print length($10)}' | sort -un | more
gives me some really short - but valid - ccs read alignments,
288 368 371 584 687 757 822 834 852 881 940 948 ...
one of which is shorter than my defined region (111 bp + 200 bp flanks). But when filtered, for example, requiring full alignments defined by the known amplicon length,
samtools view -h aligned.bam | awk 'length($10) > 2399 || $1 ~ /^@/' | samtools view -O BAM -o aligned_2.4K.bam -
2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415
all is well
Reading Sequence Trimming kmers top,bottom [0.1, 0.9] Exporting kmer counts Normalizing data Reducing Features with pca Clustering 496372 reads with dbscan eps=0.001 min_samples=5 metric=euclidean
Perhaps add a boundary check or indicate that short alignments will cause problems.
I see that pbaa may be an in-the-works replacement but your version is very feature-rich, especially the selectable region option. Nice work.
Thanks for digging deeper -- I pushed a change that should at least prevent the error. For now, I have just added an error handler to catch anything thrown while trying to access query values of a subset of a read. You'll get a warning and the error printed to the screen, but the read will otherwise be skipped without crashing the run. Apologies, I didn't have time to test this, so let me know if I need to take another look.
You could also continue to use what you have but add the option --minQV 0
which will just sidestep the function where this error occurs. The way I have this written, the default minQuality is set to hifi (0.99), and when you also apply the extraction protocol to snip out the target region for analysis, it applies that filter to the subsequence quality as well, which has to be calculated from the quality string. I'm not sure that's the best default now that you have brought this case up -- it might be better to set default to 0. What do you think?
Thanks for using this tool, I'm happy it is helpfull! Yep, much of our development effort lately has been on pbaa, and we're hoping to eventually cover as many long amplicon applications with that tool as possible. I wrote these as a sort of workbench for testing ideas. If you find features here that are particularly useful to you and are currently not in pbaa, please let me know and we can put in in our feature request list. We try to be as responsive as we can!
Is there a limit to how many reads that can be processed in ClusterAmplicons.py? Without much testing, it appears that if I exceed 50K reads (or so; no finer resolution tested) the following invocation
python3 pbampliconclustering/ClusterAmplicons.py cluster -e 0.001 -r 'genomicTarget:2182-2293' --extractReference genome_reference.fa -p subsample100K_extracted -b subsample100K.bam -X -F
will fail throwing
Alignment files with few that 50K reads (or so) complete without error. Clean python 3.8 env with mappy 2.17 matplotlib 3.3.4 numpy 1.19.5 pandas 1.2.1 pysam 0.16.0.1 scikit-learn 0.24.1 seaborn 0.11.1