PacificBiosciences / pbampliconclustering

exploratory scripts for clustering ccs amplicon data
Other
10 stars 4 forks source link

Read number processing limitation? #2

Open markb729 opened 3 years ago

markb729 commented 3 years ago

Is there a limit to how many reads that can be processed in ClusterAmplicons.py? Without much testing, it appears that if I exceed 50K reads (or so; no finer resolution tested) the following invocation

python3 pbampliconclustering/ClusterAmplicons.py cluster -e 0.001 -r 'genomicTarget:2182-2293' --extractReference genome_reference.fa -p subsample100K_extracted -b subsample100K.bam -X -F

will fail throwing

Reading Sequence Traceback (most recent call last): File "pbampliconclustering/ClusterAmplicons.py", line 133, in args.func(args) File "pbampliconclustering/src/main.py", line 28, in main data = loadKmers(inFile,args.minQV,args.kmer, File "pbampliconclustering/src/model/kmer.py", line 106, in loadKmers seqDB = pd.DataFrame([{'qname' :rec.query_name, File "pbampliconclustering/src/model/kmer.py", line 106, in seqDB = pd.DataFrame([{'qname' :rec.query_name, File "pbampliconclustering/src/utils/extract.py", line 33, in extractRegion yield SimpleRecord(name,subseq,start,end,qual,rec.flag) File "pbampliconclustering/src/utils/extract.py", line 91, in init self.rq = self._getRQ(self.query_qualities) File "pbampliconclustering/src/utils/extract.py", line 96, in _getRQ return mean([1-10**(-q/10) for q in phred]) File "/mnt/software/anaconda/envs/PacbioLAA/lib/python3.8/statistics.py", line 315, in mean raise StatisticsError('mean requires at least one data point') statistics.StatisticsError: mean requires at least one data point

Alignment files with few that 50K reads (or so) complete without error. Clean python 3.8 env with mappy 2.17 matplotlib 3.3.4 numpy 1.19.5 pandas 1.2.1 pysam 0.16.0.1 scikit-learn 0.24.1 seaborn 0.11.1

jrharting commented 3 years ago

This error does not look like an issue with the number of reads, but it is curious that it only happens at >50k reads. It looks to me like the script is having trouble finding quality values for some record(s). Does your input bam have quality values? Is it made up of multiple concated bam files, some made from fasta inputs -- this might cause apparent size issues if all of the records up to some point do have quality values, and then there is one or more without beyond that.

if this is a not-found quality issue, then I can fix that relatively easily.

markb729 commented 3 years ago

I too agree that the read number should not be the issue. Quality values are present in the input bam of ccs reads (ascii_base=33 and rq tag). This is what I've been doing using demuxed ccs read output:

pbmm2 align --sort --preset CCS genome_reference.fa m64120_210106_232659.ccs.demux.bcAd1118T--bcAd1118T.bam aligned.bam

example of alignment:

m64120_210106232659/133109/ccs 0 genomicTarget 1 60 2447S644=16X1101=1X425=2X92=1X171= * 0 0
\<GATC>.snip \.snip_ bx:B:i,17,17 ec:f:17.1517 np:i:15 rq:f:0.999792 sn:B:f,10.5961,15.0609,4.20448,7.52336 we:i:5392514 ws:i:0 zm:i:133109 qs:i:17 qe:i:4917bc:B:S,71,71 bq:i:100 cx:i:12 bl:Z:AGATACACATGATACTT bt:Z:AAGTATCATGTGTATCT ql:Z:~}~~~ qt:Z:~~~~~ RG:Z:f1e8afc2/71--71 mc:f:99.1847 rm:i:1 SA:Z:genomicTarget ,2,+,2447M2453S,60,21;>

The expected quality (and other) tags are there, including the rq tag (bolded above), for all input reads. After filtering for primary alignments only (I considered that secondary and/or supplementary alignments may be causing problems), rq tags match the expected number,

samtools view aligned.bam | grep -o -P 'rq:f:.{0,8}' | wc -l

Even so, ClusterAmplicons.py fails using 537,771 (primary) alignments but when subsampled to 10 percent alignments

samtools view -s 0.10 -o aligned_0.01.bam aligned.bam

it works perfectly well. Subsampling to 20 percent alignments results in failure, throwing the statistics.StatisticsError: mean requires at least one data point.

I checked your example file bc1019--bc1019.mapped.bam and found many fewer Pacbio-specific tags - could this be a parsing issue? Admittedly, probably not because it works with smaller numbers of alignments.

I appreciate your attention to this issue.

jrharting commented 3 years ago

Could it be a read with a deletion spanning your region?

markb729 commented 3 years ago

It seems to be something quite similar to a deletion - alignments that are shorter than the defined region, when used, including the 100bp flanking sequences, create an unhandled exception.

samtools view aligned.bam | awk '{print length($10)}' | sort -un | more

gives me some really short - but valid - ccs read alignments,

288 368 371 584 687 757 822 834 852 881 940 948 ...

one of which is shorter than my defined region (111 bp + 200 bp flanks). But when filtered, for example, requiring full alignments defined by the known amplicon length,

samtools view -h aligned.bam | awk 'length($10) > 2399 || $1 ~ /^@/' | samtools view -O BAM -o aligned_2.4K.bam -

2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415

all is well

Reading Sequence Trimming kmers top,bottom [0.1, 0.9] Exporting kmer counts Normalizing data Reducing Features with pca Clustering 496372 reads with dbscan eps=0.001 min_samples=5 metric=euclidean

Perhaps add a boundary check or indicate that short alignments will cause problems.

I see that pbaa may be an in-the-works replacement but your version is very feature-rich, especially the selectable region option. Nice work.

jrharting commented 3 years ago

Thanks for digging deeper -- I pushed a change that should at least prevent the error. For now, I have just added an error handler to catch anything thrown while trying to access query values of a subset of a read. You'll get a warning and the error printed to the screen, but the read will otherwise be skipped without crashing the run. Apologies, I didn't have time to test this, so let me know if I need to take another look.

You could also continue to use what you have but add the option --minQV 0 which will just sidestep the function where this error occurs. The way I have this written, the default minQuality is set to hifi (0.99), and when you also apply the extraction protocol to snip out the target region for analysis, it applies that filter to the subsequence quality as well, which has to be calculated from the quality string. I'm not sure that's the best default now that you have brought this case up -- it might be better to set default to 0. What do you think?

Thanks for using this tool, I'm happy it is helpfull! Yep, much of our development effort lately has been on pbaa, and we're hoping to eventually cover as many long amplicon applications with that tool as possible. I wrote these as a sort of workbench for testing ideas. If you find features here that are particularly useful to you and are currently not in pbaa, please let me know and we can put in in our feature request list. We try to be as responsive as we can!