bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments

Genotyping amplicons? #30

Open adbeggs opened 4 months ago

adbeggs commented 4 months ago

Hi,

Thanks for a great tool. I am experimenting with genotyping amplicon data from Nanopore sequencing. I can get Straglr to call certain STRs but not others, and I wonder whether I need to do something differently. I have tried changing the motifs, the positions of the sequences, etc., but I have not yet been successful. Do you have any suggestions, please?

I've attached an example file aligned to GRCh38, and I'm running straglr (latest version) as follows:

straglr.py barcode32.new.sorted.bam genome.fa batch1 --genotype_in_size --min_support 1 --loci strtest.bed --max_str_len 1000 --max_num_clusters 2 --nprocs 8

And I get:

#chrom  start   end repeat_unit allele1:size    allele1:copy_number allele1:support allele2:size    allele2:copy_number allele2:support
chr1    204156332   204156364   ACAG    31.0    7.8 8   -   -   -
chr11   2171086 2171116 TGAA    32.1    8.0 13  -   -   -
chr5    150076322   150076397   CTAT    68.4    17.1    139 -   -   -
chrX    134481492   134481561   TCTA    72.5    18.1    2   -   -   -
chrX    67545317    67545419    GCA 94.4    31.5    7   -   -   -

str.tar.gz [Uploading str.tar.gz…]()

readmanchiu commented 4 months ago

Thanks for trying Straglr. This is my first time seeing amplicon data, and the main issue is that each read can cover more than one locus. This violates an assumption I apply when checking the alignment CIGAR string: that each read (or the majority of it) covers only one locus. A lot of noise would creep in if this screen on alignments were not applied. A separate targeted amplicon mode will need to be implemented to handle this data type.
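For anyone wanting to check whether their own amplicon reads span more than one target locus, here is a minimal sketch in plain Python. It is not Straglr's actual screening code. The function names, the tuple layout for alignments, and the example coordinates are all hypothetical, and it assumes 0-based, half-open coordinates for both alignments and BED-style loci.

```python
import re
from collections import defaultdict

# CIGAR operations that consume reference bases, per the SAM specification.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")


def reference_span(pos, cigar):
    """Return the (start, end) reference interval covered by one alignment."""
    length = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return pos, pos + length


def loci_overlapped(chrom, pos, cigar, loci):
    """List the loci (chrom, start, end) that one alignment overlaps."""
    start, end = reference_span(pos, cigar)
    return [(c, s, e) for c, s, e in loci if c == chrom and s < end and e > start]


def loci_per_read(alignments, loci):
    """Map each read name to the set of loci hit by any of its alignments.

    `alignments` is an iterable of (read_name, chrom, pos, cigar) tuples,
    a hypothetical layout chosen for this sketch.
    """
    hits = defaultdict(set)
    for name, chrom, pos, cigar in alignments:
        hits[name].update(loci_overlapped(chrom, pos, cigar, loci))
    return dict(hits)


# Made-up coordinates, purely for illustration.
example_loci = [("chr1", 120, 150), ("chr1", 180, 220), ("chr2", 120, 150)]
example_alignments = [
    ("read1", "chr1", 100, "100M"),
    ("read1", "chr2", 100, "50M"),
    ("read2", "chr1", 300, "50M"),
]
multi_locus_reads = [
    name
    for name, found in loci_per_read(example_alignments, example_loci).items()
    if len(found) > 1
]
```

Reads that end up in `multi_locus_reads` are the ones that would break a one-read-one-locus assumption. With a real BAM file, the tuples could be filled from an alignment parser instead of being typed by hand.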