bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments
Other
61 stars 9 forks source link

Repeat Interruptions #28

Open stefandiederich opened 10 months ago

stefandiederich commented 10 months ago

Hi,

I was wondering if stragler is able to deal with repeat interruption. For some diseases we can see interruptions of the repeats. Like for Huntington, where the normal repeat CAGCAGCAG... can be interrupted by CAA. Is there a way to pass this to stragler?

Bests Stefan

readmanchiu commented 10 months ago

I guess you are interested in running Straglr in targeted mode i.e. passing the locus info to --loci In this case, you can specify the motif as "CA*" for the Huntington locus. By doing this, it will try to use Python's regular expression instead of tandem repeat finder (TRF) to find the motif. You can also try just specifying "CAG" as the motif, the default TRF parameters may be lenient enough to allow the repeat with interruptions to be captured. The exact motif detected by TRF, which may incorporate the interruption, is reported in the TSV output.

MaestSi commented 10 months ago

Dear @readmanchiu , I have a very similar issue. I am running Straglr and I am trying to count the number of "CAG" repeats in each read. In particular, I specified "CAG" as the repeat motif in the --loci bed file. When at least one of the two alleles has a high number of "CAG" copies, the reported repeat_unit is "CAG" as expected. However, for short interrupted alleles, I am getting a different motif as the repeat_unit, i.e. a longer motif incorporating the "CAG" and the interruption. Based on TRF temporary files, it seems to me that "CAG" motif is evaluated too, but the one reported has a higher score. Is there a way for me to force the repeat_unit to "CAG"? I was thinking of the following strategies. 1- Use some kind of regular expression in the --loci bed file to restrict the motif to CAG. In this case, would "CA*" be suitable? 2- Since the repeat interruption is at a known coordinate, I may consider restricting the region in the --loci bed file to the genomic region not including the CAA interruption, and manually adding the number of copies afterwards. 3- Use genome-scan mode and set --max_str_len = 3.

Which of the above strategies would you advice? Thanks, SM

readmanchiu commented 10 months ago

Thanks for trying Straglr, @MaestSi It has been my assumption that users would want to be notified about interruptions (like the first user) so I thought Straglr reporting these impure motifs is a helpful thing. Specifying CA* as the motif actually serves what the first user wants, which is to include interruptions or unexpected slight changes in motifs (although he has replied whether it works for him or not). In your case, it seems like you only care about the number of CAG copies. It may be tricky because if the repeat tract littered with interruptions actually spans the bounding coordinates, Straglr would automatically pick this over other motifs (e.g. pure CAG), if runs of CAGs only occupy a sub-section of the sequence between the boundaries. Yes, the third strategy of setting --max_str_len 3 may work. You may also try playing with the TRF parameters like make it more stringent such as --trf_args 2 7 7 80 10 50 500 (the parameters on the TRF manpage) to see if it works. Feel free to paste , if possible, the repeat sequence you observed if none of the solutions work. I will see if there is any other ways to just squeeze out the CAG copy numbers.

MaestSi commented 10 months ago

Hi, I just did a quick test with Straglr 1.4.1, specifying --trf_args 2 7 7 80 10 50 500 parameter and using CA* as the motif in the --loci bed file. However, the program stopped with the following error:

Traceback (most recent call last):
  File "/opt/conda/envs/straglr/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/envs/straglr/lib/python3.10/site-packages/multiprocess/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/opt/conda/envs/straglr/lib/python3.10/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/workdir/straglr/src/tre.py", line 1189, in get_alleles
    variants.extend(self.extract_alleles_regex(generic,
  File "/workdir/straglr/src/tre.py", line 757, in extract_alleles_regex
    rpos = read_seq.find(matched_seq)
AttributeError: 'NoneType' object has no attribute 'find'

Do you know if there is a way to solve this? P.s.: I used both different parameters for TRF and the regex in the motif because I would like Straglr to be able to count all CAG and CAA repeats I have in my region, but I would not like it to extend further. Thanks, SM

readmanchiu commented 10 months ago

did you run minimap2 with soft-clipping (-Y) enabled? seems like Straglr was trying to extract the potential repeat sequence from the bam file but failed. You should trying testing out the trf_args separately because Stralgr either uses TRF or regex (which is indicated by the * in the specified motif) to determine the repeat

MaestSi commented 10 months ago

Dear @readmanchiu, actually I forgot to run minimap2 with -Y parameter, thank you for pointing this out. I was then able to run straglr exploiting the regex for specifying the repeat motif and it worked as expected, thank you also for this suggestion. I just noticed what may be a small bug: in case the reference sequence has two homologous regions, and one read has a primary alignment on one of the two, and a supplementary alignment on the other region, straglr may not report the chrom in the region where the primary alignment is, but where the supplementary alignment is instead. This may not be the desired behaviour. For example, I had to do post-run filtering based on the bam file. Thanks again, SM

readmanchiu commented 10 months ago

if you don't mind posting the SAM records of the 2 alignments, it may help me understand why the supplementary alignment was picked over the primary one.

stefandiederich commented 10 months ago

Hi @readmanchiu

sorry for my late response. I tried the solution with regex and it seems to work for HTT. I am now facing an other problem with the loci file.

There is a repeat on chr1:1371179-1371198 in VWA1 (GGCGCGGAGC). When I define exactly this region in the loci bed file I am getting a different repeat as output:

#chrom  start   end repeat_unit allele1:size    allele1:copy_number allele1:support allele2:size    allele2:copy_number allele2:support
chr1    1371179 1371198 GCGGCGG 16.099999999999998  2.3 2   -   -   -

When I define this loci with al little padding e.g. chr1:1371170-1371250 I am getting the correct repeat:

#chrom  start   end repeat_unit allele1:size    allele1:copy_number allele1:support allele2:size    allele2:copy_number allele2:support
chr1    1371170 1371250 GGCGCGGAGC  21.0    2.1 2   -   -   -

The command I used was: straglr.py 0091-22.bam /raid/Referenzgenome/HG19_PAR/HG19.karyo.PAR.fasta 0091-22.rep --loci /media/Data_SRV3/Straglr_Repeats.korrigiert.sorted.bed --nprocs 50 --min_ins_size 10

Do I always have to put a little padding arround the repeat regions defined in loci file? Or am I doing something wrong?

Bests Stefan

MaestSi commented 10 months ago

Hi @readmanchiu , actually I realised every time a read has multiple alignments, all of them are reported in Straglr output file, while for my use case it would be better to report only the primary alignment. I mean, soft-clipped sequences, in this context, are usually STR which may align to many other genomic regions, but I don't care about their genomic location, as I only want their sequence (i.e. the soft-clipped of the primary alignment) to be used once.

P.s.: similarly to what @stefandiederich is experiencing, I am also getting slightly different repeat counts (e.g. 6 instead of 7) in case I also include a couple of flanking bases in the bed file or not. Are there any guidances or criteria for this? Thanks, SM

readmanchiu commented 10 months ago

Hi @stefandiederich I don't think you need to always pad your coordinate to get the desired result. You happened to get your "desired" motif when you add some padding as there is some low-complexity GC's surrounding your target region. The fact the reads are noisy, and there is low-complexity regions flanking your target locus may not always give you the correct genotype if you provide your coordinates. You can again check the temporary TRF outputs to see how the repeats differed when you gave the 2 different coordinates. It may actually be more helpful if you post the TSV outputs as you can see the actual motif detected(chosen) for each read.

readmanchiu commented 10 months ago

@MaestSi, one read should only be used once as support for one locus despite it has multiple alignments. The same read may be used as support for more than 1 locus though. But if you notice a read that is used as support for loci on different chromosomes, then that shouldn't happen. I'll need some data to debug that. Like my response to @stefandiederich, flanking low-complexity sequences may affect genotyping results.