bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments
Other
50 stars 9 forks source link

Loci with multiple motifs at same position #38

Open christopher-schroeder opened 1 month ago

christopher-schroeder commented 1 month ago

I have called a bunch of samples with straglr and merged the (chrom, start, end, motif) beds into one unified bed. I would like to use this merged bed to genotype each sample again. Sometimes different motifs occur at the same position, but straglr only calles one of them multiple Times. Example:

Here is a position of the merged bed file with multiple motifs at the same position.

$ cat results/str_candidates/merged.bed | grep 56828456
Y   56828456    56836900    CATTC
Y   56828456    56836900    TCCAT

I use this bed for genotyping. What I see:

$ cat results/str/818-13.bed | grep 56828456
Y   56828456    56836900    TCCAT   9286.5  1857.3  55  2907.0  581.4   13
Y   56828456    56836900    TCCAT   9286.5  1857.3  55  2907.0  581.4   13

What I am expecting:

$ cat results/str/818-13.bed | grep 56828456
Y   56828456    56836900    CATTC   ... ... ... ... ... ...
Y   56828456    56836900    TCCAT   9286.5  1857.3  55  2907.0  581.4   13
christopher-schroeder commented 1 month ago

Now I recognized that the motifs are basically the same but only shifted.

I provide a more complicated example:

$ cat results/str_candidates/merged.bed | grep 8500613
1   8500613 8502586 GCCGCCCCGTCCGGGAGGGAGGGGCCCCTCTGCCCG
1   8500613 8502586 GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC
1   8500613 8502586 GCCTCTGCCCGGCCGCCACCCGTCTGGGAAGTGAGGAGC
1   8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT
1   8500613 8502586 CCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGTCAG
1   8500613 8502586 CCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTCAG
1   8500613 8502586 CCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTCAG
1   8500613 8502586 GCCGCCCCGTCCGGGAAGTGAGGGGCCCCTCTGCCCG
cat results/str/818-14.bed | grep 8500613
1   8500613 8502586 GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC    2448.0  61.2    4   932.0   23.3    5
1   8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT  1420.0  28.4    8   305.0   6.1 1
1   8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT  1295.0  25.9    9   -   -   -
1   8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT  1295.0  25.9    9   -   -   -
1   8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT  1295.0  25.9    9   -   -   -
1   8500613 8502586 GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC    2448.0  61.2    4   976.0   24.4    5
1   8500613 8502586 GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC    2448.0  61.2    4   952.0   23.8    5
1   8500613 8502586 GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC    2448.0  61.2    4   952.0   23.8    5

Especially line 2 and 3 of the output confuse me.

1   8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT  1420.0  28.4    8   305.0   6.1 1
1   8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT  1295.0  25.9    9   -   -   -

The same motif on the same locus but with different genotypes for one sample.

readmanchiu commented 1 month ago

when you specify the the loci in the bed file, there should one only be one locus (position) per line. If you expect multiple motifs, you should use some regex for STRs for the last (4th) column in the bed file. For example, if you expect CGG and CAG to show up in the same locus, you can use "CG". If you have no clue what to expect, you can use some very generic pattern, like "GC*", which may work for your VNTR example.

christopher-schroeder commented 1 month ago

So you are only allowed to put in a a position once and if there might be different motifs at a single position I have to use regexp? Why would you implement it in such way? We are also searching for pathogenic motif changes in STRs so especially loci where samples show different motifs are of interest. So I know exactly what motif I am searching for and I also know the position. But sometimes I have several motifs and would like to know if this occurs in a sample and if so how much copies.

The usecase is that if you detect an STR, you would like to know if the detected copy number is very much outside the norm. For that you have to know the norm and genotype the other (control) samples to get this information.

readmanchiu commented 1 month ago

Thanks, @christopher-schroeder for the info. The idea of the input bed file is that each line corresponds to a unique locus; however, flexibility is allowed at the motif (4th column). So you you basically put "a" if you only are sure there is an A in the repeat. This becomes silly so I would probably make a new release such that the motif will be optional - empty column allowed. There will be no screening against a pre-defined motif. I think by removing this screening against pre-defined motifs will allow everything (regardless how they different from the reference) that is detected from the sequences to be reported.