bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments
Other
59 stars 9 forks source link

Discrepancy allele length and sequence #42

Open ljohansson opened 2 months ago

ljohansson commented 2 months ago

Dear @readmanchiu

I am currently using Straglr 1.5.0 and produced the following output for the FGF14 locus. If I am correct there is a discrepancy in the sequence shown in the vcf and the AL/ALR and AC/ACR. If my breakdown is correct then. GT = 0/1 (meaning that there is one REF allele and one ALT allele) DP = 187 AL=33.6/995.8 ALR=13-44/78-1216 AC=11.2/331.9 ACR=4.3-14.7/26.0-405.3 AD=106/72 No ALT_MOTIF (./.)

If I look at the sequences I see reference sequence 50 GAA repeat units and alternative sequence around 11 copies, matching the first allele. Combined with the 0/1 GT this would make a sample with two repeats, one of 11 RU and one of 50 RU. However, I was expecting a sequence of around 332 RU, fiven the AC values. EDIT: could it be because the ref length (50) falls within the range 26.0-405.3? If so, then this is incorrectly so in my opinion. There are only a few reads supporting an intermediate length allelle with a length close to the reference length.

contig=

INFO=

INFO=

INFO=

INFO=

FILTER=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT .

chr13 102161576 . GAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAA AAGAAAGAAGAAGAAGAAGAAGAAAGAAGAAGAA . PASS LOCUS=FGF14;END=102161726;MOTIF=GAA;COPIES=50.0 GT:DP:AL:ALR:AC:ACR:AD:ALT_MOTIF 0/1:187:33.6/995.8:13-44/78-1216:11.2/331.9:4.3-14.7/26.0-405.3:106/72:./.

A slice of the middle of the tsv file also shows the two 11.2 and 331.9 alleles:

chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 30da0ee0-9108-47c3-a4dd-77be7b1e0c35 GAA 328.3 985 8588 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) ba2e5868-6ae0-4136-bf1f-413a26c2ac31 GAA 326.3 979 8554 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 22143951-e646-46cf-8017-7c50e468c9ea GAA 317.7 953 8590 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 24631e93-98a1-4790-8578-f8b84416d19b GAA 301.3 904 8578 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 9828cb5e-226d-4448-a374-358c8f7ab486 GAA 297.7 893 8545 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 1551aae9-d9a3-4d0c-b9bf-681af0af2028 GAA 223.7 671 8588 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) c9886ee4-15ab-430e-8afe-230a6a94be26 GAA 166.3 499 8624 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 9de80113-5231-493f-bd50-9332f42050e3 GAA 102.3 307 8588 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 23150b37-212f-48f7-8718-bc2ade8c062c GAA 74.3 223 8602 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 5ba7e228-b2dc-4854-8057-cfeb2ff59c43 GAA 70.3 211 8583 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 27fd23f4-7a1d-4886-943d-5929ddd83094 GAA 53.7 161 8586 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 26cb932f-71e3-4e80-8c87-9aaa2e578082 GAA 39.0 117 8670 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 498288d8-f4e4-4b2c-8a1d-c21a60a3a3ef GAA 26.0 78 8605 + 331.9 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) c8aaed8f-7d96-472b-8092-b08f23a2ebfc GAA 14.7 44 8494 + 11.2 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) 18a898df-0f87-4326-ab29-22b4335b265e GAA 14.3 43 8518 + 11.2 full chr13 102161576 102161726 AAG chr13:102161576-102161726 187 331.9(72);11.2(106) ca73a40e-fce8-451b-9a0d-751c4b90e897 GAA 13.7 41 6430 - 11.2 full

And the bed results:

chrom start end repeat_unit allele1:size allele1:copy_number allele1:support allele2:size allele2:copy_number allele2:support

chr13 102161576 102161726 AAG 995.6999999999999 331.9 72 33.599999999999994 11.2 106

These are the 26.0 copy read details (not sure if they help):

Read name = 498288d8-f4e4-4b2c-8a1d-c21a60a3a3ef Read length = 10.909bp Flags = 0 Mapping = Primary @ MAPQ 60 Reference span = chr13:102.152.965-102.163.897 (+) = 10.933bp Cigar = 28S44M2D44M1D16M1D22M1D13M3D52M1I78M2D72M1D113M2D68M1I96M1D29M3D98M1D3M2I1M1D5M1I16M1I52M2D118M1D107M1D206M1I738M1D99M3D320M1I137M1I665M1D41M2D375M1D612M1D57M2I4M3D342M1I719M1D172M3D135M1I175M2I292M1D385M1D2M1D13M3D116M1D135M2I6M1I3M1D5M2D49M1D72M1D22M2D25M1I63M2D19M1D16M1I62M1D80M1I11M3I67M1D54M1D7M1D88M2D33M2I51M1I85M1D39M1I44M1D86M1D10M1D253M1D148M2I29M1I8M1I15M1I37M1I146M1D57M1D51M1D11M1D155M1D3M1D23M1D7M1D2M1I76M3I81M2D8M1D72M1D61M1D20M1D58M1D24M1I4M2D40M2D19M1I166M1D55M1D151M1D19M1D177M1D91M1I37M1I118M2I33M1I18M2D1M2D8M1D61M1D49M1D134M1I60M1I101M1D11M1I119M1D167M2D63M2S Clipping = Left 28 soft; Right 2 soft s1 = 6690 s2 = 96 NM = 244 AS = 20350 de = 0.0188 rl = 4406 cm = 1032 nn = 0 tp = P ms = 20383 Hidden tags: MD, RG Location = chr13:102.161.586 Base = G @ QV 29

Alignment start position = chr13:102152965 GATGTGCTTCGTTCAGTTACGTATTGCTCCGTGGCAGGCCCAGGTCCTTGCTAGCATCAGGCTGTGCACAGCAGGTGGAGCTGCTGGGAG....AACTGTGGCTGTCTTGTCACCACTGTATCCCTAGTGTCTGCTGGCTAAAGAGATGAGAAAACAGAAAATATTTCTTTCAGAGAGCAAATGACCGCAATCGTCAGTCAGTGTAAGCTTGCCTTTGCAAATGAAGGAAACTCTTATCTTAGTTGTAAAATATCAATATTCTCTATGCAACCAACCCTTGTGAAGGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGAATCGTTTGGGAGTTCATCGATGAGATGAGATCATCGTGGACGGGACCTGCTTAATTCATCATTGCGTTCCCAATATAGTGCGCCTAGAAGATAATTAAAAAAAAAAAAAAAAACCCTGGAGAATGAGTGAATGACATTGCTCACTATATGGACCTAAATTTAATCTCTTTATCAAGGAAAATTATAACTGTTTTCTTTTTTATCTAA...

In IGV it looks like this image

The subsetted read is the same one as in the normal cram file, but extracted to better see the GAA/GGA pattern.

Can you help me on how to interpret these results?

ljohansson commented 2 months ago

In addition, also in other cases the count and the sequence don't perfectly add up. For instance the second ALT allele in the example below is 771 bp long if you count the A's and G's. However, I would have expected it to be 814 nucleotides. Both lenghts however fall within the range: 738-909. Are sequence and count determined based on different consensus sequences?

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT .

chr13 102161576 . GAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAA AAGAAGAAGAAGAAGAAGAAGAAGAAGAA,GAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAA . PASS LOCUS=FGF14;END=102161726;MOTIF=GAA;COPIES=50.0 GT:DP:AL:ALR:AC:ACR:AD:ALT_MOTIF 1/2:14:30.2/814.0:26-47/738-909:10.1/271.3:8.7-15.7/246.0-303.0:8/5:./.

readmanchiu commented 2 months ago

Thanks for reporting @ljohansson, really appreciate it.

For the first case, there should be 2 ATL alleles, one shorter (11 RU) and one longer (332 RU) than the reference (50 RU). The GT will be "1:2" instead of "0:1". I give a buffer of 10% (of size, not copy number) to determine if an allele is a REF allele. In this case, both alleles should be ALTs, and both sequences should be reported in the ALT field, separated by a comma. In your VCF I only see the short allele (11 RU) in the ALT column, so there is a bug somewhere.

For the second case, the GT field is actually correct (1:2) as neither allele matches the REF, and both short and long sequences are displayed in the ALT field. The ALT sequence reported is actually chosen from one the support reads that matches closest in length (bp) to the assigned genotype. It seems 814 (assigned) and reported length (772) is pretty off. Can you post the TSV output of the second case so I can check if this (that 772 is the closest) is indeed the case.

So after seeing the second case (which is similar to the first case), I don't know why the first case reported the wrong genotype (0:1 instead of 1:2) and missed reporting the longer sequence in the ALT field

readmanchiu commented 2 months ago

after a second look, l figured out the reason. Your guess is right, the script is using the lower bound of the size range (26, and minus 10%) to determine the compare against the reference length. That's why it thinks the second allele is thought to be REF size! It's somewhat unexpected that the 26 allele is somehow clustered with the 3-400 ones, showing the GMM clustering is unreliable when there is a wide spread of numbers. (I guess you are doing a target sequencing experiment with lots of coverage) Anyways I'll change the code to not use the size range from the reads, but just the assigned allele size, which hopefully is more reliable!

ljohansson commented 2 months ago

Hi @readmanchiu: Here are the reads from the tsv for the second (1:2) case. There are five reads supporting the longer allele. The middle one here is 769 bp long and is closest to 814. Probably to complete the pattern and make it dividable by 3 two nucleotides have been added to make it 771. So I guess your reasoning is correct.

chrom start end target_repeat locus coverage genotype read_name actual_repeat copy_number size read_start strand allele read_status

chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) 5b7ec1a6-a214-46ea-bae3-5699eff6a17f GAA 303.0 909 8609 + 271.3 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) 5d578ec4-1067-44da-804b-359a1b509707 GAA 297.3 892 8588 + 271.3 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) 715d673c-7761-4c07-bc5a-92389608dedc GAA 256.3 769 6433 - 271.3 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) af324df4-e5f4-4853-a262-ab5288096de4 GAA 254.0 762 8598 + 271.3 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) a26e6516-1d21-4773-8d34-931a76646b73 GAA 246.0 738 6453 - 271.3 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) efb6742f-6426-4346-ba5e-207326126942 GAA 15.7 47 8641 + 10.1 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) 41f84870-99a0-49dd-add9-db3850e5b615 GAA 9.7 29 8646 + 10.1 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) c8cf9f8f-ad1d-471b-9c68-27a587167471 GAA 9.7 29 8596 + 10.1 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) 0082c565-52ed-4d27-b7bd-5af9d486ee37 GAA 9.7 29 6466 - 10.1 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) 33bed0eb-de21-46cd-9349-9eea116a7056 GAA 9.7 29 6500 - 10.1 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) bd5b1093-4ed9-40fd-8af9-8aff364c67da GAA 9.0 27 6327 - 10.1 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) f15583b7-6207-467d-bade-8df86a3fab78 GAA 8.7 26 8620 + 10.1 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) 8735bd4c-5ccd-4e0f-b080-0a62a323fd04 GAA 8.7 26 6454 - 10.1 full chr13 102161576 102161726 AAG chr13:102161576-102161726 14 271.3(5);10.1(8) 10fbc49c-c744-44f1-91ee-65d933400683 NA NA NA NA NA NA skipped (not_spanning)

ljohansson commented 2 months ago

You are correct that the current experiment is a targeted one. It was a region cut out using Cas9 and enriched and sequenced using a MinION. Therefore coverages are quite high in some cases, although not so much in the second case.

I am not sure what would be a good implementation, but I have understood from a laboratory specialist that expanded STRs can be quite unstable and vary in length within a single person, so I guess a high spread may be present more often. However, in the current use-case given the read sequence, the '26' read seems off as a whole.

the ALT sequence reported is actually chosen from one the support reads that matches closest in length (bp) to the assigned genotype. It seems 814 (assigned) and reported length (772) is pretty off.

Do I understand correctly that it is not a consensus sequence in the VCF, but the sequence of a single selected read, including possible sequence errors unique for that read?

readmanchiu commented 2 months ago

You are right, the ALT sequence is extracted from one of the support read sequences. Yes, there may be sequencing errors. A to-do item would be to generate a consensus sequence.

readmanchiu commented 2 months ago

Hi @ljohansson, I updated the code to use the interquartile range of the support read sizes for comparing against the reference size, in the hope that it will not use the outliers included erroneously from clustering. If you have time, you can check out the new code (just clone the repo, I haven't made a release yet) to see if it make a difference for your case.

ljohansson commented 2 months ago

Hi @readmanchiu Thank you for the update. Unfortunately I will only be able to test the new code in a few weeks time. I guess this should do the trick for my current case.

ljohansson commented 6 days ago

Dear @readmanchiu, My apologies for my delayed answer. I am currently looking at straglr 1.5.1. Unfortunately there may still be issues with clustering. I have not yet looked at the case discussed above, however I have an issue with another case (tsv pasted below). This case has two alles with lenth ~350 and ~300 (from a three bp RU perspective, half when counting the 6bp RU (~175/~150). However, both of these alleles are clustered together and two outlier reads are considered to be the second allele. Notice the drop in length from 173.0 to 157.3 in the middle. The two outlier reads seem to be part of the longer allele when looking at the cram (see highlighted reads in the IGV viewer image below).

chrom start end target_repeat locus coverage genotype read_name actual_repeat copy_number size read_start strand allele read_status

chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 540b1f42-ed44-4b60-a667-1f8c724f4679 GAA 181.8 1091 2093 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) b8b2ed2d-5364-4247-b7d7-09ed978cb821 GAA 177.8 1067 2136 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) d965901d-c34c-41e2-a600-2917459bd505 GAA 177.5 1065 4728 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 1983c93c-e46f-454e-bb3a-d5a053fc42ec GAA 177.3 1064 5621 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) ecc85cf8-957c-4a16-b13d-894c41d9f008 GAA 176.7 1060 2943 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 962c8aa5-39de-4060-aa86-9b794a6e3d65 GAA 176.2 1057 2739 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 1c6ddfe2-7aaa-442c-9514-8a9f061b0e78 GAA 176.2 1057 787 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) cd5b99b2-4e89-474b-8bf3-163f8ef3df72 GAA 175.2 1051 2848 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) d73aaad9-7e55-413b-b93c-3b1747b638a4 GAA 174.8 1049 3185 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 9b5dda7d-7d37-4e41-b98c-8d9acdc18d1a GAA 174.5 1047 10615 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 9c3dbeff-9f93-425b-892e-4d5f40b13378 GAA 174.2 1045 3681 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 277ce4ad-ed17-4885-bce8-5b8ee70b1b26 GAA 173.5 1041 1147 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 93ddcd18-78f6-4a47-8031-5d4d42ca976a GAA 173.3 1040 2633 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 88e6a1ac-f776-49d1-a0ad-078ab472a1fa GAA 173.0 1038 1983 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 97edd085-0994-4399-80cc-561c3090f3a0 GAA 157.3 944 14373 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) b2444446-a028-4332-ad57-64875a011177 GAA 153.2 919 8795 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 77809626-4322-4e73-8146-546e0492108f GAA 152.5 915 11167 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) f0e30354-d34e-4529-b9fd-1aa648efe4b7 GAA 151.8 911 6103 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 62b4b3f2-ec9d-4af8-b7c7-a46a396d700f GAA 151.5 909 1938 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) c0fb6793-9542-4a51-9d62-a7641b03de27 GAA 151.3 908 10782 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 8c8ac11f-a91e-4371-848a-e7a525944346 GAA 151.3 908 3694 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 41e0dd98-32f9-40bb-bef5-4995e8ba30fe GAA 151.2 907 3750 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) e7efa79a-33ef-49b6-8ae4-f42e31898029 GAA 151.2 907 1893 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 669e46c2-f87e-4a12-b54a-a83d74bfb47c GAA 150.8 905 1024 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 893a89a1-c7d9-46ea-9a0e-7f02ee3870c2 GAA 150.5 903 13230 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) f2e9714d-54a7-49c1-85ea-2ba58a763c05 GAA 150.3 902 6439 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 9c594897-9034-42a7-a74e-2c42723ec8ea GAA 150.2 901 779 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 11a1d541-7a21-4b8e-9a66-30f9eb28bd8d GAA 150.2 901 4128 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) a6fe89fc-e425-4dbc-8fe9-ac694aff2281 GAA 150.2 901 5341 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) b640b3aa-103a-40b9-97e4-b67018b7a443 GAA 149.7 898 19098 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) b97f7599-f2c6-4420-b14b-c75ec7caf292 GAA 144.7 868 13254 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 59336285-0a15-4d9a-87ab-e4283d753db1 GAA 136.3 818 271 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) f848eb94-acdf-49d7-8cfc-f5b8abc63f26 GAA 61.2 367 8327 + 47.7 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 55a86e01-a02c-4de7-8aa5-d6986aa891c3 GAA 34.2 205 1696 + 47.7 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) d14f109a-46b7-4d40-b826-4beff0f5401b NA NA NA NA NA NA skipped (not_spanning) chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) a4033f85-1556-4f5a-b04f-afbc756855b4 NA NA NA NA NA NA skipped (not_spanning) chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32);47.7(2) 275875eb-c893-4aa1-84e0-14f3991ee694 NA NA NA NA NA NA skipped (not_spanning)

image

readmanchiu commented 6 days ago

I think the 2 outlier reads threw the clustering off. Since you have a pretty deep sequencing depth, If you increase the parameter --min_cluster_size to 5 or something higher than 2 may help get rid of the outliers.

ljohansson commented 5 days ago

Changing the min_cluster_size to 5 indeed got rid of the two outliers. However, the two alleles still got combined into a single allele (see tsv output below). In this case this is important, because one of the two has the benign GAAGGA pattern and the other one the pathogenic GAAGAA pattern. As there are slightly more reads of the GAAGAA pattern (the ~151/302 RU allele), this one represents in the vcf file, but it could have been a GAAGGA read as well if the balance was slightly different.

A second question: in our panel, the coverage is not equally high for all different repeats in the catalogue. Is there also a minimum percentage of usable reads that can be set? A min_cluster_percentage 0.2 for instance? This could be very useful since the lower covered regions might become problematic with too high min_cluster_sizes.

On a side note: interestingly, the two outlier allels have aberrant patterns with the blue allele (from the picture in the previous post) showing a long stretch of GGAGGA (shortly interupted with a few AGGG) and the red one has a AGGGAGGG pattern. I guess these abberant sequences were not counted. Both alleles had a 'short' GAAGAA sequence at the end, which is probably what was counted as the repeat for those sequences..

chrom start end target_repeat locus coverage genotype read_name actual_repeat copy_number size read_start strand allele read_status

chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 540b1f42-ed44-4b60-a667-1f8c724f4679 GAA 181.8 1091 2093 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) b8b2ed2d-5364-4247-b7d7-09ed978cb821 GAA 177.8 1067 2136 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) d965901d-c34c-41e2-a600-2917459bd505 GAA 177.5 1065 4728 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 1983c93c-e46f-454e-bb3a-d5a053fc42ec GAA 177.3 1064 5621 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) ecc85cf8-957c-4a16-b13d-894c41d9f008 GAA 176.7 1060 2943 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 962c8aa5-39de-4060-aa86-9b794a6e3d65 GAA 176.2 1057 2739 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 1c6ddfe2-7aaa-442c-9514-8a9f061b0e78 GAA 176.2 1057 787 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) cd5b99b2-4e89-474b-8bf3-163f8ef3df72 GAA 175.2 1051 2848 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) d73aaad9-7e55-413b-b93c-3b1747b638a4 GAA 174.8 1049 3185 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 9b5dda7d-7d37-4e41-b98c-8d9acdc18d1a GAA 174.5 1047 10615 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 9c3dbeff-9f93-425b-892e-4d5f40b13378 GAA 174.2 1045 3681 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 277ce4ad-ed17-4885-bce8-5b8ee70b1b26 GAA 173.5 1041 1147 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 93ddcd18-78f6-4a47-8031-5d4d42ca976a GAA 173.3 1040 2633 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 88e6a1ac-f776-49d1-a0ad-078ab472a1fa GAA 173.0 1038 1983 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 97edd085-0994-4399-80cc-561c3090f3a0 GAA 157.3 944 14373 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) b2444446-a028-4332-ad57-64875a011177 GAA 153.2 919 8795 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 77809626-4322-4e73-8146-546e0492108f GAA 152.5 915 11167 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) f0e30354-d34e-4529-b9fd-1aa648efe4b7 GAA 151.8 911 6103 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 62b4b3f2-ec9d-4af8-b7c7-a46a396d700f GAA 151.5 909 1938 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) c0fb6793-9542-4a51-9d62-a7641b03de27 GAA 151.3 908 10782 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 8c8ac11f-a91e-4371-848a-e7a525944346 GAA 151.3 908 3694 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 41e0dd98-32f9-40bb-bef5-4995e8ba30fe GAA 151.2 907 3750 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) e7efa79a-33ef-49b6-8ae4-f42e31898029 GAA 151.2 907 1893 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 669e46c2-f87e-4a12-b54a-a83d74bfb47c GAA 150.8 905 1024 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 893a89a1-c7d9-46ea-9a0e-7f02ee3870c2 GAA 150.5 903 13230 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) f2e9714d-54a7-49c1-85ea-2ba58a763c05 GAA 150.3 902 6439 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 9c594897-9034-42a7-a74e-2c42723ec8ea GAA 150.2 901 779 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 11a1d541-7a21-4b8e-9a66-30f9eb28bd8d GAA 150.2 901 4128 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) a6fe89fc-e425-4dbc-8fe9-ac694aff2281 GAA 150.2 901 5341 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) b640b3aa-103a-40b9-97e4-b67018b7a443 GAA 149.7 898 19098 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) b97f7599-f2c6-4420-b14b-c75ec7caf292 GAA 144.7 868 13254 + 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 59336285-0a15-4d9a-87ab-e4283d753db1 GAA 136.3 818 271 - 161.4 full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) f848eb94-acdf-49d7-8cfc-f5b8abc63f26 GAA 61.2 367 8327 + NA full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 55a86e01-a02c-4de7-8aa5-d6986aa891c3 GAA 34.2 205 1696 + NA full chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) d14f109a-46b7-4d40-b826-4beff0f5401b NA NA NA NA NA NA skipped (not_spanning) chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) a4033f85-1556-4f5a-b04f-afbc756855b4 NA NA NA NA NA NA skipped (not_spanning) chr13 102161576 102161726 GAAGGA chr13:102161576-102161726 40 161.4(32) 275875eb-c893-4aa1-84e0-14f3991ee694 NA NA NA NA NA NA skipped (not_spanning)

readmanchiu commented 4 days ago

Increasing the min_cluster_size should screen out the outliers, unfortunately there is a hidden and hard-coded threshold in Straglr that essentially merges clusters that are <= 50 bp apart. I guess I set this back in the days when the reads were way more noisy and I am not confident in reporting alleles with that much (little) size difference. I guess I will have to re-work (or remove) this merging step as clearly clustering works in this case as it should but my post-processing ruined it. Using a fractional min_cluster_sizer is a great suggestion, your point on uneven coverage is well-taken.

An unrelated point ... I don't the any GAAGGA in the actual motif (column 9, after the read name) reported but GAAGGA is reported as the consensus motif column 4. I'm puzzled - I thought this is a bug but you confirmed your actually said the larger allele is GAAGGA whereas the smaller (benign) allele is GAAGAA. Can you confirm this is true?

ljohansson commented 1 day ago

Hi, Thank you for picking up on our remarks. I believe you are on the right track with the FGF14 motifs. The regular allele has a GAA pattern, which is benign if relatively short (<250 GAA RU). If the GAA repeat is longer, then it is considered to be pathogenic. However, there is an alternative GAAGGA pattern for the same repeat that is considered not te be pathogenic, which will probably be counted as half the number of RU, given the longer RU. Coming back to your question. Indeed both alleles have (in general) a different RU (GAAGAA vs GAAGGA), resulting in one pathogenic allele and one benign allele. In our catalogue we give GAA as repeat pattern.

I am not sure if I am oversimplifying things, but possibly the merging of clusters can be dependent on the variation between read lenghts? Here, within both clusters the variation looks relatively low and when taking a Z=3 threshold, there is no overlap. In case of noisier data, the overlap would become apparent, leading to the current situation of merging the clusters. <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

cluster1 | average cluster1 | stdev cluster 1 | av-3*stdev -- | -- | -- | -- reads 1-14 | 175,8571 | 2,340517 | 168,8356 cluster 2 | average cluster2 | stdev cluster 2 | av+3*stdev reads15-32 | 150,2333 | 4,183441 | 162,7837

In this context we may have another interesting case of a male patiënt with fragile X. Here, if the CGG repeat is between 44-200 there is a premutation and above 200 it becomes a full mutation. In our case the patient has a range in repeat length from 147 to 397 repeat units, leading to an average of 223 RU, which is pathogenic. I have learnt from a specialist that it is possible such patterns of different lengths can occur between cells because of instability of the repeat. Even though in this case the result of the average cluster size is 'pathogenic', if all reads were e.g. 50 bp shorter, it would fall below the threshold. The range is shown in the vcf, so it can be picked up, but it may be an interesting use-case to consider.