I am currently comparing different STR calling tools to see how well they make calls using ONT's latest Q20+ data for HG002: https://labs.epi2me.io/giab-2023.05/
I am using the HG002 T2T assembly v0.7, together with IGV, to establish a truth set: https://github.com/marbl/HG002
I noticed that while NanoRepeat works quite well in most cases, it often calls just one STR allele when the two repeat counts of a heterozygous call differ by only one or two repeats. For example, for the GIPC2 CCG repeat:
$ more pass.cram.chr19-14496041-14496074-CCG.summary.txt
Summary_file=pass.cram.chr19-14496041-14496074-CCG.summary.txt Repeat_Region=chr19-14496041-14496074-CCG Method=GMM Num_Alleles=1 Num_Removed_Reads=0 Allele1_Num_Reads=22 Allele1_Repeat_Size=12
$ more pass.cram.chr19-14496041-14496074-CCG.repeat_size.txt
Repeat_Region=chr19-14496041-14496074-CCG
Read_Name Repeat_Size
5a621e4a-464b-4397-bb52-dd6868344ed8 12.0
32c058b8-463a-4963-98de-ed151215c897 12.0
5e640f40-af78-4c37-b75f-2a489e6d287d 10.0
07ba1eca-eed6-41b7-9657-041cf9478b30 10.0
09feeb1a-bacc-43a5-9246-51adbec0d4ad 12.0
f7e95eed-e085-480e-badf-2ff3dafb75c6 12.0
ce13d292-e06c-4216-88e2-01900709ffb4 12.0
f2c19bf6-8875-45c6-bfef-c1cc49c74e82 12.0
7993c22d-00f7-4da4-a39e-dd7836d214f4 10.0
36b5fa47-1ae5-4c95-9d7b-2f4209d2ce57 12.0
1e96d19d-0281-4225-9395-bf9a001d1f18 10.0
d2a355a4-e30e-4f83-96b6-731467528d6c 10.0
be856e49-b1dd-4b2a-bc05-d129a5dfbff9 10.0
622e9d4c-3ab6-4560-9396-65cb9bd73696 10.0
1daea057-60bb-449a-8027-26a8e536909b 10.0
8505aa3e-d5b3-4544-b98c-e492122ccd27 12.0
1c702495-d4f0-4f41-9622-fb9555950713 12.0
1655f006-34f8-43ed-8800-eef18aa9a22e 12.0
4dc8881f-bf77-4c2e-94c7-fcb6955872c5 12.0
b8c0f0d9-377e-4f31-9440-46a2926d4ed4 12.0
c0b6597d-e401-4c5d-8eec-a7e49a43dc39 11.0
8646a71c-6f96-49c3-8250-e20186acfbdf 10.0
It can be seen from the reads that the call should be 10/12, but I suppose NanoRepeat plays it safe and calls 12/12 most of the time.
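To make the read support explicit, here is a minimal Python sketch that tallies the per-read repeat sizes listed above. The function names `parse_repeat_sizes` and `tally_repeat_sizes` are my own, not part of NanoRepeat, and the parser assumes the `repeat_size.txt` layout is exactly what the excerpt shows (two header lines, then one `read_name size` pair per line).

```python
from collections import Counter


def parse_repeat_sizes(text):
    """Return {read_name: repeat_size} from repeat_size.txt-style text.

    Assumed layout (taken from the excerpt in this post): a
    'Repeat_Region=...' line, a 'Read_Name Repeat_Size' header,
    then one whitespace-separated 'read_name size' pair per line.
    """
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the two header lines.
        if len(fields) != 2 or fields[0] == "Read_Name":
            continue
        sizes[fields[0]] = float(fields[1])
    return sizes


def tally_repeat_sizes(sizes):
    """Count reads per rounded repeat size, most frequent first."""
    return Counter(round(size) for size in sizes).most_common()


# Repeat sizes of the 22 GIPC2 reads listed above, in file order.
GIPC2_SIZES = [12, 12, 10, 10, 12, 12, 12, 12, 10, 12, 10,
               10, 10, 10, 10, 12, 12, 12, 12, 12, 11, 10]

print(tally_repeat_sizes(GIPC2_SIZES))  # [(12, 12), (10, 9), (11, 1)]
```

So 12 reads support 12 repeats and 9 reads support 10 repeats, which looks like two alleles rather than one.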
I think many of these calls could be resolved if the reads were haplotagged.
A haplotagged BAM can be created by running "whatshap haplotag". The idea is to use heterozygous sites to assign each read to one of the two possible haplotypes of a chromosome. whatshap adds an HP tag with a value of 1 or 2 to each aligned read it can assign. In the GIPC2 example above, you can clearly see that the 10-repeat and 12-repeat reads are assigned to different haplotypes.
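As a rough illustration of what I have in mind (not how NanoRepeat works today), the sketch below groups per-read repeat sizes by HP value and takes the most common size in each group. The function `call_by_haplotype`, the read names `r1` to `r6`, and their HP assignments are all hypothetical; the real HP values would have to come from the haplotagged BAM.

```python
from collections import Counter, defaultdict


def call_by_haplotype(repeat_sizes, haplotypes):
    """Return {haplotype: most common rounded repeat size}.

    repeat_sizes: {read_name: repeat_size}
    haplotypes:   {read_name: 1 or 2}, i.e. the HP tag value;
                  reads without an HP value are ignored.
    """
    per_hp = defaultdict(Counter)
    for read, size in repeat_sizes.items():
        hp = haplotypes.get(read)
        if hp in (1, 2):
            per_hp[hp][round(size)] += 1
    return {hp: counts.most_common(1)[0][0]
            for hp, counts in sorted(per_hp.items())}


# Hypothetical reads and HP assignments, for illustration only.
sizes = {"r1": 12.0, "r2": 12.0, "r3": 11.0,
         "r4": 10.0, "r5": 10.0, "r6": 10.0}
hp_tags = {"r1": 1, "r2": 1, "r3": 1, "r4": 2, "r5": 2}  # r6 is untagged

print(call_by_haplotype(sizes, hp_tags))  # {1: 12, 2: 10}
```

With pysam, the HP values could presumably be collected via `read.has_tag("HP")` and `read.get_tag("HP")` while iterating over the region, but I have not tested that against NanoRepeat's own read handling.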
Therefore, I think it would be great if NanoRepeat could also support haplotagged BAM files so that it can make better calls in these situations.