EHDN truncated output - Githubissues

jonnyelse commented 3 years ago

Hi, I have been trying to use expansion hunter denovo to analysis some WGSs in cram format and I am getting motif/locus output files that appear truncated. For instance both my motif and locus files only contain repeat units that start with the nucleotide ‘A’. I am wondering if this is a problem you have ever encounter before and if you have any advice for debugging this problem? I have tried running the program on a more powerful cluster and also experimented with converting the files from cram to bam and trying different MAPQ flag values but this has had little effect. The program is outputting no error messages and appears to be running to completion. I have also run the same analysis on the original expansion hunter software using your recently released genome wide variant catalog, which is producing more reasonalbe results and detecting lots of repeats starting with letters other than 'A'. Below is an example of the output motif file for chr22 from one of my sample (I have changed some of the numbers to protect donor confidentiality). The file only has 24 repeats all starting with 'A'. Thanks very much for any help you can provide.

motif	num_paired_irrs	norm_num_paired_irrs
AAAATACACCACACAC	2	2.11
AAAGGGAGGAGAGAGAG	2	1.11
AAAGGGAGGAGAGAGAGAG	3	3.44
AACCCACTCCCAAGATAACG	1	2.11
AACCCACTCCCAAGATAACT	3	1.22
AACCCT	67	76.77
AACCTCCATAACCTCCCT	14	14.45
AACCTCCCT	9	11.01
AACGTACACACAGGGG	1	1.22
AAGGAGGG	1	1.11
AATGG	10	13.37
AATGGAATGGAATGGAGTGG	2	2.33
ACACACACGCCCCGC	1	1.11
ACAGCGTCCTCACCGCCCCC	2	2.22
ACAGCGTCCTCACCGCCCCG	1	1.11
ACAGCGTCCTCACCGCGTCC	2	2.22
ACAGCGTCCTCACCTCCCCG	2	3.33
ACAGCGTCCTCATCGCGTCC	1	1.11
ACC	1	2.22
AG	2	1.33
AGAGGAGGAGG	1	2.11
AGAGGGATGG	1	1.11
AGGAGGAGGCGGCGG	1	2.22
ATCC	6	8.88

egor-dolzhenko commented 3 years ago

Thanks for the question!

The repeat units reported by EHDN always begin with As and Cs. This is because the program does not determine the exact reference orientation of the repeat and all repeat units that are equivalent up to permutation of symbols or reverse complement are considered equivalent. For example, these repeats are considered equivalent:

CAGCAGCAGCAGCAG – repeat unit CAG
AGCAGCAGCAGCAGC – repeat unit AGC
GCAGCAGCAGCAGCA – repeat unit GCA
CTGCTGCTGCTGCTG – repeat unit CTG (reverse complement if CAG)
TGCTGCTGCTGCTGC – repeat unit TGC
GCTGCTGCTGCTGCT – repeat unit GCT

So EHDN considers CAG, AGC, GCA, CTG, TGC, GCT to be equivalent. It picks a single representative repeat unit among all of these: it chooses the smallest unit in lexicographical order (in this case it is AGC). This is why all units it reports begin with either A or C.

Could you please confirm that you don't have any motifs starting with the letter C in the full output of your program? If not, what kind of data are you working with? Do you know if your data is PCR-free or PCR+?

jonnyelse commented 3 years ago

Hi Egor, Thanks so much for your reply. Looking over several motif files i can see very few motifs starting with C, maybe less than 1 in 100. I believe the sequences were generated by PCR+, do you think that could be the cause? FYI I have also looked over some more locus files and they do seem to have some that start with C, so I am seeing the lack of C motifs much more in the motif files rather than the locus files. Thanks

egor-dolzhenko commented 3 years ago

Yes, this makes sense. Motifs starting with a C are 100% GC (since if motif contains an A or T then there is an equivalent motif starting with an A as outlined above). And long, 100% GC repeats (such as those reported in the motif files) may not amplify well in PCR+ data.

Illumina / ExpansionHunterDenovo

EHDN truncated output #47