Some Question about EHdn..

a7420174 commented 3 years ago

Hello!

To detect ASD-related tandem repeat loci in my data, I'm refering to Trost et al. (2020). While I run an outlier analysis in EHdn, I have some questions.

I knew that depth-normalized counts are calculated as the raw count * 40 / the average read depth of the sample and the normalized counts are used to filter tandem repeat regions in Trost et al. (counts ≥ 2). I could understand the normalization of the counts by dividing the avg counts because of comparison between samples, but I had little understanding of the numerator. Is there any reason for setting a specific value (40)?
I got the counts of anchored IIRs in our cohort and an another cohort(1KGP) from EHdn. After filtering out some tandem repeats where too many samples have no count, I ran PCA of the count of anchored IIRs to check any ancestry or batch effect.

Here is the PCA result. It seems that they are split into our cohort(ASD; Korean) and 1KG samples. I checked the versions of gatk, bwa-mem, etc., but they were similar. Is there anything else to consider? Or maybe the count of anchored IIRs is not suitable for PCA?

I viewed tandem repeats manually from IGV to filter out false positives in the case of rare tandem repeats. Strangely, I found that the raw count of some anchored IIRs in EHdn profile seems different from the number of the reads in IGV.

igv_ACTCCCACTGCCTCCCT_2

For example, this image is an IGV screenshot showing one anchored IIR checked.

contig  start   end     motif   num_anc_irrs    norm_num_anc_irrs       het_str_size
chr4    143052786       143052787       ACTCCCACTGCCTCCCT       2       1.53    9

And here is the EHdn result corresponding to the read above. I think two reads should appear in IGV. I wonder if I misread the result.

Thanks in advance.

egor-dolzhenko commented 3 years ago

Thanks for the questions Jaehyun!

No reason to use 40 other than convenience. We are essentially normalizing depth to 40x, which is pretty common for WGS data (if the actual depth is 40x than all IRR counts are unchanged after normalization.)
Hmm.. Do you know what sequencing instrument / library prep was used to generate your data? One way to check if your results are due to batch effects is to perform a complementary analysis by running (regular) ExpansionHunter on a genome-wide repeat catalog that we just released and then run your PCA analysis on the resulting repeat genotypes.
Unfortunately IGV is not always suitable for confirming EHdn results because IRRs often misalign or don't get aligned at all. We have developed a dedicated read visualization tool for (regular) EH called REViewer. In the future, we will try to extend support to EHdn as well.

Best wishes, Egor

a7420174 commented 3 years ago

Oh I found out the library prep of our cohort (pcr-amplified) was different from the one of 1KGP (pcr-free). And Hiseq X and Novaseq 6000 were used in our cohort and 1KGP separately. Do you think this makes a difference? I learned Trost et al. (2020) also used pcr-based samples. Are there any methods to correct this batch effect..? Plus, is it not recommended to run ExpansionHunter for PCR-amplified WGS samples?

Thanks, Jaehyun

egor-dolzhenko commented 3 years ago

Hi Jaehyun. Apologies for a late reply.

Yes, comparing PCR-amplified (PCR+) and PCR-free (PCR-) data can be challenging because of the amplification biases in PCR+ data. I think restricting your analysis to PCR+ data only may be a good option. We will try to implement better amplification bias detection / correction in the future versions of EHdn that might make PCR+ vs PCR- comparisons possible.

And yes, we don't officially support using ExpansionHunter with PCR+ data. However, multiple research groups analyzed PCR+ data with ExpansionHunter and obtained good results. (By performing additional analysis of the STR genotypes reported by ExpansionHunter to assess their accuracy.) I think it could still be useful to analyze the genome-wide STR catalog in some of your samples using ExpansionHunter and then inspect the results with REViewer.

a7420174 commented 3 years ago

Thanks for your reply! It helps a lot.

bjtrost commented 3 years ago

Hi Jaehyun,

I just wanted to clarify about the use/non-use of samples with PCR+ library prep in Trost et al 2020. We did some initial QC comparing different sequencing platforms and library prep methods (Extended Data Fig. 2a), and based on these results, we excluded PCR+ samples from the main statistical analysis.

Cheers,

Brett

a7420174 commented 3 years ago

Appreciate for your feedback, Brett.

Can I ask you one more question? I think checking the size of tandem repeats unique in our ASD cohort is meaningful. So, in addition to a genome-wide repeat catalog you recommended, I have a plan to use a custom catalog based on my EHdn outputs. I'm making a catalog using Tandem Repeat Finder outputs to match loci in EHdn str_profile with reference genome tandem repeat coordinates within 500bp because the coordinates in EHdn, I understand, represent the location of anchored IIRs. Is It okay to make a catalog like this? And it seems that EHdn outputs motifs regardless of the reference genome orientation. Is it right? then how did you address this issue in Trost et al. (2020)?

Thanks in advance. JaeHyun

bjtrost commented 3 years ago

Hi JaeHyun,

I am not completely sure I understand your question, but let me try to answer. Are you wanting to take loci detected by EHdn and genotype them using ExpansionHunter? If so, then using in your variant-catalog file the Tandem Repeats Finder coordinates that correspond to the EHdn coordinates is the right thing to do. Regarding the motif, EHdn reports the canonicalized motif (the lowest-alphabetically representation of the motif under shift and reverse-complement operations), but for the ExpansionHunter variant-catalog, you will need to use the version of the motif that is in reference orientation. I believe that Tandem Repeats Finder always gives the motif in reference orientation, but please double-check this.

Please let me know if this answers your question.

Cheers,

Brett

a7420174 commented 3 years ago

Thank you for quick reply!

Yes, you got it. But owing to my poor computing skills, I have some difficulty matching with motif under shift or reverse-complement operations.. Please can you share codes or recommend python package about this? It'll really helpful for me.

Thanks, JaeHyun

bjtrost commented 3 years ago

Hi,

Here is the Python code to do that. The ComputeCanonicalRepeatUnit function takes as input the sequence of a motif and returns the canonicalized version of the motif. For example, ComputeCanonicalRepeatUnit("CTG") == "AGC". Full disclosure, to derive this code I merely translated the equivalent routines in Egor's EHdn C++ code to Python. :)

Cheers,

Brett

def MinimialUnitUnderShift(unit):
    minimal_unit = unit
    double_unit = unit + unit
    for index in range(0, len(unit)):
        current_unit = double_unit[index:index+len(unit)]
        if current_unit < minimal_unit:
            minimal_unit = current_unit
    return minimal_unit

def ComputeCanonicalRepeatUnit(unit):
    minimal_unit = MinimialUnitUnderShift(unit)
    unit_rc = reverse_complement(unit)
    minimal_unit_rc = MinimialUnitUnderShift(unit_rc)

    if minimal_unit_rc < minimal_unit:
        return minimal_unit_rc
    return minimal_unit

a7420174 commented 3 years ago

Thank you so much! It would be a great help.

a7420174 commented 3 years ago

Hi,

I have one more question about the tandem repeats not matched with TRF coordinates. Here is a plot for the distribution of motif length. I matched loci in EHdn str_profile with reference genome tandem repeat coordinates within 1,000bp.

7868413a-3106-49dd-99d3-25dbc3f3c644

There are many VNTRs(longer than 7bp) not matched with TRF coordinates, and I understood that's possible because of the diversity of VNTR sequence composition. But the thing is the repeats of which motif length is 6bp. The percentage of AACCCT is ,surprisingly, 92% in 6bp motifs. But the percentage of AACCCT is not that high in TRF. I knew that these repeats consist of telomeres. I thought maybe the repeats in EHdn results are false positives. What do you think about it? And how did you estimate the size of the repeats not matched with TRF?

Again, thank you for taking the time to help. JaeHyun

egor-dolzhenko commented 2 years ago

Hi JaeHyun,

Sorry for the slow replies. This looks like an interesting observation that would be useful to follow up on.

We are currently working on a tool for automated assessment of EHdn results. I think this tool will be the perfect way to assess your results. Would you be interested in testing out the prototype of this tool? If yes, an early prototype should be ready by late November / early December.

Best wishes, Egor

a7420174 commented 2 years ago

Yes. It would be great if I take part in. I look forward to testing your tool!

Thanks, JaeHyun

Illumina / ExpansionHunterDenovo

Some Question about EHdn.. #46