cgroza / GraffiTE

GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies and/or long reads, and genotypes the discovered polymorphisms in read sets using genome-graphs.
Other
121 stars 6 forks source link

Bug in calculating TE match lengths #43

Closed Han-Cao closed 1 month ago

Han-Cao commented 1 month ago

Hi,

I am trying to use the pipeline to annotate my VCF. It seems there is a bug when calculating the span of TE due to 1-based coordinate.

When calculating span/length of TEs, the start and end position of the match on query sequence are extracted from RepeatMasker/OneCode output, which are 1-based. To calculate the length of the match, we need to do end - start + 1.

I currently find that this affects repmask_vcf line 79

paste -d "\t" <(echo -e "${i}") <(grep -w "${i}" merge.bed | awk '{print ($3-$2)}' | paste -sd+ | bc) <(grep -w "${i}" indels.length) | awk '{print $1"\t"$2"\t"$4"\t"($2/$4)}' >> span

and annotate_vcf.R line 64:

mutate(match_len = qry_end - qry_start) %>%

Please correct me if I was wrong.

Moreover, I am also wondering if the "length" refers to the window a TE span on the SV or the length of SV matched by TE.

For example, the output of one SV is:

# RepeatMasker output
  315   19.2  0.0  0.0          -10001037-10001039.INS.1           2      53   (2965)   +       FRAM    SINE/Alu          115   166    (10)     1  
  385   30.1  1.7  1.5          -10001037-10001039.INS.1         145     194   (2824)   +       SVA_A   Retroposon/SVA     68   225  (1162)     2  
  377   24.0  0.0  0.0          -10001037-10001039.INS.1         195     269   (2749)   C       7SLRNA  srpRNA          (241)    79       5     3  
  385   30.1  1.7  1.5          -10001037-10001039.INS.1         270     294   (2724)   +       SVA_A   Retroposon/SVA    226   312  (1075)     2  
  452   24.4  8.2  1.9          -10001037-10001039.INS.1         310     455   (2563)   +       MIR     SINE/MIR            2   156   (106)     4  
 1312   12.1  0.0  0.0          -10001037-10001039.INS.1         483     655   (2363)   C       AluSg   SINE/Alu         (11)   299     127     5  
   16    0.0  0.0  0.0          -10001037-10001039.INS.1         659     675   (2343)   +       (A)n    Simple_repeat       1    17     (0)     6  
  272   28.2  7.3  4.0          -10001037-10001039.INS.1         690     785   (2233)   C       MLT1J   LTR/ERVL-MaLR   (413)    99       1     7  
  787    7.1  0.0  0.0          -10001037-10001039.INS.1         972    1070   (1948)   C       HY3     scRNA             (2)    99       1     8  
  451   22.6  3.6 10.8          -10001037-10001039.INS.1        1160    1298   (1720)   +       SVA_A   Retroposon/SVA    189   318  (1069)     2  
  353   12.5  1.1 18.7          -10001037-10001039.INS.1        1221    1308   (1710)   C       AluJr   SINE/Alu        (232)    80       6     9 *
  441   28.4 15.1  1.1          -10001037-10001039.INS.1        1309    1658   (1360)   +       L2c     LINE/L2          2989  3387     (0)    10  
  399   28.4  0.0  0.0          -10001037-10001039.INS.1        1734    1835   (1183)   C       SVA_A   Retroposon/SVA (1087)   300     199    11  
 2123   14.5  0.0  0.3          -10001037-10001039.INS.1        1861    2164    (854)   C       AluJo   SINE/Alu          (7)   305       3    12  
 2304    9.2  0.3  0.0          -10001037-10001039.INS.1        2171    2462    (556)   C       AluSg   SINE/Alu         (17)   293       1    13  
  503   26.4  0.8  0.8          -10001037-10001039.INS.1        2495    2620    (398)   +       SVA_A   Retroposon/SVA    209   334  (1053)    14  
  207   11.4 15.9  0.0          -10001037-10001039.INS.1        2869    2912    (106)   +       MIR     SINE/MIR          211   261     (1)    15  
  476   19.2  0.0  0.0          -10001037-10001039.INS.1        2936    3013      (5)   +       FRAM    SINE/Alu            1    78    (98)    16  

# OneCode output
315     19.2    0.0     0.0     -10001037-10001039.INS.1        2       53      52      +       FRAM    SINE/Alu        115     166     (10)    1       1
385/385/451/503 26.484  2.435   5.464   -10001037-10001039.INS.1        145     2620    267     +       SVA_A   Retroposon/SVA  68      334     (1053)  2/2/2/14        4
452     24.4    8.2     1.9     -10001037-10001039.INS.1        310     455     155     +       MIR     SINE/MIR        2       156     (106)   4       1
1312    12.1    0.0     0.0     -10001037-10001039.INS.1        483     655     173     C       AluSg   SINE/Alu        (11)    299     127     5       1
272     28.2    7.3     4.0     -10001037-10001039.INS.1        690     785     99      C       MLT1J   LTR/ERVL-MaLR   (413)   99      1       7       1
353     12.5    1.1     18.7    -10001037-10001039.INS.1        1221    1308    75      C       AluJr   SINE/Alu        (232)   80      6       9       1
441     28.4    15.1    1.1     -10001037-10001039.INS.1        1309    1658    399     +       L2c     LINE/L2 2989    3387    (0)     10      1
399     28.4    0.0     0.0     -10001037-10001039.INS.1        1734    1835    102     C       SVA_A   Retroposon/SVA  (1087)  300     199     11      1
2123    14.5    0.0     0.3     -10001037-10001039.INS.1        1861    2164    303     C       AluJo   SINE/Alu        (7)     305     3       12      1
2304    9.2     0.3     0.0     -10001037-10001039.INS.1        2171    2462    293     C       AluSg   SINE/Alu        (17)    293     1       13      1
207     11.4    15.9    0.0     -10001037-10001039.INS.1        2869    2912    51      +       MIR     SINE/MIR        211     261     (1)     15      1
476     19.2    0.0     0.0     -10001037-10001039.INS.1        2936    3013    78      +       FRAM    SINE/Alu        1       78      (98)    16      1

The SVA_A are mapped to multiple regions of the SV, starting from 145 to 2620. The aligned TE's length is much less than the region between start and end.

Thanks a lot.

clemgoub commented 1 month ago

Hello @Han-Cao,

Yes you are right, this looks like a bug. I agree with you that the coordinates should be corrected as you suggest. I will be fixing it promptly and will let you know here.

Regarding how hit length and overall TE span on SV is calculated:

Let me know if you need further clarification! I will be fixing this bug ASAP.

Cheers,

Clément

clemgoub commented 1 month ago

Dear @Han-Cao,

I have fixed the uses withRepeatMasker's 1-based coordinates so correct lengths can be calculated. Thanks again for pointing out this important point.

Commit: https://github.com/cgroza/GraffiTE/commit/bd2f1dea99f2a3e8b6e70d0d9d8b3953bf0e44a6

In repmask_vcf l.73:

awk 'NR > 3 {print $5"\t"$6-1"\t"$7"\t"$10}' ${REPMASK_ONECODE_OUT} | bedtools merge > merge.bed # add -1 to start to meet .bed format

And in annotate_vcf.R l.24:

qry_start = as.integer(qry_start)-1, # add -1 to be 0-based and calculate length

This should have the same effect than your suggested modifications, but correcting the coordinates earlier in the script. Let me know if this is working as you expect and if further bug persist. Thanks again for your feedback,

Cheers,

Clément

Han-Cao commented 1 month ago

Dear Clément,

Thanks for your quick reply!

Could you clarify more on the TE span of assembled TEs? I am wondering whether it may overestimate the span of the TEs assembled by OneCode. In the above example INS, the following 4 regions aligned by SVA_A is assembled into 1 record:

# RepeatMasker output
          385   30.1    1.7      1.5    -10001037-10001039.INS.1         145     194   (2824)   +       SVA_A   Retroposon/SVA     68   225  (1162)     2  
          385   30.1    1.7      1.5    -10001037-10001039.INS.1         270     294   (2724)   +       SVA_A   Retroposon/SVA    226   312  (1075)     2  
          451   22.6    3.6     10.8    -10001037-10001039.INS.1        1160    1298   (1720)   +       SVA_A   Retroposon/SVA    189   318  (1069)     2   
          503   26.4    0.8      0.8    -10001037-10001039.INS.1        2495    2620    (398)   +       SVA_A   Retroposon/SVA    209   334  (1053)    14  

# OneCode output
385/385/451/503 26.484  2.435   5.464   -10001037-10001039.INS.1        145     2620    267     +       SVA_A   Retroposon/SVA    68    334  (1053)  2/2/2/14      4

For individual matched region, the sum of their q_end - q_start + 1 is 340, but this value for the assembled record is 2476, which is much larger the region matched by SVA_A (340), the length of SVA_A aligned to the SV (267, 8th column in OneCode output), and even larger than SVA_A itself (1387 on Dfam 3.8). Is this behavior by design?

Regarding the bug in bed format, the fix works well. I found a similar 1-based coordinate in SVA VNTR part (repmask_vcf.sh line 107):

awk '{if ($15 == "+") {print $16"\t"$18"\t"$19"\t"$1} else {print $16"\t"$20"\t"$19"\t"$1}}' > SVA_candidates.bed

Could you also kindly confirm if the SVA_VNTR.bed is 1-based or 0-based?

echo -e "SVA_A\t436\t855\t+" >> SVA_VNTR.bed
echo -e "SVA_B\t431\t867\t+" >> SVA_VNTR.bed
echo -e "SVA_C\t432\t851\t+" >> SVA_VNTR.bed
echo -e "SVA_D\t432\t689\t+" >> SVA_VNTR.bed
echo -e "SVA_E\t428\t864\t+" >> SVA_VNTR.bed
echo -e "SVA_F\t435\t857\t+" >> SVA_VNTR.bed

Besides, when I run the repmask_vcf.sh on my VCF, I found some code runs very slow due to repetitively grep files. This can significantly slow the annotation when there are a lot of SVs. I made some modifications, and it gave the same output on my data. I will open a PR for your review. It also changes SVA_candidates.bed to 0-based.

Thank you!

Best, Han

clemgoub commented 1 month ago

Dear @Han-Cao,

Thanks for bringing this to my attention. Indeed, what you see is the "expected" behavior, but in this particular case, I think it highlight a challenge that needs to be taken care of.

Based on the RepeatMasker hit IDs, I think that this SV has received other TE annotation right? (let me know if I'm wrong here!). At least there should be some hits between the 2/2/2 and the 14. Before talking about the behavior of OneCode on this, I would say that if indeed there are more hits in the SV to different TEs, it is highly unlikely (especially for human data) that this SV is a canonical pME (i.e.: real product of transposition). Rather, I would think this is an SV that happen to contain >= 80% of TEs. So at the end of the day, and specifically for pME analysis, I would filter out this variant, as shown in the paper (benchmark + filters).

However, I agree with you that the output is misleading. OneCode merge hits based on co-linearity of the Repeatmasker hits (and some heuristics based on the consensus length I think). In that case, it would say that a < 300bp region of the SVA maps over a 2476 bp of the SV. The only way this would be realistic is if the variant was purely a VNTR. VNTR can expand and thus an actual SVA hit may very well be longer than the consensus. However in that case, these ~260 bp maps at the beginning of the SVA. So in that regards, the result of OneCode seem to be problematic.

Nevertheless, the total TE span is not calculated by summing all the OneCode hit length, but by merging overlapping hits into the least amount of bed record and and taking the total of base-pairs covered (i.e. sum of bases in the SV with at least 1 hit).

I think that it might be worth adding the option to bypass OneCode. That would require a little bit of time on my end, but this is feasible for a future update. In the meantime, if your purpose is to keep pME candidate, I would use a filter that only keep variants with nhits=1;.

Regarding your second questions, you are correct the SVA regions are 1-based and I have accepted your pull request which takes care of it. Thank you!

Let me know what you think, and thanks again for the feedback!

Cheers,

Clément

Han-Cao commented 1 month ago

Dear Clément,

Yes, you are right. That insertion is annotated by a lot of TEs. I was concerning about that would overestimate the region covered by assembled TEs like that SVA.

However, I totally agree that such SVs are too complicated to analysis, and we should focus on nhits=1; (maybe also TwP with nhits=2;?). So, the "misleading" annotation will not affect the downstream analysis.

Thank you so much for your patience and clarification!

Best, Han

# RepeatMasker output
  315   19.2  0.0  0.0          -10001037-10001039.INS.1           2      53   (2965)   +       FRAM    SINE/Alu          115   166    (10)     1  
  385   30.1  1.7  1.5          -10001037-10001039.INS.1         145     194   (2824)   +       SVA_A   Retroposon/SVA     68   225  (1162)     2  
  377   24.0  0.0  0.0          -10001037-10001039.INS.1         195     269   (2749)   C       7SLRNA  srpRNA          (241)    79       5     3  
  385   30.1  1.7  1.5          -10001037-10001039.INS.1         270     294   (2724)   +       SVA_A   Retroposon/SVA    226   312  (1075)     2  
  452   24.4  8.2  1.9          -10001037-10001039.INS.1         310     455   (2563)   +       MIR     SINE/MIR            2   156   (106)     4  
 1312   12.1  0.0  0.0          -10001037-10001039.INS.1         483     655   (2363)   C       AluSg   SINE/Alu         (11)   299     127     5  
   16    0.0  0.0  0.0          -10001037-10001039.INS.1         659     675   (2343)   +       (A)n    Simple_repeat       1    17     (0)     6  
  272   28.2  7.3  4.0          -10001037-10001039.INS.1         690     785   (2233)   C       MLT1J   LTR/ERVL-MaLR   (413)    99       1     7  
  787    7.1  0.0  0.0          -10001037-10001039.INS.1         972    1070   (1948)   C       HY3     scRNA             (2)    99       1     8  
  451   22.6  3.6 10.8          -10001037-10001039.INS.1        1160    1298   (1720)   +       SVA_A   Retroposon/SVA    189   318  (1069)     2  
  353   12.5  1.1 18.7          -10001037-10001039.INS.1        1221    1308   (1710)   C       AluJr   SINE/Alu        (232)    80       6     9 *
  441   28.4 15.1  1.1          -10001037-10001039.INS.1        1309    1658   (1360)   +       L2c     LINE/L2          2989  3387     (0)    10  
  399   28.4  0.0  0.0          -10001037-10001039.INS.1        1734    1835   (1183)   C       SVA_A   Retroposon/SVA (1087)   300     199    11  
 2123   14.5  0.0  0.3          -10001037-10001039.INS.1        1861    2164    (854)   C       AluJo   SINE/Alu          (7)   305       3    12  
 2304    9.2  0.3  0.0          -10001037-10001039.INS.1        2171    2462    (556)   C       AluSg   SINE/Alu         (17)   293       1    13  
  503   26.4  0.8  0.8          -10001037-10001039.INS.1        2495    2620    (398)   +       SVA_A   Retroposon/SVA    209   334  (1053)    14  
  207   11.4 15.9  0.0          -10001037-10001039.INS.1        2869    2912    (106)   +       MIR     SINE/MIR          211   261     (1)    15  
  476   19.2  0.0  0.0          -10001037-10001039.INS.1        2936    3013      (5)   +       FRAM    SINE/Alu            1    78    (98)    16  
clemgoub commented 1 month ago

Hi Han,

You are absolutely right; for the Twin Primed L1, I keep 2hits + indeed. Here's the filter I used with R(you can find all the scripts here)

library(vcfR)
HPRCvcf<-read.vcfR("HPRC_pangenome.vcf")
Lfilter<-abs(as.numeric(extract.info(HPRCvcf, "SVLEN"))) >= 250
HitsFilter<-as.numeric(extract.info(HPRCvcf, "n_hits")) == 1 | extract.info(HPRCvcf, "mam_filter_1") != "None"
SVAVNTF_filter<-extract.info(HPRCvcf, "mam_filter_2") == "None"
TEclassFilter<-grepl(paste(c("Alu", "L1", "SVA"), collapse = "|"), extract.info(HPRCvcf, "repeat_ids"))
HPRCvcf_F<-HPRCvcf[Lfilter & HitsFilter & TEclassFilter & SVAVNTF_filter]

I'll close this issue for now, but don't hesitate to reach out if you have more questions/suggestion, we deeply appreciate it!

Cheers,

Clément