Closed Han-Cao closed 1 month ago
Hello @Han-Cao,
Yes you are right, this looks like a bug. I agree with you that the coordinates should be corrected as you suggest. I will be fixing it promptly and will let you know here.
Regarding how hit length and overall TE span on SV is calculated:
.out
file) is converted to bed (of course with -1 difference that you noticed) such as the "length" is the total bp on the SV covered by a given TE hit.bedtool merge
. This way, the total TE span cannot be longer than the SV.Let me know if you need further clarification! I will be fixing this bug ASAP.
Cheers,
Clément
Dear @Han-Cao,
I have fixed the uses withRepeatMasker's 1-based coordinates so correct lengths can be calculated. Thanks again for pointing out this important point.
Commit: https://github.com/cgroza/GraffiTE/commit/bd2f1dea99f2a3e8b6e70d0d9d8b3953bf0e44a6
In repmask_vcf l.73:
awk 'NR > 3 {print $5"\t"$6-1"\t"$7"\t"$10}' ${REPMASK_ONECODE_OUT} | bedtools merge > merge.bed # add -1 to start to meet .bed format
And in annotate_vcf.R l.24:
qry_start = as.integer(qry_start)-1, # add -1 to be 0-based and calculate length
This should have the same effect than your suggested modifications, but correcting the coordinates earlier in the script. Let me know if this is working as you expect and if further bug persist. Thanks again for your feedback,
Cheers,
Clément
Dear Clément,
Thanks for your quick reply!
Could you clarify more on the TE span of assembled TEs? I am wondering whether it may overestimate the span of the TEs assembled by OneCode. In the above example INS, the following 4 regions aligned by SVA_A is assembled into 1 record:
# RepeatMasker output
385 30.1 1.7 1.5 -10001037-10001039.INS.1 145 194 (2824) + SVA_A Retroposon/SVA 68 225 (1162) 2
385 30.1 1.7 1.5 -10001037-10001039.INS.1 270 294 (2724) + SVA_A Retroposon/SVA 226 312 (1075) 2
451 22.6 3.6 10.8 -10001037-10001039.INS.1 1160 1298 (1720) + SVA_A Retroposon/SVA 189 318 (1069) 2
503 26.4 0.8 0.8 -10001037-10001039.INS.1 2495 2620 (398) + SVA_A Retroposon/SVA 209 334 (1053) 14
# OneCode output
385/385/451/503 26.484 2.435 5.464 -10001037-10001039.INS.1 145 2620 267 + SVA_A Retroposon/SVA 68 334 (1053) 2/2/2/14 4
For individual matched region, the sum of their q_end - q_start + 1
is 340, but this value for the assembled record is 2476, which is much larger the region matched by SVA_A (340), the length of SVA_A aligned to the SV (267, 8th column in OneCode output), and even larger than SVA_A itself (1387 on Dfam 3.8). Is this behavior by design?
Regarding the bug in bed format, the fix works well. I found a similar 1-based coordinate in SVA VNTR part (repmask_vcf.sh line 107):
awk '{if ($15 == "+") {print $16"\t"$18"\t"$19"\t"$1} else {print $16"\t"$20"\t"$19"\t"$1}}' > SVA_candidates.bed
Could you also kindly confirm if the SVA_VNTR.bed
is 1-based or 0-based?
echo -e "SVA_A\t436\t855\t+" >> SVA_VNTR.bed
echo -e "SVA_B\t431\t867\t+" >> SVA_VNTR.bed
echo -e "SVA_C\t432\t851\t+" >> SVA_VNTR.bed
echo -e "SVA_D\t432\t689\t+" >> SVA_VNTR.bed
echo -e "SVA_E\t428\t864\t+" >> SVA_VNTR.bed
echo -e "SVA_F\t435\t857\t+" >> SVA_VNTR.bed
Besides, when I run the repmask_vcf.sh
on my VCF, I found some code runs very slow due to repetitively grep
files. This can significantly slow the annotation when there are a lot of SVs. I made some modifications, and it gave the same output on my data. I will open a PR for your review. It also changes SVA_candidates.bed
to 0-based.
Thank you!
Best, Han
Dear @Han-Cao,
Thanks for bringing this to my attention. Indeed, what you see is the "expected" behavior, but in this particular case, I think it highlight a challenge that needs to be taken care of.
Based on the RepeatMasker hit IDs, I think that this SV has received other TE annotation right? (let me know if I'm wrong here!). At least there should be some hits between the 2/2/2 and the 14. Before talking about the behavior of OneCode on this, I would say that if indeed there are more hits in the SV to different TEs, it is highly unlikely (especially for human data) that this SV is a canonical pME (i.e.: real product of transposition). Rather, I would think this is an SV that happen to contain >= 80% of TEs. So at the end of the day, and specifically for pME analysis, I would filter out this variant, as shown in the paper (benchmark + filters).
However, I agree with you that the output is misleading. OneCode merge hits based on co-linearity of the Repeatmasker hits (and some heuristics based on the consensus length I think). In that case, it would say that a < 300bp region of the SVA maps over a 2476 bp of the SV. The only way this would be realistic is if the variant was purely a VNTR. VNTR can expand and thus an actual SVA hit may very well be longer than the consensus. However in that case, these ~260 bp maps at the beginning of the SVA. So in that regards, the result of OneCode seem to be problematic.
Nevertheless, the total TE span is not calculated by summing all the OneCode hit length, but by merging overlapping hits into the least amount of bed record and and taking the total of base-pairs covered (i.e. sum of bases in the SV with at least 1 hit).
I think that it might be worth adding the option to bypass OneCode. That would require a little bit of time on my end, but this is feasible for a future update. In the meantime, if your purpose is to keep pME candidate, I would use a filter that only keep variants with nhits=1;
.
Regarding your second questions, you are correct the SVA regions are 1-based and I have accepted your pull request which takes care of it. Thank you!
Let me know what you think, and thanks again for the feedback!
Cheers,
Clément
Dear Clément,
Yes, you are right. That insertion is annotated by a lot of TEs. I was concerning about that would overestimate the region covered by assembled TEs like that SVA.
However, I totally agree that such SVs are too complicated to analysis, and we should focus on nhits=1;
(maybe also TwP with nhits=2;
?). So, the "misleading" annotation will not affect the downstream analysis.
Thank you so much for your patience and clarification!
Best, Han
# RepeatMasker output
315 19.2 0.0 0.0 -10001037-10001039.INS.1 2 53 (2965) + FRAM SINE/Alu 115 166 (10) 1
385 30.1 1.7 1.5 -10001037-10001039.INS.1 145 194 (2824) + SVA_A Retroposon/SVA 68 225 (1162) 2
377 24.0 0.0 0.0 -10001037-10001039.INS.1 195 269 (2749) C 7SLRNA srpRNA (241) 79 5 3
385 30.1 1.7 1.5 -10001037-10001039.INS.1 270 294 (2724) + SVA_A Retroposon/SVA 226 312 (1075) 2
452 24.4 8.2 1.9 -10001037-10001039.INS.1 310 455 (2563) + MIR SINE/MIR 2 156 (106) 4
1312 12.1 0.0 0.0 -10001037-10001039.INS.1 483 655 (2363) C AluSg SINE/Alu (11) 299 127 5
16 0.0 0.0 0.0 -10001037-10001039.INS.1 659 675 (2343) + (A)n Simple_repeat 1 17 (0) 6
272 28.2 7.3 4.0 -10001037-10001039.INS.1 690 785 (2233) C MLT1J LTR/ERVL-MaLR (413) 99 1 7
787 7.1 0.0 0.0 -10001037-10001039.INS.1 972 1070 (1948) C HY3 scRNA (2) 99 1 8
451 22.6 3.6 10.8 -10001037-10001039.INS.1 1160 1298 (1720) + SVA_A Retroposon/SVA 189 318 (1069) 2
353 12.5 1.1 18.7 -10001037-10001039.INS.1 1221 1308 (1710) C AluJr SINE/Alu (232) 80 6 9 *
441 28.4 15.1 1.1 -10001037-10001039.INS.1 1309 1658 (1360) + L2c LINE/L2 2989 3387 (0) 10
399 28.4 0.0 0.0 -10001037-10001039.INS.1 1734 1835 (1183) C SVA_A Retroposon/SVA (1087) 300 199 11
2123 14.5 0.0 0.3 -10001037-10001039.INS.1 1861 2164 (854) C AluJo SINE/Alu (7) 305 3 12
2304 9.2 0.3 0.0 -10001037-10001039.INS.1 2171 2462 (556) C AluSg SINE/Alu (17) 293 1 13
503 26.4 0.8 0.8 -10001037-10001039.INS.1 2495 2620 (398) + SVA_A Retroposon/SVA 209 334 (1053) 14
207 11.4 15.9 0.0 -10001037-10001039.INS.1 2869 2912 (106) + MIR SINE/MIR 211 261 (1) 15
476 19.2 0.0 0.0 -10001037-10001039.INS.1 2936 3013 (5) + FRAM SINE/Alu 1 78 (98) 16
Hi Han,
You are absolutely right; for the Twin Primed L1, I keep 2hits + indeed. Here's the filter I used with R
(you can find all the scripts here)
library(vcfR)
HPRCvcf<-read.vcfR("HPRC_pangenome.vcf")
Lfilter<-abs(as.numeric(extract.info(HPRCvcf, "SVLEN"))) >= 250
HitsFilter<-as.numeric(extract.info(HPRCvcf, "n_hits")) == 1 | extract.info(HPRCvcf, "mam_filter_1") != "None"
SVAVNTF_filter<-extract.info(HPRCvcf, "mam_filter_2") == "None"
TEclassFilter<-grepl(paste(c("Alu", "L1", "SVA"), collapse = "|"), extract.info(HPRCvcf, "repeat_ids"))
HPRCvcf_F<-HPRCvcf[Lfilter & HitsFilter & TEclassFilter & SVAVNTF_filter]
I'll close this issue for now, but don't hesitate to reach out if you have more questions/suggestion, we deeply appreciate it!
Cheers,
Clément
Hi,
I am trying to use the pipeline to annotate my VCF. It seems there is a bug when calculating the span of TE due to 1-based coordinate.
When calculating span/length of TEs, the start and end position of the match on query sequence are extracted from RepeatMasker/OneCode output, which are 1-based. To calculate the length of the match, we need to do
end - start + 1
.I currently find that this affects repmask_vcf line 79
and annotate_vcf.R line 64:
Please correct me if I was wrong.
Moreover, I am also wondering if the "length" refers to the window a TE span on the SV or the length of SV matched by TE.
For example, the output of one SV is:
The SVA_A are mapped to multiple regions of the SV, starting from 145 to 2620. The aligned TE's length is much less than the region between start and end.
Thanks a lot.