Dear @cuiyangkai,
This is a great observation, and my interpretation is that your data are actually representative of a typical SV experiment. For a given reference genome, the ratio of INS to DEL SVs will depend on the allele frequency spectrum, and especially on the frequency of rare alleles (or rare TE insertions in this case).
In humans, most TE polymorphisms segregate at low frequencies. In other words, there are a lot of TE variants found in only a single individual. Conversely, common TE variants (say, found in 5%-95% of the population) are far fewer. A common TE variant has a higher probability of being in the reference genome, and thus shows up as a "deletion" when other genomes are compared against it; conversely, a rare TE variant will most often be found in a genome other than the reference, and thus will more frequently be seen as an "insertion". We've actually looked into it here:
Most of the "Reference" TEs (meaning TEs present in the reference but absent in at least one sample, a.k.a. deletions) are segregating at higher frequencies than "non-reference" TEs (insertions).
We can recapitulate this with the diagram below. I represented a reference genome and some alternative haplotypes carrying TEs (circles). Some are "high frequency" (the blues) and some are rare (the reds). The blues have a higher chance of being found in the reference genome, so they will show up more frequently as "DEL" than the red variants, which are rare (so rare here that I represent them as present in just one of the genomes); from the perspective of the reference genome, these rare variants will almost always be "INS". And you can see that these proportions will hold even if we change the reference.
So in my opinion, the difference you see is normal and indicates that your variable TEs segregate mostly at low frequency, which is expected.
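To make the argument above concrete, here is a small simulation sketch in Python. It is only a toy model, not something GraffiTE does: each TE locus gets a population frequency `p`, the reference carries the TE with probability `p`, and each of the other genomes carries it independently with probability `p`. The function name `simulate_ins_del`, the `skew` parameter, and the choice of drawing `p` as a uniform number raised to a power are all assumptions made for illustration.

```python
import random


def simulate_ins_del(n_loci=20000, n_samples=20, skew=4.0, seed=1):
    """Count loci that would be called INS or DEL relative to one reference.

    skew=1 gives a flat frequency spectrum; larger values push most
    loci toward low frequency (illustrative assumption, not fitted to data).
    """
    rng = random.Random(seed)
    n_ins = 0
    n_del = 0
    for _ in range(n_loci):
        # Population frequency of the TE at this locus.
        p = rng.random() ** skew
        # Does the reference genome happen to carry the TE?
        in_ref = rng.random() < p
        # How many of the other genomes carry it?
        carriers = sum(rng.random() < p for _ in range(n_samples))
        if in_ref and carriers < n_samples:
            # Present in the reference, missing in at least one sample.
            n_del += 1
        elif not in_ref and carriers > 0:
            # Absent from the reference, present in at least one sample.
            n_ins += 1
    return n_ins, n_del


if __name__ == "__main__":
    for skew in (1.0, 4.0):
        n_ins, n_del = simulate_ins_del(skew=skew)
        print(f"skew={skew}: INS={n_ins} DEL={n_del}")
```

With a flat spectrum (`skew=1.0`) the two counts come out about equal, while a spectrum dominated by rare variants (`skew=4.0`) gives clearly more INS than DEL, whichever genome plays the role of the reference.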
I hope that makes sense! Let me know what you think!
Cheers,
Clément
Regarding your TE library, I was wondering: did you remove redundancy after combining the TE libraries of the 20 genomes? If you ran EDTA independently on each genome, you can expect several families (the older, mostly fixed ones) to be found in most genomes, and thus your concatenated library will include multiple consensus sequences representing the same family (near duplicates or exact duplicates). This can confuse RepeatMasker, which randomly assigns one or the other consensus to a given variant even though they all belong to the same TE family. You can use a simple clustering approach, based on sequence similarity, to remove the redundancy.
Hi,
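As a first pass, the exact duplicates can be dropped with a few lines of Python. This is a minimal sketch under stated assumptions: the function names are made up for this example, it treats a sequence and its reverse complement as the same consensus, and it does nothing about near duplicates, which still need a proper similarity-based clustering tool.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Only A, C, G, T and N are complemented; other IUPAC codes are left unchanged.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def drop_exact_duplicates(records):
    """Keep the first copy of each sequence, ignoring case and strand."""
    seen = set()
    kept = []
    for name, seq in records:
        upper = seq.upper()
        # The same key is produced for a sequence and its reverse complement.
        key = min(upper, reverse_complement(upper))
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept


if __name__ == "__main__":
    # "TElib.fa" is the library name used in the command from this thread.
    with open("TElib.fa") as handle:
        unique = drop_exact_duplicates(parse_fasta(handle))
    with open("TElib.dedup.fa", "w") as out:  # hypothetical output name
        for name, seq in unique:
            out.write(f">{name}\n{seq}\n")
```

Comparing the number of sequences before and after gives a quick idea of how much of the concatenated library is redundant.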
Thank you very much for your patient response to my questions. Your understanding of this field is so professional that you immediately pinpointed the crux of the issue. Your explanation was incredibly clear and illustrative, making everything so much easier to understand. I hope that through more learning, I can become as knowledgeable and professional as you.
Best regards, Yangkai Cui
Hello,
I'm currently using GraffiTE to identify TE-associated SVs from a VCF file containing the SVs of interest. My command is as follows:
nextflow run /usr_storage2/cyk/work/software/GraffiTE/main.nf \
    --assemblies /usr_storage2/cyk/work/dyy1/GraffiTE/assemblies.tsv \
    --TE_library TElib.fa \
    --reference /usr_storage2/cyk/work/dyy1/panpop_pipline/Ref.fa \
    --reads /usr_storage2/cyk/work/dyy1/GraffiTE/longreads.tsv \
    --vcf /usr_storage2/cyk/work/dyy1/newTE/new.vcf
While reviewing the results, I noticed that the number of TE-associated deletions (DEL) is much lower than the number of TE-associated insertions (INS): 3,188 versus 58,236. I'm unsure why this discrepancy exists. It could be due to an issue with my TE library, which was generated with EDTA: I ran EDTA on each of my 20 samples and then concatenated the resulting library files. Do you have any suggestions or solutions for this issue based on the information above?