cgroza / GraffiTE

GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies and/or long reads, and genotypes the discovered polymorphisms in read sets using genome graphs.

Inquiry about Integrating Minigraph-cactus into the Pipeline for Enhanced SV Analysis #20

Status: Closed (porkfan closed this issue 2 months ago)

porkfan commented 8 months ago

Dear Guillaume Bourque's group,

Firstly, I would like to express my sincere gratitude for your remarkable software. It is indeed a significant contribution to the field. I've been using your pipeline for SV (Structural Variant) analysis, but I've encountered some intriguing differences in the results when comparing two methods.

I have been generating pan-genome SVs with Minigraph-cactus and, for comparison, calling SVs against a reference genome with your pipeline, which uses svim. In my tests, the alignment results from Minigraph-cactus appear to be more reliable.

Given these observations, I am curious about the possibility of integrating the Minigraph-cactus process into your pipeline. Specifically, I am interested in using the SVs generated by Minigraph-cactus and then applying your TE (Transposable Element) identification process for further analysis and genotyping.

I believe that the incorporation of Minigraph-cactus into your pipeline could enhance the accuracy and efficacy of SV analysis. Could you please let me know if such integration is feasible in your future development plans?

Your guidance and insights on this matter would be greatly appreciated.

Thank you for your time and consideration.

Best regards, Yfchen

cgroza commented 8 months ago

Hi, thank you for giving our pipeline a try. I think the fastest way to integrate minigraph-cactus into our pipeline is to run vg deconstruct on the minigraph-cactus graph and pass the resulting VCF to GraffiTE via --vcf for annotation. We have never tried this, so we might need to tweak the pipeline a bit to support it.

We could also add support for running minigraph-cactus within the pipeline. I have experience with minigraph, but I have never run cactus myself. Maybe you could contribute a sample of your scripts to help us?

My thanks, Cristian

porkfan commented 7 months ago

Dear Cristian, Thank you for your prompt response. I apologize for the delay in my reply, as I was on holiday break. I can provide you with the scripts I use for running cactus, as well as an example of the resulting VCF file. However, the VCF files that are actually used for subsequent analyses need to undergo a series of processing steps from the original VCF file, which I can also provide to you. I am very interested in this matter and hope to continue our discussions about integrating cactus. Since the data has not yet been published, I would prefer to send it to you privately via email.

Thank you once again!

Best regards, Yfchen

cgroza commented 7 months ago

As an update, we just successfully processed HPRC with GraffiTE from the VCF obtained by decomposing the pangenome. So an approach where we decompose the minigraph-cactus VCF, filter it for SVs and plug into GraffiTE is viable.
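The "filter it for SVs" step can be sketched in plain Python. This is only an illustration, not the filter used for the HPRC run: the 50 bp threshold, the rule "either allele is at least 50 bp long", and the function names `is_sv` and `filter_sv_records` are all assumptions made for the example.

```python
from typing import Iterable, Iterator


def is_sv(ref: str, alt: str, min_len: int = 50) -> bool:
    """Assumed rule: a sequence-resolved allele pair is an SV when either side is >= min_len bp."""
    # Symbolic or missing ALT alleles carry no sequence, so they are skipped here.
    if alt.startswith("<") or alt in {"*", "."}:
        return False
    return max(len(ref), len(alt)) >= min_len


def filter_sv_records(lines: Iterable[str], min_len: int = 50) -> Iterator[str]:
    """Yield header lines unchanged and only those records with at least one SV-sized allele."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        if any(is_sv(ref, alt, min_len) for alt in alts):
            yield line


if __name__ == "__main__":
    vcf = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t100\t.\tA\tG\t.\t.\t.\n",                    # SNV, dropped
        "chr1\t200\t.\tA\tA" + "T" * 60 + "\t.\t.\t.\n",   # 60 bp insertion, kept
    ]
    for record in filter_sv_records(vcf):
        print(record, end="")
```

A multi-allelic record is kept whole if any one of its ALT alleles passes, so splitting multi-allelic sites beforehand may be preferable depending on what the downstream tool expects.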

porkfan commented 2 months ago

Thank you for your prompt reply. I also tried using the minigraph-cactus VCF, filtering it for SVs, and then analyzing it with GraffiTE, which indeed worked. However, one issue has been troubling me for a long time without a good solution. Most SV identification software focuses on the INS and DEL types, and your pipeline also extracts only these two types for TE identification. The VCF from minigraph-cactus, however, contains another type of variant, named COMPLEX or MNV (>50bp), which should also count as an SV type but is not recognized by your pipeline. I attempted to write my own code, incorporating some of the code from your pipeline, to handle it.

I am still unsure how to proceed, because for this type of SV both the ref and alt allele sequences are relatively long and may both contain TE insertions. Unlike INS or DEL, where only one allele needs to be examined, this type of variant requires annotating and comparing both the ref and alt alleles to identify the specific TE insertions. Can you provide some suggestions on how to handle this type of SV?
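One possible first step for comparing the two alleles, sketched here under stated assumptions, is to trim the bases that the ref and alt sequences share at both ends and classify what is left. If one side becomes empty, the record reduces to a plain INS or DEL; if both sides keep sequence, it is a genuine COMPLEX case and each remaining segment would need its own TE annotation. The function names, the 50 bp threshold, and the "SMALL" label are hypothetical and are not part of GraffiTE.

```python
from typing import Tuple


def trim_shared_flanks(ref: str, alt: str) -> Tuple[int, str, str]:
    """Strip the common prefix, then the common suffix; return (prefix_len, ref_core, alt_core)."""
    prefix = 0
    limit = min(len(ref), len(alt))
    while prefix < limit and ref[prefix] == alt[prefix]:
        prefix += 1
    ref, alt = ref[prefix:], alt[prefix:]

    suffix = 0
    limit = min(len(ref), len(alt))
    while suffix < limit and ref[-1 - suffix] == alt[-1 - suffix]:
        suffix += 1
    if suffix:
        ref, alt = ref[:-suffix], alt[:-suffix]
    return prefix, ref, alt


def classify(ref: str, alt: str, min_len: int = 50) -> Tuple[str, int, str, str]:
    """Label an allele pair as INS, DEL, COMPLEX, or SMALL after flank trimming."""
    offset, ref_core, alt_core = trim_shared_flanks(ref, alt)
    if max(len(ref_core), len(alt_core)) < min_len:
        kind = "SMALL"      # below the assumed SV size threshold
    elif not ref_core:
        kind = "INS"        # only the alt allele carries extra sequence
    elif not alt_core:
        kind = "DEL"        # only the ref allele carries extra sequence
    else:
        kind = "COMPLEX"    # both alleles keep sequence; annotate both cores
    return kind, offset, ref_core, alt_core


if __name__ == "__main__":
    kind, offset, ref_core, alt_core = classify("A" + "C" * 60 + "G", "A" + "T" * 70 + "G")
    print(kind, offset, len(ref_core), len(alt_core))
```

In repetitive sequence the trimmed boundaries are not unique, because prefix-first trimming simply picks one of several equivalent placements, so the returned offset should be read as one valid representation rather than the true breakpoint.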

The reason I insist on using the results from minigraph-cactus is that its SV detection performance is better than other software for my species. I believe combining the results from your software and minigraph-cactus would be very beneficial for my research.

Thank you again for your prompt reply, Yfchen

cgroza commented 2 months ago

To truly handle COMPLEX variants, we would need to spell the alleles into sequences by following their paths through the genome graph bubbles, run these sequences through RepeatMasker, and then annotate the graph or the VCF file with the resulting annotation. We cannot handle this right now, since we initially designed the pipeline to look for de novo transposition events (which are not complex SVs), and not genomic rearrangements that happen to involve transposable element sequence.
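The "spell the alleles into sequences" step can be illustrated with a toy sketch. It assumes each allele's path through a bubble is available as a walk string in the GFA style, where `>` means a node is read forward and `<` means it is read as its reverse complement, and that node sequences have already been loaded into a dictionary. The names `spell_traversal` and `node_seqs` are hypothetical, and the RepeatMasker and annotation steps are not shown.

```python
import re
from typing import Dict

# Translation table for complementing DNA, preserving case and N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def spell_traversal(traversal: str, node_seqs: Dict[str, str]) -> str:
    """Concatenate node sequences along a walk such as '>1<2>3'."""
    steps = re.findall(r"([<>])([^<>]+)", traversal)
    parts = []
    for orientation, node_id in steps:
        seq = node_seqs[node_id]
        # '<' means the node is traversed on the reverse strand.
        parts.append(seq if orientation == ">" else reverse_complement(seq))
    return "".join(parts)


if __name__ == "__main__":
    node_seqs = {"1": "AC", "2": "GGT", "3": "T"}
    print(spell_traversal(">1>2>3", node_seqs))
    print(spell_traversal(">1<2>3", node_seqs))
```

The spelled sequences for the ref and alt walks of one bubble could then be written to FASTA and masked separately, which is the part of the workflow the pipeline does not currently implement.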

But these are all great ideas for future development of GraffiTE if we want to attack transposable elements in complex SVs.

porkfan commented 2 months ago

Thank you very much for your prompt reply. I will close my question now.