final consensus sequences

tbadet commented 3 years ago

Hi Kevin,

First, thanks a lot for creating this highly automated and comprehensible tool for transposon studies. I believe it will prove very useful to the community.

I have briefly applied the reasonaTE annotation pipeline to the fungal species I currently work with. For this species, we had previously applied a similar set of discovery tools and manually curated the transposons identified across 19 chromosome-level assemblies [fungi] (https://doi.org/10.1186/s12915-020-0744-3).

Overall, great news, your pipeline was able to recover the large majority of the elements I have in my manually curated dataset. If I may, I have a couple of questions regarding the annotation/classification pipeline.

My first question is how best to integrate the results from multiple assemblies of the same species into a final set of consensus sequences. With my current dataset, I tried simply concatenating the resulting consensus sequences for the 19 assemblies and performing an additional clustering step (cd-hit), but I am still dealing with > 20k consensus sequences, compared with 304 transposons in my manually curated dataset (a lot of redundancy, I suspect).

My second question is related to the construction of consensus sequences. reasonaTE gives me consensus sequences > 100 kb (unlikely to be true single transposons). Indeed, these large elements predicted by reasonaTE overlap multiple consensus sequences in my curated dataset, but also in its own set of consensus sequences. Shouldn’t elements encompassing 2 or more other elements (clustering and blast steps) be considered as additional copies of these 2 or more individual elements, but dropped as new elements themselves?

Another subsidiary question would be on the best way to retrieve consensus from high-copy elements that didn’t pass the classification algorithm?

Finally, more related to the deTEct pipeline, as you mention in your manuscript, read-mapping inherently comes with some limitations. There are now some methods for structural variation discovery using results from whole-genome assembly alignments, do you foresee their implementation into your tool?

Thanks a lot again for your great preprint.

Bests,

Thomas

DerKevinRiehl commented 3 years ago

Hi Thomas, thank you for your interest in TransposonUltimate and checking it out :-)

Part 1: "Overall, great news, your pipeline was able to recover the large majority of the elements I have in my manually curated dataset. If I may, I have a couple of questions regarding the annotation/classification pipeline."

Glad to hear :-D. Of course, it always depends on the setup and the specific dataset, but I am happy that you could use it and show that the ensemble of many different tools is beneficial.

Part 2: "First question would be on how to best integrate the results from multiple assemblies of the same species into a final set of consensus sequences?"

I think that's a good question. As described in the manuscript and visible in the code, reasonaTE uses CD-HIT for this. In general, it all depends on the thresholds you set, right? How similar the sequences have to be, etc. Unfortunately, generating consensus sequences is a separate bioinformatics task that I did not work on, so I cannot offer expertise or suggestions on it.

I guess you could try different tools (do a short literature or online search, or ask which tools your lab colleagues use) and see which parameters they offer and how similar the "cluster" results of the different tools are.
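To compare clustering runs at different thresholds, it can help to summarise each CD-HIT `.clstr` file by cluster size and by how many assemblies contribute to each cluster. The following is only a minimal sketch. It assumes the usual `.clstr` text layout (a `>Cluster N` line followed by member lines, with the representative marked by `*`), and it assumes a hypothetical naming scheme in which every sequence name is prefixed with its assembly ID (`asm01|te_17`). That naming scheme is not something reasonaTE produces by itself.

```python
import re

# Captures the sequence name between '>' and '...' on a .clstr member line.
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return {representative_name: [member_names]} from CD-HIT .clstr lines."""
    clusters = {}
    members, representative = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster starts: store the previous one, if any.
            if members:
                clusters[representative or members[0]] = members
            members, representative = [], None
            continue
        match = MEMBER_RE.search(line)
        if match is None:
            continue
        name = match.group(1)
        members.append(name)
        if line.endswith("*"):  # '*' marks the representative sequence
            representative = name
    if members:
        clusters[representative or members[0]] = members
    return clusters


def assembly_support(clusters, assembly_of):
    """Count the distinct assemblies contributing members to each cluster."""
    return {rep: len({assembly_of(m) for m in mem}) for rep, mem in clusters.items()}


# Toy input; names follow the hypothetical 'assembly|element' convention.
example = [
    ">Cluster 0",
    "0\t5000nt, >asm01|te_17... *",
    "1\t4980nt, >asm02|te_03... at +/98.50%",
    ">Cluster 1",
    "0\t1200nt, >asm01|te_42... *",
]

clusters = parse_clstr(example)
support = assembly_support(clusters, lambda name: name.split("|")[0])
print(support)
```

Clusters supported by only one assembly out of many are candidates for closer inspection, although whether to keep or drop them remains a judgment call.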

Part 3: "Second question is related to the construction of consensus sequences. reasonaTE gives me consensus sequences > 100 kb (unlikely true single transposons)"

Well, similar to your previous question, I can't offer too much advice here. And again, I think it depends on the thresholds you could set to find shorter consensus sequences in larger clusters.
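One way to postprocess along the lines suggested in the question is to flag a long consensus as a likely composite when two or more distinct shorter consensus sequences align to non-overlapping parts of it. The sketch below is an illustration only and is not part of reasonaTE. The hit tuples `(long_id, short_id, start, end)` are a hypothetical, pre-parsed form of alignment results (for example from BLAST), and the greedy interval selection is a simple heuristic, not an optimal solution.

```python
from collections import defaultdict


def flag_composites(hits, min_elements=2):
    """Return IDs of long consensus sequences that look like composites.

    hits: iterable of (long_id, short_id, start, end), giving 1-based
    inclusive coordinates of short_id aligned inside long_id.
    A long_id is flagged when at least min_elements distinct short_ids
    can be placed on it without overlapping each other.
    """
    by_long = defaultdict(list)
    for long_id, short_id, start, end in hits:
        if long_id != short_id:  # ignore self-hits
            by_long[long_id].append((min(start, end), max(start, end), short_id))

    composites = set()
    for long_id, intervals in by_long.items():
        intervals.sort(key=lambda iv: iv[1])  # greedy: earliest end first
        seen, last_end = set(), 0
        for start, end, short_id in intervals:
            if start > last_end and short_id not in seen:
                seen.add(short_id)
                last_end = end
        if len(seen) >= min_elements:
            composites.add(long_id)
    return composites


# Toy hits: 'big' contains teA and teB side by side; 'mid' only contains teA twice.
hits = [
    ("big", "teA", 100, 5000),
    ("big", "teB", 6000, 9000),
    ("mid", "teA", 1, 4000),
    ("mid", "teA", 4500, 8000),
]
print(flag_composites(hits))
```

Flagged sequences could then be removed from the consensus set while their constituent elements are kept, but nested insertions are biologically real, so the flags are better treated as candidates for review than as automatic deletions.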

In fact, some transposons such as Helitrons are known to be larger, but 100 kb is indeed too long to be true ^^. However, I think any tool produces output based on its assumptions and input data. Users always need to postprocess the data if they need additional assumptions (such as a maximum transposon length). As far as I remember, we set the maximum length threshold to 1% of the total genome length. Any assumption is questionable here.
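A maximum-length assumption like this is easy to apply as a postprocessing step on a consensus FASTA file. The sketch below assumes plain FASTA input. The default of 1% of the genome length only mirrors the recollection above and has not been checked against the reasonaTE code, and `hard_cap` is a hypothetical extra parameter for an absolute limit.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, genome_length, max_fraction=0.01, hard_cap=None):
    """Keep records no longer than max_fraction of the genome (and hard_cap, if set)."""
    limit = genome_length * max_fraction
    if hard_cap is not None:
        limit = min(limit, hard_cap)
    return [(name, seq) for name, seq in records if len(seq) <= limit]


# Toy example: a 1,000 bp "genome" gives a 10 bp limit at 1%.
fasta = [
    ">short description text",
    "ACGTACGT",
    ">long",
    "ACGTACGTACGT",
]
records = list(read_fasta(fasta))
kept = filter_by_length(records, genome_length=1000)
print([name for name, _ in kept])
```

For a real file, `read_fasta(open("consensus.fasta"))` would work the same way, where the file name is just a placeholder.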

Part 4: "Another subsidiary question would be on the best way to retrieve consensus from high-copy elements that didn’t pass the classification algorithm?"

In addition to my prior answers on consensus sequence generation, I want to emphasize that all sequences pass the classification algorithm RFSB. As discussed in another issue (https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE/issues/2), the classification algorithm does not decide whether a sequence is or is not a transposon. It only tries to predict the most probable transposon class for the sequence.

Part 5: "There are now some methods for structural variation discovery using results from whole-genome assembly alignments, do you foresee their implementation into your tool?"

So in fact we evaluated three tools: Sniffles, SVIM and PBSV (SVIM is not reported in the manuscript, only in the master's thesis that I worked on). SVIM simply did not provide good results on our dataset.

From my perspective, the implementation effort to integrate the output of other SV callers is quite low. If you have suggestions for tools to consider, please let me know (ideally as a list) and I will implement them. I guess all SV callers should report their results in a VCF-like format, right?
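If callers do report VCF, a small reader that pulls out the standard structural-variant INFO keys (`SVTYPE`, `END`, `SVLEN`) is enough to get a common representation. This is only a sketch and not the deTEct implementation. It assumes tab-separated VCF records with at least the eight fixed columns, and it treats missing INFO keys as `None`, since callers differ in which keys they fill in.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flags without '=' map to True."""
    tags = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        tags[key] = value if sep else True
    return tags


def read_sv_records(lines):
    """Extract chrom, start, end, type and length from VCF data lines."""
    records = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        chrom, pos, info = fields[0], fields[1], fields[7]
        tags = parse_info(info)
        records.append({
            "chrom": chrom,
            "start": int(pos),
            "end": int(tags["END"]) if "END" in tags else None,
            "svtype": tags.get("SVTYPE"),
            # SVLEN may hold one value per ALT allele; keep the first.
            "svlen": int(tags["SVLEN"].split(",")[0]) if "SVLEN" in tags else None,
        })
    return records


# Toy VCF content with one insertion and one deletion.
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\tsv1\tN\t<INS>\t.\tPASS\tSVTYPE=INS;END=1000;SVLEN=5200",
    "chr2\t500\tsv2\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=3500;SVLEN=-3000;PRECISE",
]
svs = read_sv_records(vcf)
print(svs)
```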

Thank you very much for your interest. I hope to hear back from you soon. Best regards, Kevin

tbadet commented 3 years ago

Dear Kevin,

Thanks a lot for your prompt and detailed answers.

Regarding your question about potential tools to consider for whole-genome-based variant detection, I think SyRI could be a good option to look into (manuscript).

The tool takes nucmer whole-genome alignments as input to call multiple types of structural rearrangements and can write its output in VCF format.

Best regards and thanks again for your great tool.

Thomas
