4ureliek / TEorthology

Script to help assessing the evolutionary history of a TE family
MIT License
14 stars 4 forks source link

Little piece of advice of TEorthology #2

Closed adeoliveira86 closed 8 months ago

adeoliveira86 commented 8 months ago

Dear @4ureliek,

Thank you for all the available and resourceful scripts. I have been using parseRM and now I am playing around with TEorthology and it is making my life really easier. I would like some piece of advice of TE orthology before I run a massive analysis with it. My intended goal is to identify how a bunch of TE elements already identified across a invertebrate lineage are evolutionary related to each other. My ultimate goal is to identify the core TE elements (TEs found in all analyses organisms), lineage-specific ones, and also some that are shared between between two or more animals. In short words, I would like to generate a pan-repeatome for my clade of interest. I think based on the description of your tool that it would be idea for this kinda of analysis. Then I would use this to reannotate the genomes.

I would like to know if I it is possible to see the output of the software to check if this would fit my goals before investing a lot of time into running the software. Since this is virtually a all-against-all comparison and the .out RM files are quite massive, I am assuming this will take a fair amount of time. Additionally, I have a question about the parameter tree used by the software. Is this tree used as a backbone to infer presence/absence in the species I am analysing? Running with and without it changes much the output of running time?

Best regards and thanks again for the scripts, André

4ureliek commented 8 months ago

Hi,

For the easy answer: the tree is only used to order the columns in the output, which is a user convenience thing I remember adding so it was easier to look at the results, but won't affect the results themselves.

Regarding an output, unfortunately, this is old enough that I do not have outputs on hand. I would really recommend creating test files and running the code with them (will also allow you to see if everything runs right). For example I would use 100 lines from the repeat masker .out file you have, and provide 2 genomes (the same one as the RM .out file and another one), or run it with -filter set to one repeat only, to limit the number of repeats that are blasted. Nothing really replaces looking at the data on your own for a small sample size, with data you know, before running the whole thing!

The code will take copy+flanking sequences when the flanking are not a TE in the RM.out file (this is the first caveat: this requires well annotated genomes as input), then blast that. The output will have columns for each provided genomes, with counts of extracted copies, then the number of total hits, and then hits with the flanking sequences present (when the flanking sequences are present, and you know the input was not nested in some repeats, then this is in favor of this copy being orthologous between the genomes). Note that the results are rarely clear cut; you'd expect false negatives and false positives.

The code is blast based. It uses blast defaults, and the cutoff evalue is set by the user with -e. Lowering it will speed things up but it depends what divergence you expect between your most diverged genomes.

adeoliveira86 commented 8 months ago

Hello @4ureliek,

thanks for the swift answer. I will follow your instructions and try to narrow down the analyses to a couple of TE families which interest me the most. I am afraid computing all elements will be extremely time-consuming.

A short additional question. Is there any available scripts designed to plot the results originated by parseRM (including the repeat landscape). I have written a few Rscripts to parse the divsum files generated by calcDivergenceFromAlignment.pl and I guess I can adapt those for this purpose.

Cheers, André

4ureliek commented 8 months ago

Yes you are correct, it will likely take a while even with parallelization, since blast is just not that fast, and the code not optimized (e.g. if blast jobs are too large, splitting query and DBs into chunks speeds things up massively).

I have not written code to plot automatically the repeat landscapes as far as I can remember - I would need to go dig into my archives, but for these landscapes I made the paper figures with Illustrator, not R (https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/nyas.13295)

adeoliveira86 commented 8 months ago

I will take the "splitting" into consideration. Let's see what will come from my analyses, but at any case, thanks again for your help and scripts.

4ureliek commented 8 months ago

Oh, I meant inside the code - splitting the inputs will be annoying to do, unless it's with using the filter option. Let me know how it goes!

4ureliek commented 8 months ago

Hey @adeoliveira86 - I found an example of output; it was ran with cat, dog and human.

Columns:

#"Total_hits" means total of hits with evalue less than the one set through -e
#"ortho_hits" means POTENTIALLY orthologous elements = hits that contain at least 20% of the extracted flanking sequence
#Acess to individual copies through files in ./orthology.Cat.Dog.Cow.Human/DATA/*.extract/*.out.ortho.tab

For example:

  [Query_species]   Target_species#(with tree info):          
      (hg38 , ((canFam3 , felCat5) ,
      hg38 hg38 canFam3 canFam3 felCat5 felCat5
Rname [name] [Nb_extracted_seq] Total_hits Ortho_hits Total_hits Ortho_hits Total_hits Ortho_hits
Tigger1#DNA/TcMar-Tigger canFam3 2607 3173 0 3348 2173 1649 35

In this case you can see that most of the dog copies of Tigger1#DNA/TcMar-Tigger are recovered as potentially orthologous (it is never 100% because of how blast will work); human has none, and cat has so few they are probably false positives.

I used this to check whether TEs looked like dog/cat ones or more ancient than that. You'd have to decide on your cutoffs depending on the 'recovery rate' and how some TEs you know are shared behave (depends on quality of assemblies, divergence between species, etc).

adeoliveira86 commented 8 months ago

Hello @4ureliek,

thanks for sharing the output. This looks like something that could be extremely useful for the questions I am asking. Do you mind if I get in touch with you via email to discuss a little bit more what I am envisaging and have your feedback?

4ureliek commented 8 months ago

Sure thing - you can my gmail.com email, with prefix 4urelie.k