Combining dnaPipeTE and DeepTE outputs

clemgoub / dnaPipeTE

dnaPipeTE (for de-novo assembly & annotation Pipeline for Transposable Elements) is a pipeline designed to find, annotate and quantify Transposable Elements in small samples of NGS datasets. It is very useful for quantifying the proportion of TEs in newly sequenced genomes, since it does not require a genome assembly and works on small datasets (< 1X).


Combining dnaPipeTE and DeepTE outputs #62

Closed heidihyang closed 2 years ago

heidihyang commented 2 years ago

Hi Clément,

I wanted to increase the number of classified TEs from the dnaPipeTE sequences, so I used DeepTE per one of your suggestions in one of the other discussions. I want to combine the data from both programs, and I was wondering if you have any recommendations for doing so. Do I just need to run the blast section of the pipeline for all of the DeepTE sequences, since I would need the blast_reads.counts and other files to make the new graphs? Thanks in advance!

Best, Heidi

clemgoub commented 2 years ago

Hello Heidi!

The simplest is to:
1- Create a single library in which each contig keeps exactly one annotation (either from RepeatMasker or from DeepTE). You may have to decide on a rule for choosing which annotation to retain for a given contig. Make sure there are no duplicated contigs.
2- Make sure that the DeepTE annotations are in the RepeatMasker format. The headers in your new library should be of the form >TEname#Subclass/superfamily. The list of recognized Subclass/superfamily labels is in the file new_list_of_RM_superclass_colors_sortedOK in the dnaPipeTE folder, or you can find it here.
3- Use your newly annotated library in a new run of dnaPipeTE with the option for a custom library: RM_lib custom_library.fasta
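Steps 1 and 2 above could be sketched in Python roughly as follows. This is only an illustration, not part of dnaPipeTE: the function name build_library, the "RepeatMasker wins on conflict" rule, and the choice to drop contigs with no annotation are all assumptions, and the labels in the example (LTR/Gypsy, DNA/hAT) should be checked against new_list_of_RM_superclass_colors_sortedOK. Turning DeepTE's own output into Subclass/superfamily labels is left out, since that depends on your DeepTE output.

```python
def build_library(records, rm_labels, deepte_labels):
    """Build a FASTA library with headers of the form >TEname#Subclass/superfamily.

    records       -- iterable of (contig_name, sequence) pairs
    rm_labels     -- dict: contig_name -> "Subclass/superfamily" from RepeatMasker
    deepte_labels -- dict: contig_name -> "Subclass/superfamily" derived from DeepTE

    Assumed rule (hypothetical): the RepeatMasker label is kept when both exist.
    Contigs with no label from either tool are left out of the library.
    """
    seen = set()
    lines = []
    for name, seq in records:
        # Each contig may appear only once in the final library.
        if name in seen:
            raise ValueError(f"duplicated contig: {name}")
        seen.add(name)
        label = rm_labels.get(name) or deepte_labels.get(name)
        if label is None:
            continue
        lines.append(f">{name}#{label}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    contigs = [("contig1", "ACGT"), ("contig2", "GGCC"), ("contig3", "TTAA")]
    rm = {"contig1": "LTR/Gypsy"}
    deepte = {"contig1": "DNA/hAT", "contig2": "DNA/hAT"}
    print(build_library(contigs, rm, deepte), end="")
```

The resulting text can be written to custom_library.fasta. Swapping the two lookups in the label line would make DeepTE take priority instead.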

Let me know if you need more help!

Cheers,

Clément

heidihyang commented 2 years ago

Hi Clément,

Thank you for the info, sounds good! Would I combine it with the species lib in RepeatMasker? Also, DeepTE has some classifications that are MITEs of certain TEs (e.g. a DNA hAT MITE). How should I classify these in the correct format? One last question (sorry, kind of new to all of this): is there a way to bypass the Trinity step, since I already have contigs from the previous runs? I have limited computational allowance on my cluster and want to conserve it if possible. Thanks!

Best, Heidi

clemgoub commented 2 years ago

Hi Heidi!

No problem at all, I'm glad you reached out with questions!

You can indeed combine it with RepeatMasker's libraries (it can actually be interesting to compare a run with them and one without).
For a MITE to be labeled correctly in dnaPipeTE, its header needs to be of the form TENAME#MITE/MITE. To keep the DNA/hAT info, you may rename it TE_XX_DNA_hAT#MITE/MITE (as long as #MITE/MITE is present, it will be recognized as a MITE).
You should be able to bypass the Trinity step if it has run completely, by re-running the same command with the same output folder. Let me know if it doesn't do what you want!
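The MITE naming described above can be sketched as a small Python helper. This is only an illustration under assumptions: the function name mite_header and its three arguments are hypothetical, and the sketch assumes you have already split the DeepTE classification into a subclass and a superfamily yourself.

```python
def mite_header(te_name, subclass, superfamily):
    """Return a FASTA header that dnaPipeTE should recognize as a MITE.

    The original subclass/superfamily (e.g. DNA, hAT) is kept in the TE name,
    and the part after '#' is always MITE/MITE.
    """
    return f">{te_name}_{subclass}_{superfamily}#MITE/MITE"


if __name__ == "__main__":
    print(mite_header("TE_XX", "DNA", "hAT"))
```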

Cheers,

Clément