reasonaTE found fewer TEs than when running RepeatMasker alone & how to generate a soft-masked genome?

WDBS369 commented 1 year ago

Hi Kevin,

Thanks for sharing such a great TE annotation tool!

I have two questions regarding the reasonTE pipeline.

1) Why does reasonaTE find fewer TEs than RepeatMasker? I have been running the reasonaTE pipeline on a couple of genomes. Compared with previous publications that relied predominantly on RepeatMasker results (this one for instance), reasonaTE reported a lower TE abundance. With RepeatMasker alone, I used a custom library of RepeatModeler output + Repbase, which masked 28.80% of the genome (~140 Mbp), similar to the publication. Yet with reasonaTE, the sequences in "FinalAnnotations_TransposonSequences.fasta" from "finalResults" total only ~90 Mbp. The number of TEs found by reasonaTE (58106) is also much smaller than that from RepeatMasker (~400000). Even SINEs and LINEs are much less abundant, both in number and in total length.

Do you have any idea why this happens? I thought reasonaTE would find more TEs, as it incorporates more TE annotation tools. I attached both the RepeatMasker and reasonaTE results below for your convenience.

RepeatMasker output:

reasonaTE output (Statistics_FinalAnnotations.txt):

2) Can reasonaTE produce a soft-masked (lower-case) genome? I plan to use soft-masked genomes as input to BRAKER2 for gene annotation. If reasonaTE can do this, that would be fantastic. If not, can I use RepeatMasker with "FinalAnnotations_TransposonSequences.fasta" as the reference library to soft-mask the genome?

Thanks in advance and have a wonderful day!

DerKevinRiehl commented 1 year ago

Dear Chan Liu, thank you very much for your interest in TransposonUltimate :-).

1) Well, it depends on how you run RepeatMasker (which input parameters and which RepeatMasker installation you use). The pipeline calls RepeatMasker (the one installed via the conda environment, with very basic / generic parameters). I guess that when you install RepeatMasker yourself (probably not via conda) and run it with the parameters described in that paper, it can find more results. What reasonaTE does is combine the knowledge of many different tools. So you could aggregate your RepeatMasker results with those from many other tools. The big advantage is that you can see whether any of the other tools has findings similar to the ones you produce with RepeatMasker.

I hope this answers the question a little? Please let me know in case of further questions.

2) I am not quite sure what you mean by soft-masking. If you refer to soft-masking as masking with software, and hard-masking as manual masking, then yes, reasonaTE can do this. FinalAnnotations_TransposonMask.gff3 is the transposon mask, and FinalAnnotations_Transposons.gff3 contains the transposon annotations. The file you mentioned is just a FASTA file containing all sequences considered transposons by reasonaTE.

Hope this helps. Looking forward to hearing back from you. Best, Kevin

WDBS369 commented 1 year ago

Hi Kevin,

Thanks for the quick and kind response!

For question 1, the RepeatMasker part, I actually did use the conda-installed RepeatMasker and ran it through reasonaTE. My process was as follows:

[1] I ran the reasonaTE pipeline with most of the tools, except for RepeatMasker.
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool helitronScanner
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool ltrHarvest
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool mitefind
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool mitetracker
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool must
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool repeatmodel
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool sinefind
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool sinescan
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool tirvish
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool transposonPSI
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool NCBICDD1000

[2] Then, I ran RepeatMasker with RepeatModeler's output + the Repbase database as a custom library.
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool repMasker xxxxx -pa 12 -lib ArthropodaRepbase_Repeatmodeler.lib

[3] Finally, I parsed the annotations, ran the final pipeline, and got the stats, as per your reasonaTE pipeline.
reasonaTE -mode parseAnnotations -projectFolder workspace -projectName testProject
reasonaTE -mode pipeline -projectFolder workspace -projectName testProject
reasonaTE -mode statistics -projectFolder workspace -projectName testProject

But I found that the final statistics report fewer TEs than the previous study, which mainly used RepeatMasker & RepeatModeler. So I checked the RepeatMasker output from step [2] above and found that it was pretty similar to the publication's results. It is only after parsing & combining the annotations from the different tools that the amount of reported TEs becomes much smaller than from RepeatMasker alone.

Now, after thinking about it again, do you think it is because reasonaTE uses a different classifier (RFSB) and keeps only TEs with certain features (like structural & protein features), which results in fewer TEs being reported?

For question 2, I am sorry for not clarifying what soft-masking is. Soft-masking is the conversion of sequence from upper case to lower case (e.g., ATG to atg) so that gene annotation tools such as BRAKER2 can ignore the converted sequences (which are usually TEs and low-complexity sequences) to reduce false positives. So the final product I want is a FASTA file of the genome in which all TE sequences are converted from upper case to lower case. BTW, hard-masking is the conversion of sequence to Ns (e.g., ATG to NNN). With hard-masking, the target sequences lose all their information, which is why I want soft-masking instead of hard-masking.
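To make the distinction concrete, here is a minimal Python sketch of soft- versus hard-masking applied to a plain string. The function name `mask_sequence`, the toy sequence, and the interval list are all made up for illustration, and the intervals are assumed to be 0-based and half-open (BED-style); this is not part of reasonaTE.

```python
def mask_sequence(seq, intervals, soft=True):
    """Mask 0-based, half-open [start, end) intervals of a sequence.

    soft=True  -> masked bases are lower-cased (information is kept)
    soft=False -> masked bases are replaced by 'N' (information is lost)
    """
    chars = list(seq)
    for start, end in intervals:
        # Clip the interval to the sequence so bad coordinates cannot crash us.
        for i in range(max(0, start), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


# Hypothetical toy data: two "TE" intervals on a 12-base sequence.
genome = "GGATGCCATGAA"
te_intervals = [(2, 5), (7, 10)]

soft_masked = mask_sequence(genome, te_intervals, soft=True)
hard_masked = mask_sequence(genome, te_intervals, soft=False)

print(soft_masked)  # GGatgCCatgAA
print(hard_masked)  # GGNNNCCNNNAA
```

Upper-casing the soft-masked string gives back the original sequence, whereas the hard-masked string cannot be restored.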

Thanks so much!

DerKevinRiehl commented 1 year ago

Dear Chan,

For question 1) I think what you say sounds plausible. I think the reason why you end up with fewer TEs reported is that many tools, including RepeatMasker and RepeatModeler, annotate transposons multiple times, so their annotations can overlap quite a lot. Our software uses multiple procedures to reduce the number of transposons, e.g. by "dropping" and "merging" duplicated annotations (it's a little more complicated though). It can be worth comparing their annotations and ours visually in a viewer. You will see that many duplicates exist. Other than that, feel free to combine the outputs of the various tools in a different manner. There is a file in the output folder "parsedAnnotations" that contains all annotations from all tools: https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE/tree/main/workspace/testProject/parsedAnnotations

For question 2) Well, there is no such functionality in reasonaTE, but if you are able to write and run a Python script yourself, give this a try. Maybe I will find some time in the next few days to write one for you. Would you like one?
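As a rough illustration of why overlapping annotations inflate counts, the following Python sketch merges overlapping or touching intervals per sequence and compares counts and covered bases before and after. It is a toy under stated assumptions, not reasonaTE's actual "dropping" and "merging" procedure (which the answer above says is more complicated); the function name `merge_annotations` and the example hits are hypothetical, and coordinates are assumed to be 0-based and half-open.

```python
from collections import defaultdict


def merge_annotations(annotations):
    """Merge overlapping or touching (seq_id, start, end) intervals per sequence."""
    by_seq = defaultdict(list)
    for seq_id, start, end in annotations:
        by_seq[seq_id].append((start, end))

    merged = []
    for seq_id in sorted(by_seq):
        current = None
        for start, end in sorted(by_seq[seq_id]):
            if current is None:
                current = [start, end]
            elif start <= current[1]:
                # Overlaps (or touches) the running interval: extend it.
                current[1] = max(current[1], end)
            else:
                merged.append((seq_id, current[0], current[1]))
                current = [start, end]
        if current is not None:
            merged.append((seq_id, current[0], current[1]))
    return merged


# Hypothetical hits, e.g. the same element reported several times.
hits = [
    ("chr1", 100, 500),
    ("chr1", 120, 480),
    ("chr1", 450, 700),
    ("chr1", 900, 1000),
    ("chr2", 10, 50),
    ("chr2", 10, 50),
]

merged = merge_annotations(hits)
raw_bp = sum(end - start for _, start, end in hits)
merged_bp = sum(end - start for _, start, end in merged)

print(len(hits), "raw annotations covering", raw_bp, "bp (double-counted)")
print(len(merged), "merged annotations covering", merged_bp, "bp")
```

In this toy case six raw hits collapse to three merged intervals. Whether redundancy alone accounts for the difference reported in this thread cannot be told from a sketch like this; it would need to be checked against the files in "parsedAnnotations".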

Best, Kevin

WDBS369 commented 1 year ago

Hello Kevin,

Thank you so much for being so patient and kind the whole time! I really appreciate it!

I think there are some tools out there that can do the soft-masking, so I will give those a try first. If I can't find a solution, I will definitely come back and ask for your help!

Best, Chan

DerKevinRiehl commented 1 year ago

Hey Chan, I just found that bedtools could be used for your problem. https://bedtools.readthedocs.io/en/latest/content/tools/maskfasta.html

You can input a FASTA file (the genome) and the annotations (a BED file), and it will either hard-mask or soft-mask your genome and output it as a FASTA file. The only thing you need to do is convert my annotations from GFF3 to BED format. You could use a tool like this: https://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/gff2bed.html or this: https://www.biostars.org/p/321562/
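For readers who want to see what a GFF3-to-BED conversion amounts to, here is a minimal Python sketch. It assumes spec-conforming GFF3 input (9 tab-separated columns, 1-based inclusive coordinates) and emits 6-column BED lines (0-based, half-open). The function name `gff3_to_bed`, the use of the feature type as the BED name, the placeholder score of 0, and the example records are all illustrative choices, not what the linked tools do internally.

```python
def gff3_to_bed(gff_lines):
    """Yield BED6 lines for GFF3 feature lines (assumes 1-based inclusive GFF3)."""
    for line_number, line in enumerate(gff_lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no more features follow
        if not line.strip() or line.startswith("#"):
            continue  # blank lines, comments, and directives
        fields = line.split("\t")
        if len(fields) < 9:
            raise ValueError(f"line {line_number}: expected 9 tab-separated columns")
        seq_id, _source, feature_type, start, end, _score, strand = fields[:7]
        start, end = int(start), int(end)
        if start < 1 or end < start:
            # A start of 0 is not valid in standard GFF3; refuse to guess.
            raise ValueError(f"line {line_number}: invalid 1-based range {start}-{end}")
        # 1-based inclusive -> 0-based half-open: shift the start, keep the end.
        yield "\t".join([seq_id, str(start - 1), str(end), feature_type, "0", strand])


# Hypothetical records for illustration only.
example_gff3 = [
    "##gff-version 3",
    "chr1\tdemo\ttransposable_element\t1\t100\t.\t+\t.\tID=te1",
    "chr1\tdemo\ttransposable_element\t251\t400\t.\t-\t.\tID=te2",
]

bed_lines = list(gff3_to_bed(example_gff3))
for bed_line in bed_lines:
    print(bed_line)
```

The sketch deliberately raises an error on a start position below 1 rather than converting it, because such a file does not follow the 1-based convention the conversion relies on.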

Hope this helps, please let me know if you were able to do it. Thanks, Best, Kevin

WDBS369 commented 1 year ago

Hi Kevin,

Thanks for letting me know! I have also found that Bedtools was well-suited for my study and have been working on it. So far, the results seem pretty good.

Best, Chan

WDBS369 commented 1 year ago

Hi Kevin,

Do your GFF3 files use a zero-based coordinate system, i.e. one where the first base of a sequence has position 0 instead of 1? I am asking because standard GFF3 files are one-based (the first base of a sequence has start position 1), whereas BED files are normally zero-based. For example, for the sequence ATCG:

zero-based: A=0, T=1, C=2, G=3
one-based: A=1, T=2, C=3, G=4

The conversion between GFF3 and BED is mainly a transformation of the coordinate system (from one-based to zero-based). Therefore, if your GFF3 files already use zero-based coordinates, there is no need to do the BED conversion in my case.

I suspect this because, when I tried to convert some of your GFF3 files to BED, I got errors suggesting that the start coordinate is problematic. I looked into it and found that some of the start positions are 0 instead of 1 (please see pic attached below).
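A check like the one described here can be sketched in a few lines of Python: scan the start column (column 4) of a GFF3 file and count records whose start is 0. The function name `summarize_gff3_starts` and the example records are hypothetical. Note the limits of the check: finding a start of 0 shows that the file does not follow the standard 1-based convention, but finding none does not prove the file is 1-based, and the check says nothing about how end coordinates should be interpreted.

```python
def summarize_gff3_starts(gff_lines):
    """Count feature records, records with start == 0, and the smallest start seen."""
    n_records = 0
    n_zero_starts = 0
    min_start = None
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no more features follow
        if not line.strip() or line.startswith("#"):
            continue  # blank lines, comments, and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a feature line
        start = int(fields[3])  # raises ValueError on a malformed start column
        n_records += 1
        if start == 0:
            n_zero_starts += 1
        if min_start is None or start < min_start:
            min_start = start
    return {"records": n_records, "zero_starts": n_zero_starts, "min_start": min_start}


# Hypothetical records for illustration only.
example_lines = [
    "##gff-version 3",
    "chr1\tdemo\tTE\t0\t120\t.\t+\t.\tID=a",
    "chr1\tdemo\tTE\t300\t450\t.\t-\t.\tID=b",
    "chr2\tdemo\tTE\t0\t80\t.\t+\t.\tID=c",
]

summary = summarize_gff3_starts(example_lines)
print(summary)
```

To run it on a real file, pass an open file handle instead of the list, e.g. `with open(path) as handle: summarize_gff3_starts(handle)`, where `path` is whichever GFF3 file you want to inspect.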

Thanks and have a great one!

Best, Chan

DerKevinRiehl commented 1 year ago

Dear Chan, you are right, my annotations are already zero-based.

Did everything work for you? Best, Kevin

WDBS369 commented 1 year ago

Dear Kevin,

Thanks for the confirmation! And yes, everything has worked perfectly.

Best, Chan

DerKevinRiehl commented 1 year ago

Great, then have a nice day and please do not hesitate to contact me again if you have any further issues :-)

DerKevinRiehl / transposon_annotation_reasonaTE, issue #15