There are always too many LTRs no matter what sequences I provided.

xiekunwhy commented 2 years ago

Hi,

I found that there are always too many LTRs no matter what sequences I provided.

For example, I tried 1) HelitronScanner (https://sourceforge.net/projects/helitronscanner/) results as input,

2) EAHelitron results as input (EAHelitron.5.fa), and 3) Tirvish (http://genometools.org/tools/gt_tirvish.html) results as input (repeat_region features extracted from the Tirvish GFF3).

As you can see, there are always too many LTRs. What do you think might cause this problem, and how can it be solved?

Best, Kun

DerKevinRiehl commented 2 years ago

Dear Kun, thank you very much for your interest in our tool.

When an annotation tool like "Helitronscanner" reports helitrons, there are three possibilities for each of its findings:

It could be that the tool annotated a sequence that is not a transposon at all
It could be that the tool annotated a transposon, but not necessarily a helitron
It could be that the tool annotated a helitron transposon

Our classifier "RFSB" is a classification model that, as shown in our publication, outperforms existing classification approaches. It was trained on TransposonDB, one of the largest transposon databases available (see our publication).

Now if you apply RFSB to annotations from a given tool like Helitronscanner...

It could be that you ask RFSB to classify a sequence that is not a transposon. In that case, the predicted class may be junk. RFSB was not built to decide whether a sequence is a transposon or not; it was built to classify a sequence according to a taxonomic scheme, under the assumption that the sequence is a transposon.
Helitronscanner and the many other tools out there definitely also make mistakes in their annotations. Even when they annotate real transposons, it is possible that some of them are not what the tool assumes them to be, and in those cases RFSB disagrees with the tool.
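
One way to see where the two disagree is to tabulate, per sequence, the upstream tool's label against the class RFSB predicts. The following is only a minimal sketch: it assumes you have already parsed both outputs into (tool_label, rfsb_class) pairs yourself, the function name `cross_tabulate` is hypothetical, and the pairs shown are made-up placeholders, not real results.

```python
from collections import Counter


def cross_tabulate(pairs):
    """Count how often each (tool_label, predicted_class) combination occurs."""
    return Counter(pairs)


# Made-up placeholder pairs for illustration only (not real tool output).
pairs = [
    ("Helitron", "LTR"),
    ("Helitron", "Helitron"),
    ("Helitron", "LTR"),
    ("TIR", "LTR"),
]

table = cross_tabulate(pairs)

# Print one row per (tool label, predicted class) combination.
for (tool_label, predicted), n in sorted(table.items()):
    print(f"{tool_label}\t{predicted}\t{n}")
```

A table like this does not tell you which side is right, only where the upstream annotation and the classifier diverge, so the divergent sequences can be inspected by hand.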

We did some interesting analysis on this in Supplementary Figures S4 and S5; please check it out.

Please find more information in our publication.

I hope this helps a little; please let me know if you have more questions. Best regards, Kevin

xiekunwhy commented 2 years ago

Hi Kevin,

I just want to construct a high-quality custom RepeatMasker library and remove as many false-positive sequences as I can, so I tried using RFSB to classify the sequences first. But based on the results of my tests, I don't think I can rely on RFSB that much, especially for helitron repeats.

Do you have any suggestions for constructing a high-quality custom RepeatMasker library?

Best, Kun

DerKevinRiehl commented 2 years ago

Hi Kun, ah I understand.

Unfortunately, I don't have another suggestion. The only thing I can do is emphasize that RepeatMasker is not necessarily the most trustworthy source / tool we have, so I would not believe every result that RepeatMasker produces.

Hope this helps. Best, Kevin

xiekunwhy commented 2 years ago

Hi Kevin,

I agree with you, but reviewers and my supervisor do not, because there are no papers saying that RepeatMasker is not a reliable tool.

Best, Kun

DerKevinRiehl commented 2 years ago

Hi Kun, there is also no paper that says RepeatMasker is always right. :-p

Best, Kevin

soungalo commented 1 year ago

I have to agree with @xiekunwhy - I just ran RFSB on all A. thaliana repeat sequences as annotated in Ensembl. The output is entirely LTRs (mostly Gypsy, some Copia). In contrast, the Ensembl annotation contains other repeat types:

 217253 Low_complexity_regions
  52694 LTRs
   2999 Other_repeats
    215 RNA_repeats
   2002 Satellite_repeats
  40526 Tandem_repeats
  10321 Type_II_Transposons
   3561 Type_I_Transposons_LINE
    772 Type_I_Transposons_SINE
  30536 Unknown
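
For scale, the counts above can be turned into shares of the total. A small sketch using only the numbers from the listing (the variable names are mine); by this arithmetic LTRs make up roughly 14.6% of the Ensembl repeat features:

```python
# Counts copied from the Ensembl listing above.
ensembl_counts = {
    "Low_complexity_regions": 217253,
    "LTRs": 52694,
    "Other_repeats": 2999,
    "RNA_repeats": 215,
    "Satellite_repeats": 2002,
    "Tandem_repeats": 40526,
    "Type_II_Transposons": 10321,
    "Type_I_Transposons_LINE": 3561,
    "Type_I_Transposons_SINE": 772,
    "Unknown": 30536,
}

total = sum(ensembl_counts.values())

# Fraction of all annotated repeat features that falls in each category.
shares = {name: count / total for name, count in ensembl_counts.items()}

# Print categories from most to least frequent.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s}{share:7.1%}")
```

Note that many of these categories (low-complexity regions, tandem and satellite repeats) are not transposons at all.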

I get it that the Ensembl annotation pipeline is not necessarily very accurate (it depends mainly on RepeatMasker and dust), but it's still weird that 100% of the repeats are annotated as LTRs.
Any idea what's happening here?

DerKevinRiehl / transposon_classifier_rfsb

There are always too many LTRs no matter what sequences I provided. #3