DerKevinRiehl / transposon_classifier_rfsb

Transposon classification tools RFSB, part of TransposonUltimate
GNU General Public License v3.0

There are always too many LTRs no matter what sequences I provided. #3

Open xiekunwhy opened 2 years ago

xiekunwhy commented 2 years ago

Hi,

I found that RFSB always reports too many LTRs, no matter what sequences I provide.

For example:

1) helitronscanner (https://sourceforge.net/projects/helitronscanner/) results as input [image]
2) EAHelitron results as input (EAHelitron.5.fa) [image]
3) Tirvish (http://genometools.org/tools/gt_tirvish.html) results as input (repeat_region features extracted from the tirvish gff3) [image]

As you can see, there are always too many LTRs. What do you think causes this problem, and how can it be solved?

Best, Kun

DerKevinRiehl commented 2 years ago

Dear Kun, thank you very much for your interest in our tool.

When an annotation tool like "Helitronscanner" finds helitrons, there are several problems with its findings:

Our classifier "RFSB" is a classification model that, as shown in our publication, outperforms existing classification approaches. It was trained on TransposonDB, one of the largest transposon databases available.

Now if you apply RFSB to annotations from a given tool like Helitronscanner...

We did some interesting analysis in Supplementary Figure S4 and S5 on this, please check it out.

Please find more information in our publication.

I hope this helps a little. Please let me know if you have more questions. Best regards, Kevin

xiekunwhy commented 2 years ago

Hi Kevin,

I just want to construct a high-quality custom RepeatMasker library and remove as many false-positive sequences as I can, so I tried using RFSB to classify the sequences first. But based on the results of my tests, I don't think I can rely on RFSB that much, especially for helitron repeats.

Do you have any suggestions for constructing a high-quality custom RepeatMasker library?

Best, Kun

DerKevinRiehl commented 2 years ago

Hi Kun, ah I understand.

Unfortunately, I don't have another suggestion. The only thing I can do is emphasize that RepeatMasker is not necessarily the most trustworthy source or tool we have, so I would not believe every result that RepeatMasker produces.

Hope this could help, Best, Kevin

xiekunwhy commented 2 years ago

Hi Kevin,

I agree with you, but reviewers and my supervisor do not, because no paper says that RepeatMasker is an unreliable tool.

Best, Kun

DerKevinRiehl commented 2 years ago

Hi Kun, there is also no paper that says RepeatMasker is always right. :-p

Best, Kevin

soungalo commented 1 year ago

I have to agree with @xiekunwhy - I just ran RFSB on all A. thaliana repeat sequences as annotated in Ensembl. The output is entirely LTRs (mostly Gypsy, some Copia). In contrast, the Ensembl annotation contains other repeat types:

 217253 Low_complexity_regions
  52694 LTRs
   2999 Other_repeats
    215 RNA_repeats
   2002 Satellite_repeats
  40526 Tandem_repeats
  10321 Type_II_Transposons
   3561 Type_I_Transposons_LINE
    772 Type_I_Transposons_SINE
  30536 Unknown

I get that the Ensembl annotation pipeline is not necessarily very accurate (it depends mainly on RepeatMasker and dust), but it's still weird that 100% of the repeats are classified as LTRs by RFSB.
Any idea what's happening here?
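
To put the mismatch in numbers, here is a minimal Python sketch that turns the Ensembl counts listed above into per-class shares. The counts are copied from the list in this comment. The variable and function names (`ensembl_counts`, `class_shares`) are just illustrative and are not part of RFSB or Ensembl.

```python
from collections import Counter

# Counts copied from the Ensembl A. thaliana repeat annotation listed above.
ensembl_counts = Counter({
    "Low_complexity_regions": 217253,
    "LTRs": 52694,
    "Other_repeats": 2999,
    "RNA_repeats": 215,
    "Satellite_repeats": 2002,
    "Tandem_repeats": 40526,
    "Type_II_Transposons": 10321,
    "Type_I_Transposons_LINE": 3561,
    "Type_I_Transposons_SINE": 772,
    "Unknown": 30536,
})


def class_shares(counts):
    """Return the fraction of all annotated repeats that falls in each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = class_shares(ensembl_counts)

# Print the classes from most to least frequent.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<26}{share:6.1%}")
```

By these counts, LTRs are about 14.6% of the Ensembl annotation, whereas the RFSB output for the same sequences was entirely LTRs. Running the same function on a tally of the RFSB labels would allow a side-by-side comparison.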