dontkme / EAHelitron

Easy to Annotate Helitrons Unix-like command line.
MIT License

How to prepare RepeatMasker lib using EAHelitron results? #6

Open xiekunwhy opened 2 years ago

xiekunwhy commented 2 years ago

Hi,

Are you familiar with RepeatMasker? How can I prepare a RepeatMasker library from EAHelitron results?

Best, Kun

dontkme commented 2 years ago

Hi,

I think we could merge the GFF files from EAHelitron and RepeatMasker. Alternatively, we can use the 3' or 5' end FASTA files exported by EAHelitron to build the RepeatMasker library. This link may be helpful: https://www.biostars.org/p/286992/

Best, Kaining
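For readers who want to try the GFF-merging suggestion, here is a minimal sketch of one way to do it. It assumes both tools produce tab-separated, 9-column GFF feature lines; the function name `merge_gff` and the demo rows (including their source and attribute values) are made up for illustration and are not taken from either tool's real output.

```python
import io

def merge_gff(*streams):
    """Merge feature lines from several GFF streams, sorted by sequence, start, end."""
    features = []
    for stream in streams:
        for line in stream:
            line = line.rstrip("\n")
            # Skip blank lines and comment/pragma lines such as "##gff-version".
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            if len(cols) < 9:
                continue  # not a complete 9-column feature line
            features.append((cols[0], int(cols[3]), int(cols[4]), line))
    features.sort(key=lambda f: f[:3])
    return [f[3] for f in features]

# Hypothetical demo rows; real files would be opened with open(path).
ea = io.StringIO("##gff-version 3\nChr1\tEAHelitron\thelitron\t500\t900\t.\t+\t.\tID=demo1\n")
rm = io.StringIO("Chr1\tRepeatMasker\tsimilarity\t100\t300\t12.5\t-\t.\tTarget=demo\n")
for row in merge_gff(ea, rm):
    print(row)
```

The sketch only interleaves the two annotation sets by coordinate; it does not resolve overlapping features between them.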

xiekunwhy commented 2 years ago

Hi Kaining,

Thank you for your reply, but I don't think it solves my problem yet. Do you mean we can use the *.3.txt or *.5.txt files to build a RepeatMasker library? All the sequences in these two files are very short, whereas the Helitron-related sequences in Repbase are much longer. I have attached the athrep.ref file (athrep.zip); you can check Helitron-related entries such as >ATREP6 Helitron, >Helitron-5_AT Helitron, and >HELITRONY1C Helitron.

Best, Kun

dontkme commented 2 years ago

Hi Kun,

I think this depends on the minimum length allowed by the search engine you use in RepeatMasker (e.g. rmblast). If you are interested, you can try using the shortest possible full-length Helitron sequence for each 3' end (from *.5.fa) as the library.

Best, Kaining
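The "shortest full-length Helitron per 3' end" selection could be sketched roughly as below. How the 3' end is encoded in the *.5.fa headers is not stated in this thread, so the grouping function is left to the caller; the `end|candidate` header layout in the demo is purely hypothetical, and real headers should be inspected before adapting it.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def shortest_per_group(records, group_key):
    """Keep the shortest sequence among records that share group_key(header)."""
    best = {}
    for header, seq in records:
        key = group_key(header)
        if key not in best or len(seq) < len(best[key][1]):
            best[key] = (header, seq)
    return list(best.values())

# Hypothetical headers: text before "|" stands in for the 3' end identifier.
demo = [">end1|cand1", "ACGTACGTAC", ">end1|cand2", "ACGT", ">end2|cand1", "TTGCA"]
for header, seq in shortest_per_group(read_fasta(demo), lambda h: h.split("|")[0]):
    print(f">{header}\n{seq}")
```

Passing the grouping rule as a function keeps the selection logic independent of whatever header convention the real files use.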

xiekunwhy commented 2 years ago

Hi Kaining,

I will try the 3' ends, the 5' ends, and *.5.fa in turn. In some genomes, the number of possible full-length Helitrons is too small.

Best, Kun

dontkme commented 2 years ago

Hi Kun,

The number of possible full-length Helitrons is related to your -u setting. If it is not long enough to recognize both the 5' and 3' ends, EAHelitron will not output full-length records.

Thanks, Kaining

xiekunwhy commented 2 years ago

Hi Kaining,

Thank you for your reply. The question now is how to choose a reasonable -u value.

I am now annotating a plant genome (~1 Gb). There are 977 sequences in the .5.fa file with -u 3000, and 1819 with -u 5000. With an extremely large -u value (50000), the .5.fa file contains 24726 sequences, far more than the number of records in the *.bed file (16317 lines). So the number of false positives keeps growing as -u increases, right?

And is there a way to choose a reasonable -u value?

Best, Kun

dontkme commented 2 years ago

Hi Kun,

No, the -u option does not affect the false positive rate, which is calculated from the 3' end; -u only determines how much sequence upstream of the 3' end is searched to predict the 5' end. EAHelitron reports every 5'-end-to-3'-end sequence within the -u region, so a single 3' end may have multiple 5' ends. You can therefore use a large -u setting and choose the shortest full-length Helitron for each 3' end as the library. I think you can use the longest reported Helitron length in your plant genome as the -u value.

Thanks, Kaining
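One possible reading of the last suggestion is to take the longest sequence in a FASTA file of Helitrons already reported for the genome and use its length as -u. The sketch below assumes such a FASTA file is available; `fasta_lengths` and the demo records are illustrative only, and records with identical headers would be summed together.

```python
def fasta_lengths(lines):
    """Return {header: sequence length} for an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = line[1:]
            lengths.setdefault(current, 0)
        elif current is not None:
            # Sequences may be wrapped over several lines, so accumulate.
            lengths[current] += len(line)
    return lengths

# Made-up records; a real run would pass an open file handle instead.
demo = [">h1", "ACGTAC", "GT", ">h2", "ACG"]
lengths = fasta_lengths(demo)
suggested_u = max(lengths.values())
print(suggested_u)  # 8
```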