CSU-KangHu / HiTE

High-precision TE Annotator
GNU General Public License v3.0

The annotations in Gff result file are labeled as RepeatMasker, not HiTE #31

Open wilson1990D opened 4 days ago

wilson1990D commented 4 days ago

CP072753.1 RepeatMasker similarity 1 77 2.7 + . Target "Motif:(AACCCT)n" 1 76
CP072753.1 RepeatMasker similarity 4121 4244 0.0 + . Target "Motif:(CCTAAC)n" 1 130
CP072753.1 RepeatMasker similarity 4245 4853 21.0 - . Target "Motif:chr_0:1750204..1757700_INT" 1 618
CP072753.1 RepeatMasker similarity 4871 5176 33.0 + . Target "Motif:chr_5:364366..379047_LTR" 17 328
CP072753.1 RepeatMasker similarity 5169 5418 28.8 + . Target "Motif:chr_0:1074968..1082160_LTR" 89 335
CP072753.1 RepeatMasker similarity 7891 7941 16.1 + . Target "Motif:GA-rich" 1 55
CP072753.1 RepeatMasker similarity 8046 8227 30.2 + . Target "Motif:chr_0:1124664..1137321_INT" 22 203
CP072753.1 RepeatMasker similarity 8369 8446 19.5 + . Target "Motif:TIR_19" 6688 6765

Thank you very much to the author for this excellent software! When I ran HiTE, I obtained a HiTE.gff file with the results shown above. However, they do not match what I saw in the user manual. There are two specific issues:

1. The annotations in my GFF result file are labeled as RepeatMasker, not HiTE.
2. Many results are labeled as simple repeats, which seems unusual and is not shown in the manual's example.

Here is the command I used to run HiTE:

main.py --genome genome.fa --thread 30 --outdir try-intact --intact_anno 1 --annotate 1

CSU-KangHu commented 3 days ago

Hi @wilson1990D,

When you specify --intact_anno 1 and --annotate 1, the program generates two annotation files: one for intact TEs (HiTE_intact.sorted.gff3) and another for all TE annotations, including intact TEs, fragmented TEs, and simple repeats (HiTE.gff).

The latter file (HiTE.gff) is generated using the TE library produced by HiTE as input for RepeatMasker, so its annotations are labeled as "RepeatMasker," not "HiTE." Additionally, it may include many simple repeats from RepeatMasker's built-in library.

If you prefer to exclude simple repeats from the annotation file, you can modify the RepeatMasker_command in HiTE/module/annotate_genome.py by adding the parameters -no_is, -norna, and -nolow.

You can check the HiTE_intact.sorted.gff3 file, which is labeled as "HiTE" and contains only intact TE annotations, as shown in the user manual.
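
As an alternative to editing the RepeatMasker command, simple repeats could also be filtered out of HiTE.gff after the run. The following is a minimal sketch, not part of HiTE: the function name `drop_simple_repeats` is hypothetical, and the pattern assumes that simple repeats appear as Target "Motif:(UNIT)n" and low-complexity regions as Target "Motif:X-rich", as in the excerpt posted at the top of this thread. Other RepeatMasker versions may name these hits differently.

```python
import re

# Assumption: simple repeats look like Motif:(AACCCT)n and low-complexity
# hits look like Motif:GA-rich, as seen in the excerpt from this thread.
SIMPLE_OR_LOW = re.compile(r'Motif:(\([ACGTN]+\)n|[ACGT]+-rich)"')


def drop_simple_repeats(lines):
    """Return the GFF lines that are not simple-repeat or low-complexity hits.

    Comment lines (starting with '#') are always kept.
    """
    kept = []
    for line in lines:
        if line.startswith("#") or not SIMPLE_OR_LOW.search(line):
            kept.append(line)
    return kept


# Records copied from the excerpt in this thread.
example = [
    'CP072753.1 RepeatMasker similarity 1 77 2.7 + . Target "Motif:(AACCCT)n" 1 76',
    'CP072753.1 RepeatMasker similarity 4245 4853 21.0 - . Target "Motif:chr_0:1750204..1757700_INT" 1 618',
    'CP072753.1 RepeatMasker similarity 7891 7941 16.1 + . Target "Motif:GA-rich" 1 55',
    'CP072753.1 RepeatMasker similarity 8369 8446 19.5 + . Target "Motif:TIR_19" 6688 6765',
]

for record in drop_simple_repeats(example):
    print(record)
```

To apply this to a real file, the lines of HiTE.gff can be passed in place of `example`. Filtering after the fact leaves the original RepeatMasker output untouched, so the unfiltered annotations remain available.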

wilson1990D commented 2 days ago

CP072753.1 HiTE TIR 223653 224024 . + . id=te_intact_233;name=TIR_14;classification=Unknown;tir=NA;tsd=CTTAG;tsd_len=5
CP072753.1 HiTE TIR 478070 478854 . - . id=te_intact_333;name=TIR_2;classification=DNA/MULE;tir=1-80,706-785;tir_identity=0.8625;tsd=ATCAATCGA;tsd_len=9
CP072753.1 HiTE TIR 604187 605645 . + . id=te_intact_64;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=CA;tsd_len=2
CP072753.1 HiTE TIR 643598 650217 . - . id=te_intact_246;name=TIR_12;classification=Unknown;tir=NA;tsd=TTTT;tsd_len=4
CP072753.1 HiTE TIR 652032 653511 . - . id=te_intact_95;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=TA;tsd_len=2
CP072753.1 HiTE TIR 804304 805743 . + . id=te_intact_65;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=ATT;tsd_len=3
CP072753.1 HiTE TIR 835014 836442 . - . id=te_intact_94;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=AAT;tsd_len=3
CP072753.1 HiTE TIR 882837 889423 . + . id=te_intact_238;name=TIR_12;classification=Unknown;tir=NA;tsd=AA;tsd_len=2
CP072753.1 HiTE TIR 906647 908135 . + . id=te_intact_66;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=TA;tsd_len=2
CP072753.1 HiTE TIR 938081 939551 . + . id=te_intact_67;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=ATT;tsd_len=3
CP072753.1 HiTE TIR 959216 960696 . + . id=te_intact_68;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=ATT;tsd_len=3
CP072753.1 HiTE TIR 973746 975246 . + . id=te_intact_69;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=TACGGA;tsd_len=6
CP072753.1 HiTE TIR 985959 987439 . + . id=te_intact_70;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=ATT;tsd_len=3
CP072753.1 HiTE TIR 1046982 1048498 . + . id=te_intact_71;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=AT;tsd_len=2
CP072753.1 HiTE TIR 1060390 1061766 . + . id=te_intact_219;name=TIR_4;classification=DNA/TcMar;tir=NA;tsd=AC;tsd_len=2
CP072753.1 HiTE TIR 1076208 1077660 . + . id=te_intact_72;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=TC;tsd_len=2
CP072753.1 HiTE TIR 1093211 1099797 . + . id=te_intact_239;name=TIR_12;classification=Unknown;tir=NA;tsd=TA;tsd_len=2
CP072753.1 HiTE TIR 1154723 1156154 . - . id=te_intact_93;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=TATA;tsd_len=4
CP072753.1 HiTE Helitron 1186805 1187859 . + . id=te_intact_1;name=Helitron_1;classification=RC/Helitron;hairpin_loop=NA
CP072753.1 HiTE TIR 1201653 1208017 . + . id=te_intact_285;name=TIR_5;classification=RC/Helitron;tir=NA;tsd=TA;tsd_len=2
CP072753.1 HiTE TIR 1230062 1236429 . + . id=te_intact_286;name=TIR_5;classification=RC/Helitron;tir=NA;tsd=AT;tsd_len=2
CP072753.1 HiTE TIR 1266567 1273155 . - . id=te_intact_245;name=TIR_12;classification=Unknown;tir=NA;tsd=TTAT;tsd_len=4
CP072753.1 HiTE TIR 1282955 1289770 . - . id=te_intact_36;name=TIR_7;classification=RC/Helitron;tir=NA;tsd=TGAAGAGGGCC;tsd_len=11
CP072753.1 HiTE TIR 1300401 1306986 . - . id=te_intact_244;name=TIR_12;classification=Unknown;tir=NA;tsd=TTTT;tsd_len=4
CP072753.1 HiTE TIR 1310256 1311750 . - . id=te_intact_92;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=TA;tsd_len=2

Thank you very much for your response. As you mentioned, I reran the software and obtained the results in HiTE.gff and HiTE_intact.sorted.gff3. However, when I opened HiTE_intact.sorted.gff3, all the identified TE types were TIR, which is unexpected since LTR retrotransposons are predominant in this species. Could this be an issue with the software? I followed the installation instructions, and no error messages appeared during execution.
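
A quick way to check which element types an annotation file contains is to tally the source and type columns (columns 2 and 3 of GFF). The sketch below is illustrative only: `tally_gff` is a hypothetical helper, not a HiTE function, and it splits on any whitespace so that it works both on tab-separated files and on space-separated excerpts like the one pasted in this thread.

```python
from collections import Counter


def tally_gff(lines):
    """Count records per (source, type) pair, i.e. GFF columns 2 and 3."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()  # whitespace split handles tabs and spaces
        if len(fields) < 3:
            continue  # not a feature record
        counts[(fields[1], fields[2])] += 1
    return counts


# Records copied from the excerpt in this thread.
sample = [
    "CP072753.1 HiTE TIR 223653 224024 . + . id=te_intact_233;name=TIR_14;classification=Unknown;tir=NA;tsd=CTTAG;tsd_len=5",
    "CP072753.1 HiTE Helitron 1186805 1187859 . + . id=te_intact_1;name=Helitron_1;classification=RC/Helitron;hairpin_loop=NA",
    "CP072753.1 HiTE TIR 1310256 1311750 . - . id=te_intact_92;name=TIR_6;classification=DNA/TcMar;tir=NA;tsd=TA;tsd_len=2",
]

for (source, feature_type), n in sorted(tally_gff(sample).items()):
    print(source, feature_type, n)
```

Passing an open file handle for HiTE_intact.sorted.gff3 instead of `sample` would tally the whole file. In the pasted excerpt, the records are TIR and Helitron features with no LTR features, which is the symptom reported here.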

CSU-KangHu commented 2 days ago

Could you run ls -alh in the output directory so I can check the details of the output files?

wilson1990D commented 2 days ago

drwxr-xr-x 2 guo 4.0K 11/28 08:48 .
drwxr-xr-x 5 guo 4.0K 11/28 08:40 ..
-rw-r--r-- 1 guo 136 11/28 08:48 chr_name.map
-rw-r--r-- 1 guo 3.2K 11/28 08:47 confident_helitron_0.fa
-rw-r--r-- 1 guo 3.1K 11/28 08:47 confident_helitron.fa
-rw-r--r-- 1 guo 188K 11/28 08:45 confident_ltr_cut.fa
-rw-r--r-- 1 guo 0 11/28 08:45 confident_ltr.internal.fa
-rw-r--r-- 1 guo 0 11/28 08:45 confident_ltr.terminal.fa
-rw-r--r-- 1 guo 0 11/28 08:47 confident_non_ltr_0.fa
-rw-r--r-- 1 guo 0 11/28 08:47 confident_non_ltr.fa
-rw-r--r-- 1 guo 0 11/28 08:45 confident_other.fa
-rw-r--r-- 1 guo 223K 11/28 08:47 confident_TE.cons.fa
-rw-r--r-- 1 guo 45K 11/28 08:46 confident_tir_0.fa
-rw-r--r-- 1 guo 35K 11/28 08:47 confident_tir.fa
-rw-r--r-- 1 guo 36M 11/28 08:48 genome.rename.fa
-rw-r--r-- 1 guo 693K 11/28 08:48 HiTE.gff
-rw-r--r-- 1 guo 45K 11/28 08:48 HiTE_intact.sorted.gff3
-rw-r--r-- 1 guo 888K 11/28 08:48 HiTE.out
-rw-r--r-- 1 guo 2.4K 11/28 08:48 HiTE.tbl
-rw-r--r-- 1 guo 2.6M 11/28 08:44 intact_LTR.fa
-rw-r--r-- 1 guo 2.6M 11/28 08:45 intact_LTR.fa.classified
-rw-r--r-- 1 guo 28K 11/28 08:45 intact_LTR.list
-rw-r--r-- 1 guo 6.8M 11/28 08:45 longest_repeats_0.fa
-rw-r--r-- 1 guo 7.2M 11/28 08:45 longest_repeats_0.flanked.fa
-rw-r--r-- 1 guo 38K 11/28 08:47 TE_merge_tmp.fa.classified

Thank you very much for your concern. Here is the output of running ls -alh in the result output folder. Please take a look.
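
Several files in this listing are 0 bytes, including confident_ltr.internal.fa and confident_ltr.terminal.fa, even though intact_LTR.fa is 2.6M. Zero-byte outputs can be spotted programmatically. The following is a minimal sketch with a hypothetical helper name, `empty_files`; it is not part of HiTE, and the demo uses a temporary directory with made-up contents rather than real results. An empty file is not necessarily a bug, since it can also mean that no elements of that class were found.

```python
import tempfile
from pathlib import Path


def empty_files(outdir):
    """Return the sorted names of zero-byte regular files directly in outdir."""
    return sorted(
        p.name
        for p in Path(outdir).iterdir()
        if p.is_file() and p.stat().st_size == 0
    )


# Demo on a throwaway directory; file names mirror the listing in this thread,
# but the contents are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "confident_ltr.internal.fa").touch()  # zero bytes
    (Path(tmp) / "confident_tir.fa").write_text(">TIR_1\nACGT\n")
    demo = empty_files(tmp)

print(demo)
```

Pointing `empty_files` at the real output directory (try-intact in the command used here) would list the empty outputs without reading through the ls listing by eye.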

CSU-KangHu commented 2 days ago

Could you please package the file so that I can download and review it?

wilson1990D commented 2 days ago

HiTE-res.zip

Here are the result files for your review. Thank you very much for your valuable time.

CSU-KangHu commented 2 days ago

Hi @wilson1990D,

This bug occurred due to our recent replacement of the LTR detection module, which caused a formatting inconsistency when generating the LTR full-length annotation.

I have already updated the code. Please download the latest version, which should resolve the issue.

Thank you again for using our tool!

wilson1990D commented 1 day ago

Thank you for your help. Despite reinstalling several times and adjusting the pandas and Python versions, the same problem still occurs. Strangely, the process runs without any error messages, which is very puzzling.

CSU-KangHu commented 1 day ago

Hi @wilson1990D,

First of all, I sincerely apologize for the inconvenience. As we are currently updating and upgrading HiTE, unexpected issues may arise. Thank you so much for using and testing HiTE, and for bringing this to our attention.

Upon checking, I found that the issue was caused by a file name change that wasn’t updated accordingly in the code. However, I ran a test yesterday and was able to generate results successfully, which is quite puzzling.

Regardless, I have updated the code and tested it again using the genome.rename.fa file from your provided HiTE-res directory. All necessary files were successfully generated. Both confident_non_ltr.fa and confident_other.fa are empty, indicating that HiTE did not identify any non-LTR elements in this genome. The HiTE_intact.sorted.gff3 file contains the full-length LTR annotations identified. Please download the latest version of the code and try again.

Additionally, in the latest environment.yml, we’ve updated to a newer version of TensorFlow, along with compatible versions of NumPy and pandas. This should resolve any issues related to pandas or Python versions.

Best regards,
Kang

wilson1990D commented 1 day ago

Thank you so much for your selfless help. With your guidance and the updated code, I obtained the same results as you, which should be correct. I am deeply grateful, and I hope HiTE reaches and helps many more users.

CSU-KangHu commented 1 day ago

Thanks!!!