Statistical problems with the output file

RY-Zheng commented 1 month ago

Dear developers, I ran HiTE and it produced multiple output files, including HiTE.out and HiTE.detail.tbl. When I manually counted the records in HiTE.out, the number did not match the count recorded in HiTE.detail.tbl: HiTE.out has more records. Could you explain this discrepancy? Also, is there a way to convert HiTE.out to an annotation file so that I can extract specific sequences, such as all LTR/Gypsy sequences? Or is there a way to extract sequences based on the summary statistics in HiTE.detail.tbl or HiTE.tbl?

$ grep -c "LTR/Gypsy" HiTE.out
1340604

HiTE.detail.tbl:

Class                  Count        bpMasked      %masked
=====                  =====        ========      =======
DNA                    --           --            --
    CMC-EnSpm          2396         692283        0.03%
    Crypton            822          246074        0.01%
    MULE               93620        19726548      0.74%
    Merlin             2169         816875        0.03%
    PIF-Harbinger      95019        20725099      0.77%
    TcMar              222014       84862829      3.17%
    hAT                35133        7498474       0.28%
LINE                   --           --            --
    L1                 70974        84627363      3.16%
LTR                    --           --            --
    Copia              194336       84748164      3.16%
    ERV                83462        56049195      2.09%
    Gypsy              1123194      1771812252    66.17%
RC                     --           --            --
    Helitron           172698       49453545      1.85%
SINE                   --           --            --
    tRNA               16209        2247155       0.08%
Unknown                121537       31955024      1.19%

total interspersed     2233583      2215460880    82.73%

CSU-KangHu commented 1 month ago

It might be due to the merging of overlap records in HiTE.detail.tbl from HiTE.out. You can refer to the details implemented in RepeatMasker’s buildSummary.pl.
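To illustrate why merging overlapping records can lower a count, here is a minimal Python sketch. It is not the logic of buildSummary.pl, which should be checked in the RepeatMasker source; the function names `merge_intervals` and `summarize` and the example records are hypothetical. It merges overlapping intervals of the same class on the same sequence and reports the raw record count, the merged count, and the merged base pairs.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals (1-based, inclusive)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def summarize(records):
    """records: iterable of (seq_id, start, end, te_class).

    Returns {te_class: (raw_count, merged_count, bp_masked)}.
    """
    grouped = defaultdict(list)
    for seq_id, start, end, te_class in records:
        grouped[(te_class, seq_id)].append((start, end))

    summary = defaultdict(lambda: [0, 0, 0])
    for (te_class, _seq_id), intervals in grouped.items():
        merged = merge_intervals(intervals)
        totals = summary[te_class]
        totals[0] += len(intervals)
        totals[1] += len(merged)
        totals[2] += sum(end - start + 1 for start, end in merged)
    return {te_class: tuple(t) for te_class, t in summary.items()}


# Hypothetical records, not taken from any real HiTE.out file.
records = [
    ("Chr01", 100, 500, "LTR/Gypsy"),
    ("Chr01", 400, 900, "LTR/Gypsy"),   # overlaps the previous record
    ("Chr01", 2000, 2500, "LTR/Gypsy"),
    ("Chr01", 3000, 3100, "LTR/Copia"),
]
result = summarize(records)
for te_class, (raw, merged, bp) in sorted(result.items()):
    print(te_class, raw, merged, bp)
```

In this toy input the three LTR/Gypsy records collapse to two merged intervals, so a line count such as `grep -c` would report a higher number than a summary built after merging.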

If you specify --annotate 1, it should also generate the HiTE.gff file.

RY-Zheng commented 1 month ago

I specified '--annotate 1' and the 'HiTE.gff' file was generated. However, HiTE.gff does not seem to contain a specific classification, such as LTR/Gypsy. Also, the number of TEs in HiTE.gff is consistent with HiTE.out, but both are still inconsistent with the counts in HiTE.tbl and HiTE.detail.tbl. Is there a filtering mechanism applied when HiTE.tbl and HiTE.detail.tbl are generated? I want to extract all the sequences counted in HiTE.tbl and HiTE.detail.tbl. Thank you very much.

HiTE.gff :

##gff-version 2
##date 2024-10-08
##sequence-region cyl.hap1.genome.chrom_level.fa
Chr01.1 RepeatMasker similarity 26 10844 0.4 + . Target "Motif:(CTAAACC)n" 1 10799
Chr01.1 RepeatMasker similarity 10845 10998 7.2 - . Target "Motif:TIR_617" 366 526
Chr01.1 RepeatMasker similarity 10859 11011 6.9 + . Target "Motif:TIR_1" 40 203
Chr01.1 RepeatMasker similarity 11013 11287 9.8 - . Target "Motif:LTR_1087_LTR" 1419 1577
Chr01.1 RepeatMasker similarity 11270 11378 11.8 + . Target "Motif:TIR_699" 257 350
Chr01.1 RepeatMasker similarity 11380 12150 9.2 + . Target "Motif:(ATTATGAC)n" 1 795
Chr01.1 RepeatMasker similarity 12151 12450 11.8 + . Target "Motif:TIR_699" 351 672
Chr01.1 RepeatMasker similarity 12451 12503 11.4 - . Target "Motif:TIR_229" 865 1229
Chr01.1 RepeatMasker similarity 12504 12609 3.5 + . Target "Motif:TIR_435" 804 1280
Chr01.1 RepeatMasker similarity 12610 12684 8.5 + . Target "Motif:(ATTATGAC)n" 1 75
Chr01.1 RepeatMasker similarity 12686 12909 10.3 - . Target "Motif:TIR_1" 418 614
Chr01.1 RepeatMasker similarity 12910 13264 12.6 + . Target "Motif:LTR_1087_LTR" 1043 1424
Chr01.1 RepeatMasker similarity 12988 13062 9.2 - . Target "Motif:TIR_435" 781 1306
Chr01.1 RepeatMasker similarity 13063 13482 16.4 + . Target "Motif:TIR_229" 354 777
Chr01.1 RepeatMasker similarity 13483 13631 10.3 + . Target "Motif:(TCATAATG)n" 1 151

CSU-KangHu commented 1 month ago

You can follow this guide (https://www.biostars.org/p/382150/) to convert the .out file to a .gff file. In fact, both the .out and .tbl files are generated by RepeatMasker, as invoked by HiTE. Therefore, to understand the specific correspondence between the .out and .tbl files, you may need to refer to the RepeatMasker code.
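As a rough illustration of such a conversion, the Python sketch below parses RepeatMasker-style .out lines and writes GFF3 records that keep the repeat class/family (for example LTR/Gypsy) in the attributes column. It is not the script from the linked guide. It assumes the standard whitespace-separated .out layout in which column 5 is the query sequence, columns 6 and 7 are begin and end, column 9 is the strand (`+` or `C`), column 10 is the repeat name, column 11 is the class/family, and column 15 is the ID. The function names, the `class` and `rm_id` attribute keys, and the sample lines are hypothetical.

```python
def parse_out_line(line):
    """Parse one .out data line; return None for header or blank lines."""
    fields = line.split()
    # Data lines start with a numeric SW score and have at least 15 fields.
    if len(fields) < 15 or not fields[0].isdigit():
        return None
    return {
        "perc_div": float(fields[1]),
        "seq_id": fields[4],
        "start": int(fields[5]),
        "end": int(fields[6]),
        "strand": "-" if fields[8] == "C" else "+",
        "name": fields[9],
        "te_class": fields[10],
        "rm_id": fields[14],
    }


def to_gff3(rec):
    """Format a parsed record as one tab-separated GFF3 line."""
    attributes = f"Name={rec['name']};class={rec['te_class']};rm_id={rec['rm_id']}"
    columns = [
        rec["seq_id"], "RepeatMasker", "dispersed_repeat",
        str(rec["start"]), str(rec["end"]), str(rec["perc_div"]),
        rec["strand"], ".", attributes,
    ]
    return "\t".join(columns)


def convert(out_lines):
    """Yield GFF3 lines for every data line in an iterable of .out lines."""
    yield "##gff-version 3"
    for line in out_lines:
        rec = parse_out_line(line)
        if rec is not None:
            yield to_gff3(rec)


# Hypothetical .out content (header plus two made-up records).
sample = [
    "   SW   perc perc perc  query     position in query    matching  repeat",
    "score   div. del. ins.  sequence  begin end (left)     repeat    class/family",
    "",
    " 1234  9.8  0.0  1.2  Chr01  11013  11287  (500000)  C  TE_A  LTR/Gypsy  (10)  1577  1419  4",
    "  987  6.9  0.5  0.0  Chr01  10859  11011  (500276)  +  TE_B  DNA/hAT  40  203  (300)  3",
]
gff_lines = list(convert(sample))
gypsy_lines = [g for g in gff_lines if "class=LTR/Gypsy" in g]
print("\n".join(gff_lines))
```

With real data, `convert(open("HiTE.out"))` would be the likely entry point, and filtering on the `class=` attribute selects a single family.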

RY-Zheng commented 1 month ago

Ok, thank you. What I want to ask is how to extract the same number of sequences as recorded in HiTE.detail.tbl. For example, HiTE.detail.tbl contains:

Class                  Count        bpMasked      %masked
=====                  =====        ========      =======
DNA                    --           --            --
    CMC-EnSpm          2396         692283        0.03%
    Crypton            822          246074        0.01%
    MULE               93620        19726548      0.74%
    Merlin             2169         816875        0.03%
    PIF-Harbinger      95019        20725099      0.77%
    TcMar              222014       84862829      3.17%
    hAT                35133        7498474       0.28%
LINE                   --           --            --
    L1                 70974        84627363      3.16%
LTR                    --           --            --
    Copia              194336       84748164      3.16%
    ERV                83462        56049195      2.09%
    Gypsy              1123194      1771812252    66.17%

Here the count recorded for LTR/Gypsy is 1123194. How can I extract the complete sequences for exactly these records?

Thank you

CSU-KangHu commented 1 month ago

I'm not sure exactly why the number of records in the .detail.tbl file differs from that in the .out file; as I mentioned, the .tbl files are generated by RepeatMasker's buildSummary.pl.

If you want to extract LTR/Gypsy sequences, you can easily extract the corresponding sequences based on the .gff file and calculate metrics like Count and bpMasked yourself.
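A minimal Python sketch of that extraction follows. It assumes each annotation record already carries its class label, for instance taken from the class/family column of the .out file, since the HiTE.gff shown earlier in this thread has no class field. The function names, record layout, and toy genome are hypothetical. Minus-strand hits are reverse-complemented, and the bp total here is a plain sum of record lengths without merging overlaps, so it need not match bpMasked in the .tbl files.

```python
def read_fasta(lines):
    """Read FASTA from an iterable of lines into {sequence_id: sequence}."""
    genome, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                genome[name] = "".join(chunks)
            # Keep only the first word of the header as the sequence ID.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        genome[name] = "".join(chunks)
    return genome


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract(genome, records, wanted_class):
    """Yield (header, sequence) for each record of wanted_class.

    records: iterable of (seq_id, start, end, strand, te_class),
    with 1-based inclusive coordinates.
    """
    for seq_id, start, end, strand, te_class in records:
        if te_class != wanted_class:
            continue
        seq = genome[seq_id][start - 1:end]
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
        yield f"{seq_id}:{start}-{end}({strand})", seq


# Toy genome and records, invented for illustration only.
genome = read_fasta([">Chr01 toy chromosome", "ACGTACGTAC", "GGGTTTAAAC"])
records = [
    ("Chr01", 1, 4, "+", "LTR/Gypsy"),
    ("Chr01", 11, 16, "-", "LTR/Gypsy"),
    ("Chr01", 5, 8, "+", "LTR/Copia"),
]
hits = list(extract(genome, records, "LTR/Gypsy"))
count = len(hits)
bp_total = sum(len(seq) for _, seq in hits)

for header, seq in hits:
    print(f">{header}\n{seq}")
print("Count:", count, "bp:", bp_total)
```

For a full genome, `read_fasta(open(path))` loads every chromosome into memory, which is simple but may be heavy for very large assemblies.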

CSU-KangHu / HiTE