How is the annotation performance in the large genome (>10G)

haoyongchao commented 2 months ago

I would like to use the pipeline on a large plant genome. Would it be better to run it separately on each chromosome or directly on the entire genome? Are there any requirements for CPUs and RAM? Have you ever tested it on a large genome? Thanks!!

CSU-KangHu commented 2 months ago

Hi @haoyongchao,

Thank you for using HiTE. While it is possible to run HiTE on the entire genome or on individual chromosomes separately, I recommend running HiTE on the entire genome. Running it on single chromosomes may miss TEs that are distributed across different chromosomes.

Previously, we ran the older version of HiTE on a 4.9 GB wheat genome using 40 CPU cores, which took 2-3 days. In tests with the new version of HiTE, it took 25 hours to process a 2.1 GB maize genome and 10 hours for a 2.6 GB mouse genome. Memory is generally not a limiting factor, but we suggest having 100 GB or more. We haven't tested HiTE on large plant genomes over 10 GB, but you are welcome to try it out. Additionally, if you encounter any issues during the process, we are happy to assist.

Best, Kang

haoyongchao commented 2 months ago

Thank you for your prompt reply. I am running the pipeline on a 10G plant genome using 100 CPUs.

wjq1981 commented 3 weeks ago

Hi, thanks for developing such great software. I am running it on a 9G genome, and it feels like nothing ever comes out of it. It has been running since July 30th and is still at “2024-07-30 02:18:12,685 - main.py[line:389] - INFO: cd /HiTE/module && python3 /HiTE/module/judge_LTR_transposons.py -g /dev/hdd/wangjq/genome/Ago/09.repeat/HiTE/Ago.fasta --ltrharvest_home /HiTE/bin/LTRHARVEST parallel --ltrfinder_home /HiTE/bin/LTR_FINDER_parallel-master -t 24 --tmp_output_dir /dev/hdd/genome/Ago/repeat/HiTE --recover 1 --miu 7e-09 --use_NeuralTE 1 --is_wicker 0 --NeuralTE_home /HiTE/bin/NeuralTE --TEClass_home /HiTE/classification”. Can you suggest anything?

CSU-KangHu commented 3 weeks ago

Hi @wjq1981,

Thank you for using HiTE, and I apologize for the long runtime. Previously, I didn’t recommend splitting contigs because I was concerned it might break the TE sequences. However, considering the efficiency needed for extremely large genomes, some trade-offs in performance might be necessary.

From your output, it appears that HiTE is still stuck in the first stage of the LTR search. I suspect your genome contains some particularly long contigs. Since the LTR module processes each contig separately, with one process handling one contig, a very long contig could cause that process to run for an extended period while the other processes sit idle.
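One way to check whether a long contig is likely to dominate the runtime is to compare the longest contig with an even per-process share of the genome. The following is a minimal sketch, not part of HiTE: the function names are hypothetical, and it assumes that LTR search time grows roughly with contig length.

```python
from typing import Dict, Iterable


def contig_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return {contig_name: length} for FASTA text supplied line by line."""
    lengths: Dict[str, int] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def imbalance_report(lengths: Dict[str, int], n_procs: int) -> Dict[str, object]:
    """Compare the longest contig with an even share of the genome per process."""
    if not lengths:
        raise ValueError("no contigs found")
    total = sum(lengths.values())
    longest_name, longest = max(lengths.items(), key=lambda kv: kv[1])
    ideal = total / n_procs  # bases per process if work were spread evenly
    return {
        "total_bp": total,
        "longest_contig": longest_name,
        "longest_bp": longest,
        "ideal_bp_per_process": ideal,
        "longest_over_ideal": longest / ideal,
    }


# Example usage (the file name is a placeholder):
# with open("genome.fasta") as fh:
#     print(imbalance_report(contig_lengths(fh), n_procs=24))
```

A `longest_over_ideal` value well above 1 suggests that the process assigned to the longest contig will still be running long after the others have finished.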

Therefore, I suggest splitting your genome into contigs with more balanced lengths so that the runtime is more evenly distributed across processes. Since LTRs typically span up to around 20 kb, the contigs should be long enough to avoid breaking them, perhaps around 10 Mb? This should help fully utilize the processes that are currently idle.
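The split described above could be sketched as follows. This is an illustration under stated assumptions rather than a HiTE utility: the function names and the `_chunkN_startM` naming scheme are hypothetical, chunks do not overlap (so an element that straddles a chunk boundary will be broken, which is the trade-off mentioned earlier), and each contig is held in memory while it is split.

```python
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str]  # (name, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA text supplied line by line."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            parts = []
        elif name is not None:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def split_records(records: Iterable[Record],
                  max_len: int = 10_000_000) -> Iterator[Record]:
    """Cut any sequence longer than max_len into consecutive chunks.

    Chunk names record the 0-based start offset so that coordinates can be
    mapped back to the original contig later.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    for name, seq in records:
        if len(seq) <= max_len:
            yield name, seq
            continue
        for i, start in enumerate(range(0, len(seq), max_len)):
            yield f"{name}_chunk{i}_start{start}", seq[start:start + max_len]


def write_fasta(records: Iterable[Record], out: TextIO, width: int = 80) -> None:
    """Write records as FASTA with fixed-width sequence lines."""
    for name, seq in records:
        out.write(f">{name}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")


# Example usage (file names are placeholders):
# with open("genome.fasta") as src, open("genome.split.fasta", "w") as dst:
#     write_fasta(split_records(read_fasta(src)), dst)
```

The 10 Mb default follows the figure suggested above; contigs at or below that length are passed through unchanged.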

I hope you find this suggestion helpful.

Best,
Kang

wjq1981 commented 3 weeks ago

Thank you for your prompt response. I will give it a try.

CSU-KangHu commented 2 days ago

Hello @haoyongchao and @wjq1981,

I hope you’re well. We’ve noticed that the current version of HiTE might not be ideal for handling large genomes, as it tends to require extensive runtime. To address this, we are optimizing the LTR module to enhance both performance and speed. Could you please share the links to the 10GB and 9GB genomes you used previously? This will help us assess whether the improvements have indeed sped up the LTR module.

Best regards,
Kang

wjq1981 commented 2 days ago

Sorry, school started today and I'm only just seeing this now. Here is the link:

https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Alisma_plantago-aquatica/all_assembly_versions/GCA_963693085.1_laAliPlan1.1/GCA_963693085.1_laAliPlan1.1_genomic.fna.gz

CSU-KangHu / HiTE

How is the annotation performance in the large genome (>10G) #3