aaranyue / quarTeT

A telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification
http://atcgn.com:8080/quarTeT/home.html

endless slurm work problem #19

Closed cyycyj closed 7 months ago

cyycyj commented 1 year ago

Dear developer,

I am trying to use quartet_centrominer.py for centromere candidate prediction. My plant genome is about 500 Mb with 18 chromosomes, and I submitted the following script to the Slurm scheduler:

`#!/bin/bash
#SBATCH --job-name=quartet_centrominer
#SBATCH --partition acPartition
#SBATCH --cpus-per-task=128
#SBATCH --ntasks-per-node=1
#SBATCH -o quartet_centrominer.out
#SBATCH -e quartet_centrominer.err
#SBATCH --nodes=1

source /home/bio/.bashrc
source activate /home/bio/miniconda3/envs/mummer4

cd /home/bio/data/quertet-centrominer

python3 /home/bio/biosoft/quarTeT-1.1.5/quartet_centrominer.py -i ../4S_chr.fa -p cent_out -n 100 -m 200 -t 128`

The job appears to be running for an unusually long time, exceeding 4 hours, and is utilizing a significant number of CPU cores (128 cores) and memory (256GB). Additionally, both the standard and error output files are empty, and there seem to be no changes in the files within the output directory (tmp/, candidate/, TRfasta/, and TRgff3/) for 2 hours. Could you please help me understand what might be happening and how I can address this issue?

Best regards,

Andrew

Echoring commented 1 year ago

It looks like the trf call is stuck. It has been reported that trf may take an extremely long time (days) on some complex chromosomes. trf does not support multiprocessing; the thread count set here is only used to split the genome into chromosomes and process them in parallel. If one chromosome takes a long time, increasing the thread count does not help. Breaking that chromosome into several parts may help, but make sure not to break it inside a repetitive region.
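The splitting suggestion above could be sketched roughly as follows. This is only an illustration, not part of quarTeT: the function names, the `name_start_end` naming of the parts, and the breakpoints themselves are assumptions, and the breakpoints have to be chosen by hand so that they fall outside repetitive regions.

```python
def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_at_breakpoints(name, seq, breakpoints):
    """Split one sequence at 0-based cut positions.

    Returns (part_name, part_seq) pairs. Each part name records the 1-based
    start and end of the part (hypothetical naming scheme) so that results
    can be mapped back to the original chromosome coordinates.
    """
    points = sorted(set(breakpoints))
    if points and (points[0] <= 0 or points[-1] >= len(seq)):
        raise ValueError(f"breakpoints must lie strictly inside {name}")
    cuts = [0] + points + [len(seq)]
    return [
        (f"{name}_{start + 1}_{end}", seq[start:end])
        for start, end in zip(cuts, cuts[1:])
    ]


def write_fasta(records, path, width=60):
    """Write (name, sequence) pairs as FASTA with fixed-width lines."""
    with open(path, "w") as out:
        for name, seq in records:
            out.write(f">{name}\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")
```

For example, one could read the genome with `read_fasta`, call `split_at_breakpoints` only for the chromosome that stalls, and write all records back out with `write_fasta` before running the tool again. Coordinates reported for a part then need the part's start offset added back.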

cyycyj commented 1 year ago

Thank you very much! I hope you can solve it. Maybe another tool could be used instead of trf?

I have another question about scaffolding. When I use hifiasm to assemble primary contigs, I've noticed that there is a quite long contig (over 60Mb, nearly 1.5 times the length of the longest chromosomes which are about 40Mb) in the p_ctg.fa. When I use endhic scaffolding, it gives me an incorrect number of chromosomes. Should I also use hap1.p_ctg.fa and hap2.p_ctg.fa as input for primary contig scaffolding?

By the way, I must say, quartet is a really great and cool tool, especially the gap-filling feature. I also discovered that Quartet has a meaning in the music field. I truly appreciate the romantic feeling that combines computer science and art. I will definitely recommend it to others.

Echoring commented 1 year ago

In this case, it looks like an assembly error. If p_ctg has this error, the hapX.p_ctg files are likely to have the same one. Using p_utg may help, but it results in more fragmented sequences. If this contig is two chromosomes joined together, breaking it according to the Hi-C contacts is also an option. And yes, you got the point! The name also refers to the four modules working together toward the T2T target, cooperating like a quartet.

cyycyj commented 1 year ago

Yeah, but actually the contigs of 'hapX.p_ctg' seem ok, and the results of endhic scaffolding do make sense, even when compared to the previously published genome. I think maybe it's due to the high heterozygosity (about 1.29%, 2n=30) and the complexity of the chromosomes, as you mentioned before?

Echoring commented 1 year ago

Sounds reasonable. High heterozygosity may affect the performance of many programs.

cyycyj commented 10 months ago

Hello,

I found that v1.1.6 is an epic update; quartet_centrominer.py runs much faster. But when I use RepeatMasker's output gff3 file as the --TE input (as below), it seems that quartet_centrominer.py cannot use or identify the TE information in it. Could you please tell me what is happening?

example of the RepeatMasker's output gff3 file:

##gff-version 3
##sequence-region chr1 1 39963625
chr1    RepeatMasker    dispersed_repeat        1       9024    8942    +       .       ID=1;Target=(CCCTAAA)n 1 9078
chr1    RepeatMasker    dispersed_repeat        9045    10331   4245    -       .       ID=2;Target=rnd-3_family-624 93 1248
chr1    RepeatMasker    dispersed_repeat        10331   10415   360     -       .       ID=3;Target=rnd-3_family-624 682 773
chr1    RepeatMasker    dispersed_repeat        10372   10618   678     -       .       ID=4;Target=ltr-1_family-454 5683 5764
chr1    RepeatMasker    dispersed_repeat        10484   12495   13276   +       .       ID=5;Target=ltr-1_family-104 4987 6898
chr1    RepeatMasker    dispersed_repeat        12447   12502   370     +       .       ID=6;Target=rnd-1_family-147 11 66
chr1    RepeatMasker    dispersed_repeat        12498   13899   4361    -       .       ID=2;Target=rnd-3_family-624 16 1315
chr1    RepeatMasker    dispersed_repeat        13902   14320   2242    -       .       ID=7;Target=rnd-1_family-339 1 414
chr1    RepeatMasker    dispersed_repeat        14321   14348   949     +       .       ID=8;Target=rnd-1_family-243 388 399
chr1    RepeatMasker    dispersed_repeat        14350   15088   2316    +       .       ID=9;Target=ltr-1_family-536 2652 3274
chr1    RepeatMasker    dispersed_repeat        14910   15324   2656    -       .       ID=10;Target=rnd-1_family-353 131 558
chr1    RepeatMasker    dispersed_repeat        14952   15742   2176    +       .       ID=11;Target=rnd-1_family-55 75 874
chr1    RepeatMasker    dispersed_repeat        15740   16774   4469    +       .       ID=9;Target=ltr-1_family-536 8129 9151
chr1    RepeatMasker    dispersed_repeat        16775   16808   40      +       .       ID=12;Target=(TA)n 1 34
chr1    RepeatMasker    dispersed_repeat        16809   17097   4469    +       .       ID=9;Target=ltr-1_family-536 9152 9268
chr1    RepeatMasker    dispersed_repeat        17098   17518   2514    -       .       ID=13;Target=rnd-1_family-32 1 434
chr1    RepeatMasker    dispersed_repeat        17428   17974   2127    -       .       ID=14;Target=rnd-1_family-41 1 541
chr1    RepeatMasker    dispersed_repeat        17519   17563   4469    +       .       ID=9;Target=ltr-1_family-536 9269 9287
chr1    RepeatMasker    dispersed_repeat        17974   18412   2413    -       .       ID=15;Target=rnd-3_family-624 679 1121
...
Echoring commented 10 months ago

CentroMiner requires that the third column of the gff3 file describes the class of the dispersed repeat; specifically, it looks for a string that includes LTR.
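A quick way to check a gff3 file against this requirement is to count the values in its third column. The sketch below is only an illustration: the function names are made up, and it assumes the match is a case-sensitive substring test for "LTR", which is a guess at how the check behaves rather than CentroMiner's actual code.

```python
from collections import Counter


def summarize_feature_types(gff3_lines):
    """Count the values found in column 3 (feature type) of GFF3 records."""
    counts = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column GFF3 record
        counts[fields[2]] += 1
    return counts


def has_ltr_type(counts):
    """True if any column-3 value contains 'LTR' (assumed matching rule)."""
    return any("LTR" in feature_type for feature_type in counts)
```

With the RepeatMasker file quoted above, every record has `dispersed_repeat` in the third column, so `has_ltr_type` would return False, which is consistent with the TE information not being picked up.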

cyycyj commented 10 months ago

Thanks! I would like to ask whether you could update quarTeT so that it can identify the output of RepeatMasker, since it is widely used software for repeat annotation.

Echoring commented 10 months ago

I have updated this in pre-release v1.1.7. Give it a try.

cyycyj commented 10 months ago

Wow! I will try it right now, thanks!

cyycyj commented 10 months ago

Dear developer,

I have tried v1.1.7, and it seems to fetch the TE information from RepeatMasker's output successfully, because TElength and TEcoverage are now reported:

# Chr   start   end     length  TRlength        TRcoverage      TElength        TEcoverage      ragionscore
#       subTR   period  subTRlength     subTRcoverage   pattern
chr1    39137190        39942409        805220  640447  79.54%  73099   9.08%   0.8044481066641498
        chr1@TR_00936   188     582837  72.38%  GTTAGTAAGGGAAATTTGAGCAAAATTAGAAAACTCGTGTATTACACCCAGAAACGCGATTCGACTGAAAACCTTGTTATGGAACTGCTAGAAATACTCTATTTTATCCATGAGGGACATCTAGGGTCATTCCGAGCGCAACGCGCGGTCATTCCTAGACCATAAAAAAATCAATAATTTTCGTAGGG
        chr1@TR_00932   187     520993  64.7%   CATTCCGAGACCATAAAAAAATCAACAATTTTCATAGGGGTTAGTAAGGAAATTTGAGCAAAATTAGAAAACTCGTATATTACACTCAGAAACGCGATTCGACTGAAAAACTTGTTATGGAACTTCTAGAAATACTCTATTTTATCCATGAGGGACATCTAGGGTCATTCCGAGCACAACGCGCAAT
        chr1@TR_00952   180     376665  46.78%  AACTGCTAGAAATACTCTATTATATCCATGAGTGAAATCTGGGGTCATTCCGAGCGGATCATCCCGAGACAATAAAAAAATCTAAAATTTTCATAGGGGTTAGTAAGGGAAATTTGAGCAAAATTAGAAAAGTCGTATATTACACTGAGAAACGTGATTCGACTGAAAACCTTGTTATGG

But there is still a weird result, shown below. As you can see, the identified centromere is still at the end of the chromosome. Could you please give me some advice on this? Thanks.

(attached image: test)

Echoring commented 10 months ago

Sometimes another tandem-repeat-rich area may be scored higher than the centromere. You can check the candidate folder and see whether the second-best or a lower-scored area gives a reasonable result.
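Going through the candidates by score could be sketched like this. It assumes the files in the candidate folder use the same layout as the output quoted earlier in this thread (region lines with nine columns ending in the score, followed by indented subTR lines); that layout, and the function names, are assumptions made for illustration.

```python
def parse_candidates(lines):
    """Collect region lines from CentroMiner-style output, skipping subTR rows."""
    regions = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or header
        if line[0].isspace():
            continue  # indented subTR detail line
        fields = line.split()
        if len(fields) < 9:
            continue  # not a full region record
        regions.append({
            "chr": fields[0],
            "start": int(fields[1]),
            "end": int(fields[2]),
            "length": int(fields[3]),
            "TRcoverage": fields[5],
            "TEcoverage": fields[7],
            "score": float(fields[8]),
        })
    return regions


def rank_by_score(regions):
    """Return regions sorted from highest to lowest score."""
    return sorted(regions, key=lambda region: region["score"], reverse=True)
```

Printing the ranked regions for one chromosome makes it easier to see whether a lower-ranked candidate sits in a more plausible position than the top-scoring one near the chromosome end.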

cyycyj commented 8 months ago

Thank you for your answer. I would also like to ask how to set -r properly when running quartet_centrominer.py.

Echoring commented 8 months ago

Sorry, but as far as I remember there is no -r option in the CentroMiner module?

Sorry, I forgot to add this new parameter to the documentation. You can check the help on the command line. This parameter is set in millions; the default is 3.