Closed cyycyj closed 7 months ago
It looks like the call to `trf` is stuck. It has been reported that `trf` can take an extremely long time (days) to process some complex chromosomes.
`trf` does not support multiprocessing; the thread count set here only splits the genome into chromosomes and processes them in parallel. If a single chromosome takes a long time, adding threads does not help.
Breaking the chromosome into several parts may help, but make sure not to break it inside a repetitive region.
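As a rough illustration of the "do not break inside a repetitive region" advice, here is a minimal Python sketch. It is not part of quarTeT; the function name `choose_breakpoints` and the idea of passing repeat intervals as `(start, end)` tuples are assumptions made for this example. It places cuts at evenly spaced positions and moves any cut that falls inside a repeat to the nearest position just outside it.

```python
def choose_breakpoints(chrom_len, repeats, n_parts):
    """Return n_parts - 1 cut positions that avoid repeat intervals.

    repeats: list of (start, end) tuples, 1-based inclusive; may overlap.
    (Hypothetical helper for illustration, not a quarTeT function.)
    """
    # Merge overlapping or adjacent repeat intervals first.
    merged = []
    for start, end in sorted(repeats):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    cuts = []
    for i in range(1, n_parts):
        ideal = chrom_len * i // n_parts  # evenly spaced target position
        cut = ideal
        for start, end in merged:
            if start <= ideal <= end:
                # The target sits inside a repeat: move to the nearest
                # coordinate just outside it that is still on the chromosome.
                outside = [p for p in (start - 1, end + 1) if 1 <= p <= chrom_len]
                if not outside:
                    raise ValueError("repeat covers the whole chromosome")
                cut = min(outside, key=lambda p: abs(p - ideal))
                break
        cuts.append(cut)
    return cuts


# Example: a 1000 bp sequence with one merged repeat block at 450-600.
print(choose_breakpoints(1000, [(450, 520), (500, 600)], 2))
```

The repeat intervals would have to come from an existing annotation; how well the split pieces behave in `trf` afterwards is something to check case by case.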
Thank you very much! I hope you can solve it. Could another tool be used instead of `trf` to fix it?
I have another question about scaffolding. When I use `hifiasm` to assemble primary contigs, I notice a very long contig in p_ctg.fa (over 60 Mb, nearly 1.5 times the length of the longest chromosomes, which are about 40 Mb).
When I scaffold with endhic, it gives me an incorrect number of chromosomes. Should I also use `hap1.p_ctg.fa` and `hap2.p_ctg.fa` as input for primary contig scaffolding?
By the way, I must say, quartet is a really great and cool tool, especially the gap-filling feature. I also discovered that Quartet has a meaning in the music field. I truly appreciate the romantic feeling that combines computer science and art. I will definitely recommend it to others.
In this case, it looks like an assembly error. If `p_ctg` has this error, the `hapX.p_ctg` assemblies are likely to have it too. Using `p_utg` may help, but it gives more fragmented sequences. If this contig is two chromosomes joined together, breaking it according to the Hi-C contact map is also an option.
And yes, you got the point! The name also reflects four modules working together toward the T2T goal, cooperating like a quartet.
Yeah, but actually the contigs of 'hapX.p_ctg' seem ok, and the results of endhic scaffolding do make sense, even when compared to the previously published genome. I think maybe it's due to the high heterozygosity (about 1.29%, 2n=30) and the complexity of the chromosomes, as you mentioned before?
Sounds reasonable. High heterozygosity may affect the performance of many programs.
Hello,
I found that v1.1.6 is an epic update: quartet_centrominer.py runs much faster. But when I use RepeatMasker's output gff3 file as the `--TE` input (as below), it seems that quartet_centrominer.py cannot use or identify the TE information in it. Could you please tell me what is happening?
example of the RepeatMasker's output gff3 file:
##gff-version 3
##sequence-region chr1 1 39963625
chr1 RepeatMasker dispersed_repeat 1 9024 8942 + . ID=1;Target=(CCCTAAA)n 1 9078
chr1 RepeatMasker dispersed_repeat 9045 10331 4245 - . ID=2;Target=rnd-3_family-624 93 1248
chr1 RepeatMasker dispersed_repeat 10331 10415 360 - . ID=3;Target=rnd-3_family-624 682 773
chr1 RepeatMasker dispersed_repeat 10372 10618 678 - . ID=4;Target=ltr-1_family-454 5683 5764
chr1 RepeatMasker dispersed_repeat 10484 12495 13276 + . ID=5;Target=ltr-1_family-104 4987 6898
chr1 RepeatMasker dispersed_repeat 12447 12502 370 + . ID=6;Target=rnd-1_family-147 11 66
chr1 RepeatMasker dispersed_repeat 12498 13899 4361 - . ID=2;Target=rnd-3_family-624 16 1315
chr1 RepeatMasker dispersed_repeat 13902 14320 2242 - . ID=7;Target=rnd-1_family-339 1 414
chr1 RepeatMasker dispersed_repeat 14321 14348 949 + . ID=8;Target=rnd-1_family-243 388 399
chr1 RepeatMasker dispersed_repeat 14350 15088 2316 + . ID=9;Target=ltr-1_family-536 2652 3274
chr1 RepeatMasker dispersed_repeat 14910 15324 2656 - . ID=10;Target=rnd-1_family-353 131 558
chr1 RepeatMasker dispersed_repeat 14952 15742 2176 + . ID=11;Target=rnd-1_family-55 75 874
chr1 RepeatMasker dispersed_repeat 15740 16774 4469 + . ID=9;Target=ltr-1_family-536 8129 9151
chr1 RepeatMasker dispersed_repeat 16775 16808 40 + . ID=12;Target=(TA)n 1 34
chr1 RepeatMasker dispersed_repeat 16809 17097 4469 + . ID=9;Target=ltr-1_family-536 9152 9268
chr1 RepeatMasker dispersed_repeat 17098 17518 2514 - . ID=13;Target=rnd-1_family-32 1 434
chr1 RepeatMasker dispersed_repeat 17428 17974 2127 - . ID=14;Target=rnd-1_family-41 1 541
chr1 RepeatMasker dispersed_repeat 17519 17563 4469 + . ID=9;Target=ltr-1_family-536 9269 9287
chr1 RepeatMasker dispersed_repeat 17974 18412 2413 - . ID=15;Target=rnd-3_family-624 679 1121
...
CentroMiner requires the third column of the gff3 file to describe the class of the dispersed repeat; specifically, it looks for a string that includes `LTR`.
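To see quickly whether a gff3 file meets this requirement, a small check like the following may help. This is only a sketch: the function name `count_ltr_features` is made up for this example, and treating the rule as a case-sensitive substring match on the third tab-separated column is an assumption based on the description above, not on the CentroMiner source.

```python
def count_ltr_features(gff3_lines, keyword="LTR"):
    """Count features whose third column contains `keyword`.

    Returns (matched, total) over all non-comment feature lines.
    Assumes standard tab-separated GFF3 with 9 columns.
    """
    matched = 0
    total = 0
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and ## directives
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete feature line
        total += 1
        if keyword in fields[2]:  # assumed: case-sensitive substring match
            matched += 1
    return matched, total


# A line in the style of the RepeatMasker output above: the third column
# is "dispersed_repeat", so nothing matches.
sample = [
    "##gff-version 3",
    "chr1\tRepeatMasker\tdispersed_repeat\t1\t9024\t8942\t+\t.\tID=1;Target=(CCCTAAA)n 1 9078",
]
print(count_ltr_features(sample))
```

With the RepeatMasker file shown above, every feature has `dispersed_repeat` in the third column, so the matched count would be zero, which is consistent with the TE information not being picked up.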
Thanks! Could you update quartet so that it can identify the output of RepeatMasker? It is widely used software for repeat annotation.
I have updated this in pre-release v1.1.7. Give it a try.
Wow! I will try it right now, thanks!
Dear developer,
I have tried v1.1.7, and it seems to fetch the TE information from RepeatMasker's output successfully, because `TElength` and `TEcoverage` are now detected:
# Chr start end length TRlength TRcoverage TElength TEcoverage ragionscore
# subTR period subTRlength subTRcoverage pattern
chr1 39137190 39942409 805220 640447 79.54% 73099 9.08% 0.8044481066641498
chr1@TR_00936 188 582837 72.38% GTTAGTAAGGGAAATTTGAGCAAAATTAGAAAACTCGTGTATTACACCCAGAAACGCGATTCGACTGAAAACCTTGTTATGGAACTGCTAGAAATACTCTATTTTATCCATGAGGGACATCTAGGGTCATTCCGAGCGCAACGCGCGGTCATTCCTAGACCATAAAAAAATCAATAATTTTCGTAGGG
chr1@TR_00932 187 520993 64.7% CATTCCGAGACCATAAAAAAATCAACAATTTTCATAGGGGTTAGTAAGGAAATTTGAGCAAAATTAGAAAACTCGTATATTACACTCAGAAACGCGATTCGACTGAAAAACTTGTTATGGAACTTCTAGAAATACTCTATTTTATCCATGAGGGACATCTAGGGTCATTCCGAGCACAACGCGCAAT
chr1@TR_00952 180 376665 46.78% AACTGCTAGAAATACTCTATTATATCCATGAGTGAAATCTGGGGTCATTCCGAGCGGATCATCCCGAGACAATAAAAAAATCTAAAATTTTCATAGGGGTTAGTAAGGGAAATTTGAGCAAAATTAGAAAAGTCGTATATTACACTGAGAAACGTGATTCGACTGAAAACCTTGTTATGG
But there is still a weird result, as below. As you can see, the identified centromere is still at the end of the chromosome. Could you please give me some advice on it? Thanks.
Sometimes another tandem-repeat-rich area scores higher than the centromere.
You can check the `candidate` folder and see whether the second-best or a lower-scored area represents a reasonable result.
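For comparing candidates side by side, a short script along these lines could rank the regions by score. It is a sketch under assumptions: it expects lines laid out like the result pasted earlier in this thread (nine whitespace-separated columns per region, with the score last, and shorter subTR lines in between), and the function name `rank_candidates` is made up for this example. The exact file names and layout inside the `candidate` folder are not confirmed here.

```python
def rank_candidates(lines):
    """Return (score, chrom, start, end) tuples sorted by score, best first.

    Assumes region lines have 9 whitespace-separated fields with the
    score in the last column; header (#) and subTR lines are skipped.
    """
    regions = []
    for line in lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.split()
        if len(fields) != 9:
            continue  # subTR lines have fewer fields
        chrom = fields[0]
        start, end = int(fields[1]), int(fields[2])
        score = float(fields[8])
        regions.append((score, chrom, start, end))
    regions.sort(reverse=True)  # highest score first
    return regions


# Usage on in-memory lines; the second region is a made-up example row.
example = [
    "# Chr start end length TRlength TRcoverage TElength TEcoverage score",
    "chr1 39137190 39942409 805220 640447 79.54% 73099 9.08% 0.8044481066641498",
    "chr1@TR_00936 188 582837 72.38% GTTAGTAAGG",
    "chr1 100 200 101 50 49.50% 10 9.90% 0.5",
]
for score, chrom, start, end in rank_candidates(example):
    print(chrom, start, end, score)
```

Looking at where the second and third entries fall on the chromosome should make it easier to judge whether one of them is a more plausible centromere than the top-scored region at the chromosome end.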
Thank you for your answer. I would also like to ask how to set `-r` properly when running quartet_centrominer.py.
Sorry, but as far as I remember there is no `-r` option in the CentroMiner module?
Sorry, I forgot to add this new parameter to the documentation. You can check the help on the command line. This parameter is set in millions; the default is 3.
Dear developer,
I am trying to use quartet_centrominer.py for centromere candidate prediction. My plant genome is about 500 Mb with 18 chromosomes, and I have submitted the following script to the Slurm scheduler:
```bash
#!/bin/bash
#SBATCH --job-name=quartet_centrominer
#SBATCH --partition acPartition
#SBATCH --cpus-per-task=128
#SBATCH --ntasks-per-node=1
#SBATCH -o quartet_centrominer.out
#SBATCH -e quartet_centrominer.err
#SBATCH --nodes=1
source /home/bio/.bashrc
source activate /home/bio/miniconda3/envs/mummer4
cd /home/bio/data/quertet-centrominer
python3 /home/bio/biosoft/quarTeT-1.1.5/quartet_centrominer.py -i ../4S_chr.fa -p cent_out -n 100 -m 200 -t 128
```
The job appears to be running for an unusually long time, exceeding 4 hours, and is utilizing a significant number of CPU cores (128 cores) and memory (256GB). Additionally, both the standard and error output files are empty, and there seem to be no changes in the files within the output directory (tmp/, candidate/, TRfasta/, and TRgff3/) for 2 hours. Could you please help me understand what might be happening and how I can address this issue?
Best regards,
Andrew