bergmanlab / mcclintock

Meta-pipeline to identify transposable element insertions using next generation sequencing data

Annotation step is repeated after preprocessing with --make_annotations #124

Open craftor18 opened 7 months ago

craftor18 commented 7 months ago

Hi, I am using the --make_annotations preprocessing step to run TE detection on multiple samples. After the --make_annotations run finished, I used -1 and -2 to add a sample and found that the annotation step is repeated for relocate2 and other steps, which means RepeatMasker is run again for every sample. Could you help me? Below is the command for my annotation step:

nohup python3 ~/software/mcclintock/mcclintock.py -c ref/TElib.fa -r ref/C.auratus.chromosome_20210819.fasta -p 80 -o output_template_all/   --serial --keep_intermediate all  --make_annotations > logs/make_annotation.log &

Then I added a sample using --resume:

nohup python3 ~/software/mcclintock/mcclintock.py -r ref/C.auratus.chromosome_20210819.fasta -c ref/TElib.fa -1 input_fastq/bc17_1.fastq.gz -2 input_fastq/bc17_2.fastq.gz -p 40 -m relocate,TEMP,ngs_te_mapper  -o ./output_template_all/  --resume    > logs/bc17_template.log &

I find that the RepeatMasker and bwa index steps were re-run. Could you please tell me why? Thanks. Best wishes!

craftor18 commented 7 months ago

Beyond that, I find that if I supply the TSV and GFF files directly as input and create a new folder for the sample's TE detection, mcclintock.py keeps running and never moves on to the next step.

craftor18 commented 7 months ago
-rw-rw-r-- 1 zengy zengy 1.2K May  7 00:06 bc17.log
(mcclintock) zengy@LiusServer:/data/zengy/reseq_C.auratus/work/mcclintock/logs$ cat bc17.log
SETUP            checking fasta: /data/zengy/reseq_C.auratus/work/mcclintock/ref/C.auratus.chromosome_20210819.fasta
SETUP            checking fastq: /data/zengy/reseq_C.auratus/work/mcclintock/input_fastq/bc17_1.fastq.gz
SETUP            checking fastq: /data/zengy/reseq_C.auratus/work/mcclintock/input_fastq/bc17_2.fastq.gz
SETUP            checking fasta: /data/zengy/reseq_C.auratus/work/mcclintock/ref/TElib.fa
SETUP            McClintock Version: 702acb4baacf53c732df84b9678490b8ea199495
SETUP            Checking config files to ensure previous intermediate files are compatible with this run
Job counts:
        count   jobs
        1       index_reference_genome
        1       make_ref_te_bed
        1       make_reference_fasta
        1       make_te_annotations
        1       map_reads
        1       median_insert_size
        1       ngs_te_mapper_post
        1       ngs_te_mapper_run
        1       process_temp
        1       reference_2bit
        1       relocaTE_consensus
        1       relocaTE_post
        1       relocaTE_ref_gff
        1       relocaTE_run
        1       run_temp
        1       sam_to_bam
        1       setup_reads
        1       summary_report
        1       telocate_taxonomy
        19
PROCESSING       formatting the name of consensus TE fasta headers for compatibility with relocaTE
PROCESSING       relocaTE consensus fasta created
(mcclintock) zengy@LiusServer:/data/zengy/reseq_C.auratus/work/mcclintock/logs$ cat make_annotation.log
SETUP            checking fasta: /data/zengy/reseq_C.auratus/work/mcclintock/ref/C.auratus.chromosome_20210819.fasta
SETUP            checking fasta: /data/zengy/reseq_C.auratus/work/mcclintock/ref/TElib.fa
SETUP            McClintock Version: 702acb4baacf53c732df84b9678490b8ea199495
Job counts:
        count   jobs
        1       make_consensus_fasta
        1       make_reference_fasta
        1       make_te_annotations
        3
PROCESSING       making consensus fasta
PROCESSING       consensus fasta created
PROCESSING       making reference fasta
PROCESSING       reference fasta created
PROCESSING       making reference TE annotations
PROCESSING       no reference TEs provided... finding reference TEs with RepeatMasker &> /data/zengy/reseq_C.auratus/work/mcclintock/output_template_all/logs/20240506.163954.7526138/processing.log
PROCESSING       reference TE annotations created

As shown above, the --make_annotations step and the --resume sample step perform the same processing, even though the --make_annotations run had already created all the files I need.
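One way to make the overlap between the two logs explicit is to compare their "Job counts:" tables. The following is a minimal sketch, assuming the table layout shown in the logs above; the function name `parse_job_counts` is hypothetical and not part of McClintock or Snakemake.

```python
def parse_job_counts(log_text):
    """Return the set of job names listed under a 'Job counts:' table."""
    jobs = set()
    in_table = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped == "Job counts:":
            in_table = True
            continue
        if not in_table:
            continue
        fields = stripped.split()
        if fields == ["count", "jobs"]:
            continue  # header row of the table
        # A job row is "<count> <job name>"; the bare total ends the table.
        if len(fields) == 2 and fields[0].isdigit():
            jobs.add(fields[1])
        else:
            in_table = False
    return jobs


# Job tables copied from make_annotation.log and bc17.log above.
annotation_log = """Job counts:
        count   jobs
        1       make_consensus_fasta
        1       make_reference_fasta
        1       make_te_annotations
        3
"""
resume_log = """Job counts:
        count   jobs
        1       index_reference_genome
        1       make_ref_te_bed
        1       make_reference_fasta
        1       make_te_annotations
        1       map_reads
        1       median_insert_size
        1       ngs_te_mapper_post
        1       ngs_te_mapper_run
        1       process_temp
        1       reference_2bit
        1       relocaTE_consensus
        1       relocaTE_post
        1       relocaTE_ref_gff
        1       relocaTE_run
        1       run_temp
        1       sam_to_bam
        1       setup_reads
        1       summary_report
        1       telocate_taxonomy
        19
"""

# Jobs scheduled in both runs, i.e. work the resumed run repeats.
rerun = sorted(parse_job_counts(annotation_log) & parse_job_counts(resume_log))
print(rerun)
```

For these two logs the intersection is make_reference_fasta and make_te_annotations, which matches the repeated RepeatMasker work described here.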

cbergman commented 6 months ago

Hi @craftor18

Could you try simplifying your initial --make_annotations execution and using full paths to your directories, e.g.

nohup python3 ~/software/mcclintock/mcclintock.py -r /full/path/to/ref/C.auratus.chromosome_20210819.fasta -c /full/path/to/ref/TElib.fa -p 80 -o /full/path/to/output_template_all/ --make_annotations > /full/path/to/logs/make_annotation.log &
nohup python3 ~/software/mcclintock/mcclintock.py -r /full/path/to/ref/C.auratus.chromosome_20210819.fasta -c /full/path/to/ref/TElib.fa -1 /full/path/to/input_fastq/bc17_1.fastq.gz -2 /full/path/to/input_fastq/bc17_2.fastq.gz -p 40 -m relocate,TEMP,ngs_te_mapper -o /full/path/to/output_template_all/ --resume > /full/path/to/logs/bc17_template.log &
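If typing the full paths by hand is error-prone, the argument list could be assembled programmatically. This is only a sketch: the helper names `absolute` and `build_command` are hypothetical, the relative paths are the ones from the original report, and only flags that already appear in this thread are used.

```python
import os
import shlex


def absolute(path):
    """Expand '~' and resolve a possibly relative path to an absolute one."""
    return os.path.abspath(os.path.expanduser(path))


def build_command(script, reference, consensus, out_dir, extra_args=()):
    """Assemble a mcclintock.py argument list with absolute input and output paths."""
    return [
        "python3", absolute(script),
        "-r", absolute(reference),
        "-c", absolute(consensus),
        "-o", absolute(out_dir),
        *extra_args,
    ]


# Relative paths as used in the original report; they are resolved against
# the current working directory, so run this from the project directory.
command = build_command(
    "~/software/mcclintock/mcclintock.py",
    "ref/C.auratus.chromosome_20210819.fasta",
    "ref/TElib.fa",
    "output_template_all/",
    extra_args=["-p", "80", "--make_annotations"],
)

# Print a shell-quoted command line that can be copied into a terminal.
print(shlex.join(command))
```

Using the same helper for both the --make_annotations run and the --resume run would keep the -r, -c and -o paths identical between the two invocations.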

If this doesn't work, can you upload the complete make_annotation.log and bc17_template.log files?

Thanks, Casey

craftor18 commented 6 months ago

Thanks for the reply. I'll try full paths, but I don't think it's a path problem: I used --make_annotations to generate an output directory and then resumed a run for a sample in that directory, yet it re-ran the RepeatMasker step to generate the annotation, and I deleted it. Maybe I should try another version of McClintock. Could you please tell me which version I should use: the release version, the master version, or the latency fix version? I am currently on the master version, but I use the mcclintock.py from the latency fix version. Best wishes

craftor18 commented 6 months ago

Hello, I've tried another way to prepare my GFF and TSV files. I used EDTA to make the GFF file, then used some commands to produce my input GFF and TSV files. The GFF is:

(mcclintock) zengy@LiusServer:/data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome$ head test.gff
LG01    EDTA    Mutator_TIR_transposon  10618026        10621335        .       .       .       ID=TE_struc_145;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=1;Method=structural;TSD=TGCAGCTGCA_TGCAGCTGCA_100.0;TIR=GCAACTTGCG_CGCAAGTTGC
LG01    EDTA    Mutator_TIR_transposon  15685590        15686152        4775    -       .       ID=TE_homo_272562;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.989;Method=homology
LG01    EDTA    Mutator_TIR_transposon  15688968        15689443        3968    +       .       ID=TE_homo_272566;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.957;Method=homology
LG01    EDTA    Mutator_TIR_transposon  15831079        15831563        3946    -       .       ID=TE_homo_272818;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.953;Method=homology
LG01    EDTA    Mutator_TIR_transposon  15839967        15840529        4754    +       .       ID=TE_homo_272827;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.98;Method=homology
LG01    EDTA    Mutator_TIR_transposon  20330106        20330666        4640    +       .       ID=TE_homo_279690;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.982;Method=homology
LG01    EDTA    Mutator_TIR_transposon  20334906        20335461        4277    +       .       ID=TE_homo_279697;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.949;Method=homology
LG02    EDTA    Mutator_TIR_transposon  11438649        11439221        4686    -       .       ID=TE_homo_390467;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.97;Method=homology
LG02    EDTA    Mutator_TIR_transposon  11439222        11439368        1167    +       .       ID=TE_homo_390468;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.952;Method=homology
LG02    EDTA    Mutator_TIR_transposon  23939864        23940375        3663    +       .       ID=TE_homo_408718;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.967;Method=homology
(mcclintock) zengy@LiusServer:/data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome$

and tsv is:

(mcclintock) zengy@LiusServer:/data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome$ head test.tsv
TE_struc_145    TE_00000001
TE_homo_272562  TE_00000001
TE_homo_272566  TE_00000001
TE_homo_272818  TE_00000001
TE_homo_272827  TE_00000001
TE_homo_279690  TE_00000001
TE_homo_279697  TE_00000001
TE_homo_390467  TE_00000001
TE_homo_390468  TE_00000001
TE_homo_408718  TE_00000001
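The commands used to derive the TSV from the EDTA GFF are not shown in the thread. A minimal sketch of one way to do it, assuming the GFF is tab-separated and that the ID and Name attributes in column 9 are the TE identifier and family, could look like this. The function name `gff_to_taxonomy` is hypothetical, and the attribute strings in the example are shortened from the listing above.

```python
def gff_to_taxonomy(gff_lines):
    """Yield (ID, Name) pairs taken from the attribute column of GFF3 lines."""
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a GFF3 feature line
        # Column 9 holds semicolon-separated key=value attributes.
        attributes = dict(
            field.split("=", 1) for field in columns[8].split(";") if "=" in field
        )
        if "ID" in attributes and "Name" in attributes:
            yield attributes["ID"], attributes["Name"]


# Two feature lines modelled on the listing above (attributes shortened).
example = [
    "\t".join(["LG01", "EDTA", "Mutator_TIR_transposon", "10618026", "10621335",
               ".", ".", ".",
               "ID=TE_struc_145;Name=TE_00000001;Classification=DNA/DTM"]),
    "\t".join(["LG01", "EDTA", "Mutator_TIR_transposon", "15685590", "15686152",
               "4775", "-", ".",
               "ID=TE_homo_272562;Name=TE_00000001;Classification=DNA/DTM"]),
]

rows = list(gff_to_taxonomy(example))
for te_id, family in rows:
    # One tab-separated "ID<TAB>family" row per GFF feature.
    print(f"{te_id}\t{family}")
```

The printed rows have the same two-column layout as the test.tsv excerpt above.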

and my run command (wd: /data/zengy/reseq_C.auratus/work/non_ref_TE) is:

nohup python3 ~/software/mcclintock/mcclintock.py -r ./ref_genome/C.auratus.chromosome_20210819.fasta -c ./ref_genome/test.fa -1 ./00_input_fastq_datas/bc17_1.fastq.gz -2 ./00_input_fastq_datas/bc17_2.fastq.gz -p 8 -m relocate2,temp2,ngs_te_mapper2 -o ./06_mcclintock/ --sample_name bc17 -g ./ref_genome/test.gff -t ./ref_genome/test.tsv > mcclintock_bc17.log &

Why is its log:

(mcclintock) zengy@LiusServer:/data/zengy/reseq_C.auratus/work/non_ref_TE$ cat mcclintock_bc17.log
SETUP            checking fasta: /data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome/C.auratus.chromosome_20210819.fasta
SETUP            checking fastq: /data/zengy/reseq_C.auratus/work/non_ref_TE/00_input_fastq_datas/bc17_1.fastq.gz
SETUP            checking fastq: /data/zengy/reseq_C.auratus/work/non_ref_TE/00_input_fastq_datas/bc17_2.fastq.gz
SETUP            checking fasta: /data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome/test.fa
SETUP            checking locations gff: /data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome/test.gff
SETUP            checking taxonomy TSV: /data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome/test.tsv
SETUP            McClintock Version: 702acb4baacf53c732df84b9678490b8ea199495
Job counts:
        count   jobs
        1       index_reference_genome
        1       make_consensus_fasta
        1       make_ref_te_bed
        1       make_reference_fasta
        1       make_te_annotations
        1       map_reads
        1       median_insert_size
        1       ngs_te_mapper2_post
        1       ngs_te_mapper2_pre
        1       ngs_te_mapper2_run
        1       process_temp2
        1       reference_2bit
        1       relocaTE2_post
        1       relocaTE2_run
        1       repeatmask
        1       run_temp2
        1       sam_to_bam
        1       setup_reads
        1       summary_report
        1       telocate_taxonomy
        20
PROCESSING       making consensus fasta
PROCESSING       consensus fasta created
PROCESSING       making reference fasta
PROCESSING       reference fasta created
PROCESSING       creating 2bit file from reference genome fasta &> /data/zengy/reseq_C.auratus/work/non_ref_TE/06_mcclintock/logs/20240511.205850.2061338/processing.log
PROCESSING       reference 2bit file created
Failed to solve scheduling problem with ILP solver. Falling back to greedy solver.Run Snakemake with --verbose to see the full solver output for debugging the problem.

This is not actually an error, and the program is still running. But I've supplied a GFF and a TSV file, so why does the program still run a RepeatMasker process to regenerate the annotation file? Below are the processes that are running:

top - 21:07:45 up 2 days, 10:09,  2 users,  load average: 5.35, 4.50, 4.60
Tasks: 817 total,   6 running, 811 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.0 us,  1.3 sy,  0.0 ni, 93.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 128393.1 total,   3840.5 free,  10551.6 used, 115048.4 buff/cache
MiB Swap:   8192.0 total,   7804.0 free,    388.0 used. 117841.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
4037324 zengy     20   0  992472 581176   2308 R 100.0   0.4   8:03.51 bwa index /data/zengy/reseq_C.auratus/work/non_ref_TE/06_mcclintock/C.auratus.chromosome_20210819/genome_fasta/C.auratus.chromosome_20210819.fasta
4040162 zengy     20   0    3560   1812   1400 R  99.3   0.0   3:30.79 gzip -cd /data/zengy/reseq_C.auratus/work/non_ref_TE/00_input_fastq_datas/bc17_2.fastq.gz
4046688 zengy     20   0  514828 489860   2824 R  30.9   0.4   0:00.94 perl /home/zengy/software/mcclintock/install/envs/conda/46211ea1/share/RepeatMasker/RepeatMasker -pa 3 -lib /data/zengy/reseq_C.auratus/work/non_ref_TE/06_mcclintock/C.auratus.chromosome_2+
4046699 zengy     20   0  514828 490736   3696 R  18.8   0.4   0:00.57 perl /home/zengy/software/mcclintock/install/envs/conda/46211ea1/share/RepeatMasker/RepeatMasker -pa 3 -lib /data/zengy/reseq_C.auratus/work/non_ref_TE/06_mcclintock/C.auratus.chromosome_2+
4037240 zengy     20   0  527052 504164   4904 S  12.5   0.4   3:21.62 perl /home/zengy/software/mcclintock/install/envs/conda/46211ea1/share/RepeatMasker/RepeatMasker -pa 3 -lib /data/zengy/reseq_C.auratus/work/non_ref_TE/06_mcclintock/C.auratus.chromosome_2+
4046716 zengy     20   0  526512 502416   3696 R   4.6   0.4   0:00.14 perl /home/zengy/software/mcclintock/install/envs/conda/46211ea1/share/RepeatMasker/RepeatMasker -pa 3 -lib /data/zengy/reseq_C.auratus/work/non_ref_TE/06_mcclintock/C.auratus.chromosome_2+

Can you explain it? Thank you very much! Best wishes

craftor18 commented 6 months ago

It seems my input GFF and TSV files only work for the ngs_te_mapper2 method

craftor18 commented 6 months ago

I also find that when the TSV and GFF files contain too many TE family lines, parsing the parameters takes a very long time
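Before passing large GFF and TSV files with -g and -t, it may help to check that they are consistent with each other and with the consensus FASTA given to -c. The sketch below assumes, without confirmation in this thread, that every GFF ID should be unique and listed in the TSV, and that every family in the TSV should match a sequence name in the consensus FASTA; please verify these expectations against the McClintock documentation. The function name `check_inputs` and the FASTA name `TE_00000002` are hypothetical.

```python
from collections import Counter


def check_inputs(gff_ids, taxonomy_rows, consensus_names):
    """Return a dict listing inconsistencies across the three inputs."""
    id_counts = Counter(gff_ids)
    taxonomy = dict(taxonomy_rows)  # maps TE ID -> family
    return {
        # IDs that occur more than once in the GFF.
        "duplicate_gff_ids": sorted(i for i, n in id_counts.items() if n > 1),
        # GFF IDs with no row in the taxonomy TSV.
        "ids_missing_from_tsv": sorted(set(gff_ids) - set(taxonomy)),
        # Families in the TSV with no matching consensus sequence name.
        "families_missing_from_fasta": sorted(
            set(taxonomy.values()) - set(consensus_names)
        ),
    }


# Small illustrative inputs; the IDs come from the listings above, while the
# duplicate entry and the FASTA name are made up for the example.
problems = check_inputs(
    gff_ids=["TE_struc_145", "TE_homo_272562", "TE_homo_272562"],
    taxonomy_rows=[("TE_struc_145", "TE_00000001")],
    consensus_names=["TE_00000002"],
)
for name, values in problems.items():
    print(name, values)
```

An empty list for every key would mean the three files agree under the assumptions stated above.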