dfguan / purge_dups

haplotypic duplication identification tool
MIT License
213 stars · 22 forks

empty purged.fa #26

Open Giacomoggioli4111 opened 4 years ago

Giacomoggioli4111 commented 4 years ago

Hello,

I am trying to run the manual pipeline, because the cluster I am working on (Ubuntu-based) is not very permissive and I was having problems with the "automated" pipeline. I have obtained a 1.6 Gb assembly with Canu 1.8 using 120x coverage PacBio data. BUSCO duplicated genes are at 82.2%. At the end, the program produces an empty file. I am sending you the script I used to run the pipeline and the contents of my working directory. Do you think I am making any errors, and if so, what can I improve?

Best regards,

my script:

#!/bin/bash
#$ -wd /wd/
#$ -j y
#$ -o /wd/
#$ -pe smp 12
#$ -l h_vmem=30G
#$ -l h_rt=240:0:0
#$ -l highmem

#variables
ref_genome=/path/host.fasta #1.6G
pb_fasta=/path/pb_raw.fastq.gz #34G
ref_genome_split=/wd/host.split.fasta
self_aln_genome=/wd/host.split.self.paf.gz

module load python
module load minimap2

cd /wd/

echo 'MINIMAP2 ______________________________________________________________________'
minimap2 -x map-pb $ref_genome $pb_fasta | gzip -c - > purge.paf.gz

echo 'PBCSTAT _______________________________________________________________________'
/bin/pbcstat *.paf.gz #(produces PB.base.cov and PB.stat files)

echo 'CALCUTS _______________________________________________________________________'
/bin/calcuts PB.stat > cutoffs 2> calcults.log

echo 'SPLIT_FA ______________________________________________________________________'
/bin/split_fa $ref_genome > $ref_genome_split

echo 'MINIMAP2 ______________________________________________________________________'
minimap2 -x asm5 -DP $ref_genome_split $ref_genome_split | gzip -c - > $self_aln_genome

echo 'PURGE_DUPS ____________________________________________________________________'
/bin/purge_dups -2 -T cutoffs -c PB.base.cov $self_aln_genome > dups.bed 2> purge_dups.log

echo 'GET_SEQS ______________________________________________________________________'
/bin/get_seqs dups.bed $pri_asm > purged.fa 2> hap.fa

my working directory:

-rw-r--r-- 1 user  195 Feb 22 13:40 calcults.log
-rw-r--r-- 1 user   19 Feb 22 13:40 cutoffs
-rw-r--r-- 1 user  16K Feb 22 14:48 dups.bed
-rw-r--r-- 1 user  514 Feb 22 14:48 hap.fa
-rw-r--r-- 1 user 1.5G Feb 22 13:40 host.split.fasta
-rw-r--r-- 1 user 138K Feb 22 14:48 host.split.self.paf.gz
-rw-r--r-- 1 user 259M Feb 22 13:40 PB.base.cov
-rw-r--r-- 1 user 4.9M Feb 22 13:40 PB.cov.wig
-rw-r--r-- 1 user 5.2K Feb 22 13:40 PB.stat
-rw-r--r-- 1 user    0 Feb 22 14:48 purged.fa
-rw-r--r-- 1 user  15K Feb 22 14:48 purge_dups.log
-rw-r--r-- 1 user  16K Feb 22 14:48 purge_dups_v1.2.sh.o999262
-rw-r--r-- 1 user 603M Feb 22 13:39 purge.paf.gz
dfguan commented 4 years ago

Hi, thanks for trying purge_dups. Please use the latest release, and replace $pri_asm with $ref_genome in the "/bin/get_seqs dups.bed $pri_asm" line of your script.
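Spelled out, the corrected final step would look something like this (a sketch: the get_seqs path and the $ref_genome value are taken from the script above, and the existence check is only there so the snippet is safe to copy-paste):

```shell
# $pri_asm was never defined in the script, so get_seqs received an empty
# assembly argument and wrote an empty purged.fa. Reuse $ref_genome instead.
ref_genome=/path/host.fasta
if [ -x /bin/get_seqs ]; then
    /bin/get_seqs dups.bed "$ref_genome" > purged.fa 2> hap.fa
fi
```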

Dengfeng.

Giacomoggioli4111 commented 4 years ago

Oh! That was an avoidable error... Thanks for the suggestion; I will let you know how it goes. Best, Giacomo

Giacomoggioli4111 commented 4 years ago

Purge_dups produced an output this time, but I think something else went wrong. Before the pipeline my genome had 2410 contigs in a 1.6 Gb file; after the pipeline I obtained purged.fa, which is 1.5 Gb and contains 2218 contigs, and hap.fa, which is 12 Mb and contains 200 contigs. Considering that BUSCO duplication is around 80%, I would expect purged.fa to be around 950 Mb. What do you think about this? Thank you very much, Giacomo

dfguan commented 4 years ago

Hi Giacomo,

I guess the cutoffs generated by purge_dups might be wrong. Could you please share the coverage plot with me? Just use the hist_plot script in the purge_dups scripts directory; the command should be "python3 hist_plot -c cutoffs PB.stat PB.png".

Cheers,

Dengfeng.


Giacomoggioli4111 commented 4 years ago

Hi Dengfeng,

I am sending you the PB.png graph.

Cheers,

Giacomo

[attached: PB.png coverage histogram]

dfguan commented 4 years ago

Hi Giacomo, the coverage cutoffs seem to be good, but your host.split.self.paf.gz is far too small to be right for a 1.6 Gb genome; I think the run may not have finished successfully. Could you please rerun the minimap2 self-alignment? Dengfeng.
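A quick way to check that suspicion (a sketch; the file name comes from the script above, and standard gzip/zcat tools are assumed) is to test the gzip stream and count the alignment records:

```shell
paf=host.split.self.paf.gz
if [ -f "$paf" ]; then
    # An interrupted minimap2 run often leaves a truncated gzip stream,
    # and a 1.6 Gb self-alignment should yield far more than a few
    # hundred PAF records.
    gzip -t "$paf" && echo "gzip stream intact"
    echo "alignment records: $(zcat "$paf" | wc -l)"
fi
```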

Giacomoggioli4111 commented 4 years ago

Hi Dengfeng,

Minimap2 self alignment produced the same output as before. I am sending you the log:

[M::mm_idx_gen::51.491*1.43] collected minimizers
[M::mm_idx_gen::62.685*1.65] sorted minimizers
[M::main::62.686*1.65] loaded/built the index for 2410 target sequence(s)
[M::mm_mapopt_update::66.141*1.62] mid_occ = 428
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 2410
[M::mm_idx_stat::67.648*1.61] distinct minimizers: 61823830 (55.78% are singletons); average occurrences: 3.093; average spacing: 8.365
[M::worker_pipeline::2241.710*2.86] mapped 410 sequences
[M::worker_pipeline::4281.427*2.92] mapped 761 sequences
[M::worker_pipeline::6894.897*2.94] mapped 886 sequences
[M::worker_pipeline::7563.985*2.90] mapped 353 sequences
[M::main] Version: 2.5-r574-dirty
[M::main] CMD: minimap2 -x asm5 -DP /path/host.split.fasta /path/host.split.fasta
[M::main] Real time: 7564.098 sec; CPU: 21947.003 sec

What should I try next? Thank you very much!

Cheers,

Giacomo

dfguan commented 4 years ago

Wow, unbelievable. Is it possible to share the self.paf file or the assembly with me, so I can run it myself and check what happened? Dengfeng.

Giacomoggioli4111 commented 4 years ago

Yeah, sure. How can I send it to you? I can send you a download link via email, if that works for you.

dfguan commented 4 years ago

Please send me a link. Thanks, Dengfeng.


Giacomoggioli4111 commented 4 years ago

I have just sent you an email on your gmail account. Bests, Giacomo

dfguan commented 4 years ago

Hi Giacomo, it seems you are not using the latest minimap2. Could you please redo the whole pipeline with the latest minimap2 (https://github.com/lh3/minimap2/releases)? Dengfeng.
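One quick check worth doing before the rerun (a sketch, not from the thread): confirm which minimap2 the module system actually puts on your PATH, since the log above reports the old version 2.5-r574-dirty.

```shell
# Print the minimap2 version currently on PATH, if any.
if command -v minimap2 >/dev/null 2>&1; then
    minimap2 --version   # compare against the latest release tag
else
    echo "minimap2 not on PATH"
fi
```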

Giacomoggioli4111 commented 4 years ago

Hi Dengfeng, I have followed your suggestions and obtained good results! What is your opinion of these numbers? Do you think the BUSCO completeness can be further improved?

| Species | Stage | BUSCO | Genome size | N50 |
| --- | --- | --- | --- | --- |
| 1 | before | C:96.6%[S:11.7%,D:84.9%] | 1.6 Gb | 1.8 Mb |
| 1 | after | C:96.4%[S:93.3%,D:3.1%] | 814 Mb | 2.97 Mb |
| 2 | before | C:80.0%[S:21.5%,D:58.5%] | 596 Mb | 311 Kb |
| 2 | after | C:79.3%[S:78.2%,D:1.1%] | 292 Mb | 423 Kb |

Best,

Giacomo

dfguan commented 4 years ago

Hi Giacomo, the results look good. You can try polishing to pull some genes back. Dengfeng.