Open Giacomoggioli4111 opened 4 years ago
Hi, thanks for trying purge_dups
please use the latest release and repace $pri_asm
with $ref_genome
in /bin/get_seqs dups.bed $pri_asm
in your script.
Dengfeng.
Oh! This was an avoidable error... Thanks for the suggestion, I will let you know how it goes. Bests, Giacomo
Purge_dups produced an output this time, but I think something else went wrong. Before the pipeline my genome has 2410 contigs and the file is 1.6G, after the pipeline I have obtained purged.fa which is 1.5G and contains 2218 contigs and hap.fa which is 12M and contains 200 contigs. Considering that Busco duplication is around 80% I would expect to obtain a purged.fa file around 950M. What do you think about this? Thank you very much, Giacomo
Hi Giacomo,
I guess the cutoffs generated by purge_dups can be wrong, could you please share the coverage plot with me? Just use hist_plot script in purge_dups scripts directory, the command should be "python3 hist_plot -c cutoffs PB.stat PB.png"
Cheers,
Dengfeng.
On Fri, Feb 28, 2020 at 9:50 PM Giacomoggioli4111 notifications@github.com wrote:
Purge_dups produced an output this time, but I think something else went wrong. Before the pipeline my genome has 2410 contigs and the file is 1.6G, after the pipeline I have obtained purged.fa which is 1.5G and contains 2218 contigs and hap.fa which is 12M and contains 200 contigs. Considering that Busco duplication is around 80% I would expect to obtain a purged.fa file around 950M. What do you think about this? Thank you very much, Giacomo
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dfguan/purge_dups/issues/26?email_source=notifications&email_token=ABORPAEEIQ2ZGKJKVNGSZF3RFEJBJA5CNFSM4K3F3R6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENISOFQ#issuecomment-592520982, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABORPAHETHCIA75HLDOHS4LRFEJBJANCNFSM4K3F3R6A .
Hi Dengfeng,
I am sending you the PB.png graph.
Cheers,
Giacomo
Hi Giacomo, the coverage cutoffs seem to be good, your host.split.self.paf.gz is too small to be true for a 1.6 Gb genome, it think it may not be finished successfully. Could you please rerun minimap2 self aligment again? Dengfeng.
Hi Dengfeng,
Minimap2 self alignment produced the same output as before. I am sending you the log:
[M::mm_idx_gen::51.491*1.43] collected minimizers
[M::mm_idx_gen::62.685*1.65] sorted minimizers
[M::main::62.686*1.65] loaded/built the index for 2410 target sequence(s)
[M::mm_mapopt_update::66.141*1.62] mid_occ = 428
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 2410
[M::mm_idx_stat::67.648*1.61] distinct minimizers: 61823830 (55.78% are singletons); average occurrences: 3.093; average spacing: 8.365
[M::worker_pipeline::2241.710*2.86] mapped 410 sequences
[M::worker_pipeline::4281.427*2.92] mapped 761 sequences
[M::worker_pipeline::6894.897*2.94] mapped 886 sequences
[M::worker_pipeline::7563.985*2.90] mapped 353 sequences
[M::main] Version: 2.5-r574-dirty
[M::main] CMD: minimap2 -x asm5 -DP /path/host.split.fasta /path/host.split.fasta
[M::main] Real time: 7564.098 sec; CPU: 21947.003 sec
What should I try next? Thank you very much!
Cheers,
Giacomo
Wow, unbelievable. Is it possible to share the self.paf file or the assembly with me? So I can run by myself and check what have happened? Dengfeng.
Yeah, sure. How can I send it to you? I can send you a download link via email, if it is ok for you.
Please send me a link. Thanks, Dengfeng.
On Wed, Mar 4, 2020 at 7:56 PM Giacomoggioli4111 notifications@github.com wrote:
Yeah, sure. How can I send it to you? I can send you a download link via email, if it is ok for you.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dfguan/purge_dups/issues/26?email_source=notifications&email_token=ABORPAFLKBO653LFIQWNBD3RFY6XNA5CNFSM4K3F3R6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENXQNVY#issuecomment-594478807, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABORPAALEURYEV5CI27EBZ3RFY6XNANCNFSM4K3F3R6A .
I have just sent you an email on your gmail account. Bests, Giacomo
Hi Giacomo, it seems you are not using the latest minimap2. Could you please redo the whole pipeline with the latest minimap2 (https://github.com/lh3/minimap2/releases)? Dengfeng.
Hi Denfeng, I have followed your suggestions and I have obtained good results! What is your opinion about this? Do you think the Busco completeness can be further improved?
Species | Busco | Genome size | N50 |
---|---|---|---|
1 before | C:96.6%[S:11.7%,D:84.9%] | 1.6 Gb | 1.8 Mb |
1 after | C:96.4%[S:93.3%,D:3.1%] | 814 Mb | 2.97 Mb |
2 before | C:80.0%[S:21.5%,D:58.5%] | 596 Mb | 311 Kb |
2 after | C:79.3%[S:78.2%,D:1.1%] | 292 Mb | 423 Kb |
Betsts,
Giacomo
Hi Giacomo, the results look good. You can try to do polishing to pull some genes back. Dengfeng.
Hello,
I am trying to run the manual pipeline, because the cluster I am working on (ubuntu based) is not very permissive and I was having problems with the "automated" pipeline. I have obtained a 1.6Gb assembly with Canu 1.8 using 120x coverage PacBio data. Busco duplicated genes are the 82.2%. At the end the program produces an empty file. I am sending you the script I have used to run the pipeline and the content of my working directory, do you think I am doing some errors, if so, what can I improve?
Best regards,
my script:
my working directory: