Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

What's the difference between -A and -q 10 in nextgraph_options #164

Closed Yutang-ETH closed 1 year ago

Yutang-ETH commented 1 year ago

Question or Expected behavior: How should I understand the -A and -q options in nextgraph_options? Is -A the same as -q 0?

Operating system: CentOS Linux 7

GCC: gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

Python: Python 3.11.0

NextDenovo version: v2.5.0

Additional context (Optional): I am assembling a grass genome with some degree of heterozygosity (> 1.5%, estimated by genomescope2 from short paired-end reads). The estimated haploid genome size is around 2.8 Gb, and the assembly size I got from NextDenovo is 3.5 Gb, which is expected because of the heterozygosity. However, after nextpolish, the total BUSCO score is only 87% (S: 43%, D: 44%). I am wondering why the BUSCO score is low and how I could improve it. One idea is to change nextgraph_options, for example the -q value. So far I have used -q 10, since the FAQ says it is the best value, but after reading #130 I realized that I might try -A instead of -q, or maybe -q 5. Could you please help me understand -A and -q? What does the number (5-16) mean for -q? Is it possible to improve BUSCO by playing with -A and -q?

Thank you very much, Yutang

Yutang-ETH commented 1 year ago

Here I paste my run.cfg:

job_type = local
job_prefix = nextDenovo
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 3
parallel_jobs = 6
input_type = raw
read_type = ont
input_fofn = ./input.fofn
workdir = ./kolumbus_nextdenovo

[correct_option]
read_cutoff = 2k
genome_size = 2.8g
seed_depth = 40
pa_correction = 6
sort_options = -m 100g -t 60
minimap2_options_raw = -I 100G -t 10
correction_options = -p 10

[assemble_option]
minimap2_options_cns = -I 100G -t 10
nextgraph_options = -a 1 -q 10

Yutang-ETH commented 1 year ago

Here is the assembly statistics:

Type Length (bp) Count (#)
N10 52246795 6
N20 38569451 14
N30 29746267 24
N40 23908370 37
N50 15383517 57
N60 11340932 84
N70 7721969 121
N80 5067527 177
N90 2492629 276

Min. 144673 -
Max. 70971017 -
Ave. 5893577 -
Total 3524359418 598
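For readers comparing tables like the one above, here is a minimal sketch of how Nx values and their contig counts are conventionally computed from a list of contig lengths. It uses the standard definition (the length of the contig at which the running sum of sorted lengths first reaches x% of the total). The function name `nx_stats` is hypothetical, and NextDenovo's own statistics code may handle edge cases differently.

```python
def nx_stats(lengths, fractions=(10, 20, 30, 40, 50, 60, 70, 80, 90)):
    """Return {x: (Nx length, number of contigs needed to reach x% of the total)}.

    `fractions` must be sorted in ascending order.
    """
    ordered = sorted(lengths, reverse=True)  # longest contig first
    total = sum(ordered)
    pending = list(fractions)
    stats = {}
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the thresholds.
        while pending and running * 100 >= total * pending[0]:
            stats[pending.pop(0)] = (length, count)
    return stats


# Toy contig lengths for illustration only, not taken from the assembly above.
toy = [50, 40, 30, 20, 10]
for x, (length, count) in nx_stats(toy).items():
    print(f"N{x}\t{length}\t{count}")
```

Running the same function on the contig lengths of two assemblies gives directly comparable tables, independent of which tool produced them.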

The program finished without any error, so I guess no need to paste the log here, but if you want to have a look, I can provide it.

Thank you very much again and look forward to your reply.

Best wishes, Yutang

Yutang-ETH commented 1 year ago

By the way, I ran a KAT analysis on the assembly. There is some black in the second peak, which means some homozygous regions are not present in the assembly; this might be the reason for the low BUSCO value. [Image: Kolumbus_nextpolish_kat]

moold commented 1 year ago

-A is used to control whether to output alternative contigs (aka heterozygous sequences), but some sequences may be misclassified as heterozygous. -q is used to control the minimum short branch length (aka short contigs). Because your genome is highly heterozygous, I think you can try both.

Yutang-ETH commented 1 year ago

Hi @moold ,

Thank you very much for your reply. I see. In my first run, I tried nextgraph_options = -a 1 -q 10 and the statistics are what I shared above. Yesterday, I made a second attempt with nextgraph_options = -a 1 -A; the statistics are as follows:

Type Length (bp) Count (#)
N10 35553848 7
N20 24637086 19
N30 20071967 35
N40 15791253 55
N50 12563277 80
N60 10122220 111
N70 8317248 149
N80 5594572 200
N90 3215398 282

Min. 90334 -
Max. 70572383 -
Ave. 6207453 -
Total 3519625887 567

Compared to -q 10, -A outputs fewer contigs (567 vs. 598), and the assembly is a bit smaller and less contiguous. If I want to increase the assembly size, should I turn the -q value up or down? I assume increasing the assembly size could recover some BUSCO genes. Or what do you think? What parameters would you recommend I tune in nextgraph_options? The problem I have now is that some regions of the genome are missing from the assembly (according to the k-mer plot), and I would like to have these missing regions assembled. Simply put, I would like more alleles to be separated rather than collapsed.

By the way, am I understanding it right, when you say I can use -q and -A both, do you mean I can include them both in the command, like nextgraph_options = -a 1 -A -q 10?

Thank you very much for your help.

Best wishes, Yutang

moold commented 1 year ago

Try something like -A -q 10. I don't have a good suggestion, except to try more, because each genome has its own characteristics.
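Since the advice is to try several settings, the following is a minimal sketch of generating run.cfg variants that differ only in nextgraph_options. The helper name `with_nextgraph_options` is hypothetical, the base text is an excerpt of the run.cfg posted earlier in the thread, and only option combinations already mentioned in this thread are used.

```python
import re


def with_nextgraph_options(cfg_text, options):
    """Return cfg_text with the value of its nextgraph_options line replaced."""
    new_text, n = re.subn(
        r"(?m)^(\s*nextgraph_options\s*=\s*).*$",
        lambda m: m.group(1) + options,  # keep the key, swap the value
        cfg_text,
    )
    if n != 1:
        raise ValueError(f"expected exactly one nextgraph_options line, found {n}")
    return new_text


# Excerpt of the run.cfg from this thread.
base = (
    "[assemble_option]\n"
    "minimap2_options_cns = -I 100G -t 10\n"
    "nextgraph_options = -a 1 -q 10\n"
)

# Only combinations discussed in this thread.
candidates = ["-a 1 -q 10", "-a 1 -A", "-a 1 -A -q 10"]
variants = {opts: with_nextgraph_options(base, opts) for opts in candidates}

for opts, text in variants.items():
    print(f"--- nextgraph_options = {opts}")
    print(text)
```

Each variant would presumably also need its own workdir so that runs do not overwrite one another; that is an assumption about the workflow, not something stated in the thread.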

Yutang-ETH commented 1 year ago

Thank you very much for your help. I will try what you suggested.

I will close the thread for the moment, but if I get any results, I will post them here. Thank you very much again.

Best wishes, Yutang

Yutang-ETH commented 1 year ago

Hi @moold,

I saw this while I was checking files NextDenovo produced:

hostname

Min. 77015 -
Max. 77839273 -
Ave. 5977250 -
Total 5056753236 846
[WARNING] 2023-02-07 22:13:22 Unfinished assembly, this is a limited version, currently only supports assembly for genome size < 3500000000 bp, please ask for help.
[INFO] 2023-02-07 22:13:22 CMD: /scratch/yutang/NextDenovo/bin/nextgraph -a 1 -A -q 10 -f /scratch/yutang/kolumbus/./kolumbus_nextdenovo/03.ctg_graph/01.ctg_graph.input.seqs -o nd.asm.p.fasta /scratch/yutang/kolumbus/./kolumbus_nextdenovo/03.ctg_graph/01.ctg_graph.input.ovls
[INFO] 2023-02-07 22:13:22 Real time: 1056.843 sec; CPU: 109.155 sec; Peak RSS: 0.917 GB

real 17m37.081s
user 1m26.475s
sys 0m22.722s
touch /scratch/yutang/kolumbus/kolumbus_nextdenovo/03.ctg_graph/01.ctg_graph.sh.work/ctg_graph1/nextDenovo.sh.done

What does the WARNING message mean? Is this the reason why I cannot get assembly size over 3.5 Gb? How could we solve this problem?
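Because the run finished without an error, a warning like this one is easy to overlook. Below is a minimal sketch of scanning log text for non-INFO lines. It assumes the `[LEVEL] YYYY-MM-DD HH:MM:SS message` layout seen in the log above; the function name `find_warnings` is hypothetical, and the sample contains only lines quoted in this thread.

```python
import re

# Assumed layout, based on the log lines quoted above: [LEVEL] date time message
LOG_LINE = re.compile(r"\[([A-Z]+)\] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)")


def find_warnings(log_text):
    """Return (timestamp, level, message) for every log line whose level is not INFO."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match and match.group(1) != "INFO":
            found.append((match.group(2), match.group(1), match.group(3)))
    return found


sample = """\
[WARNING] 2023-02-07 22:13:22 Unfinished assembly, this is a limited version, currently only supports assembly for genome size < 3500000000 bp, please ask for help.
[INFO] 2023-02-07 22:13:22 Real time: 1056.843 sec; CPU: 109.155 sec; Peak RSS: 0.917 GB
"""

for timestamp, level, message in find_warnings(sample):
    print(timestamp, level, message)
```

In practice the text would be read from the job's own log file; the location of that file depends on the run and is not assumed here.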

Thank you very much.

Best wishes, Yutang

moold commented 1 year ago

See previous issues.

Yutang-ETH commented 1 year ago

Hi @moold ,

Thank you very much for your reply, but could you please indicate which issue I should refer to?

I saw #94 showing the same issue and the size of their genome assembly is also around 3.5 Gb, but I don't think the missing completeness is due to polishing. I think some sequences are somehow discarded because of the limitation (current version only supports size < 3.5 Gb) mentioned above.

Besides, the oat paper, https://www.nature.com/articles/s41588-022-01127-7, says that they assembled hexaploid oat (genome size 10 Gb) using NextDenovo (v2.0-beta.1). I am wondering whether v2.0-beta.1 has no assembly size limitation?

I really appreciate your help and input, but I don't think my issue is solved. Could you please give more explanations and suggestions? Thank you very much again.

Best wishes, Yutang

moold commented 1 year ago

At this early stage, NextDenovo only supports our cooperative projects, but since we plan to publish this software, it will be fully open source in the future. If you want to use it now, please send me your email, and I will send you an unlimited version.

Yutang-ETH commented 1 year ago

Hi @moold,

Thank you very much for your honest reply. This is my email: yutang.chen@usys.ethz.ch. Regarding the unlimited version, does it also have a GNU license like the current version? It's embarrassing to ask such a question, but I just want to make sure everything is clear and transparent.

Best wishes, Yutang