Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

OverflowError: cannot serialize a string larger than 2 GiB #69

Open arnkress opened 4 years ago

arnkress commented 4 years ago

Hi, I am assembling a ~3.4 Gb genome and am getting errors at the 02.cns_align step.

Here is the error from the main log:

[ERROR] 2020-06-04 07:20:00,950 get_cns failed: please check the following logs:
[ERROR] 2020-06-04 07:20:00,951 /data/NextDenovo/01_rundir/02.cns_align/01.get_cns.sh.work/get_cns00/nextDenovo.sh.e
[ERROR] 2020-06-04 07:20:00,951 /data/NextDenovo/01_rundir/02.cns_align/01.get_cns.sh.work/get_cns10/nextDenovo.sh.e

And in the file 01_rundir/02.cns_align/01.get_cns.sh.work/get_cns00/nextDenovo.sh.e:

hostname
cd /data/NextDenovo/01_rundir/02.cns_align/01.get_cns.sh.work/get_cns00
time python /soft/ngs/NextDenovo/lib/nextcorrect.py -f /data/NextDenovo/01_rundir/02.cns_align//01.get_cns.input.idxs -i /data/NextDenovo/01_rundir/01.raw_align/03.sort_align.sh.work/sort_align00/input.seed.001.sorted.ovl -p 15 -b -max_lq_length 10000 -o cns.fasta;
[INFO] 2020-06-03 16:05:51,889 Corrected step options:
[INFO] 2020-06-03 16:05:51,889 Namespace(blacklist=False, dbuf=False, fast=False, idxs='/data/NextDenovo/01_rundir/02.cns_align//01.get_cns.input.idxs', max_cov_aln=130, max_lq_length=10000, min_cov_base=4, min_cov_seed=10, min_error_corrected_ratio=0.8, min_len_aln=500, min_len_seed=10000, out='cns.fasta', ovl='/data/NextDenovo/01_rundir/01.raw_align/03.sort_align.sh.work/sort_align00/input.seed.001.sorted.ovl', process=15, split=False)
[INFO] 2020-06-03 16:05:52,134 Start a cns worker in 14791 from parent 14777
[INFO] 2020-06-03 16:05:52,134 Start a cns worker in 14793 from parent 14777
[INFO] 2020-06-03 16:05:52,135 Start a cns worker in 14795 from parent 14777
[INFO] 2020-06-03 16:05:52,135 Start a cns worker in 14797 from parent 14777
[INFO] 2020-06-03 16:05:52,136 Start a cns worker in 14799 from parent 14777
[INFO] 2020-06-03 16:05:52,137 Start a cns worker in 14801 from parent 14777
[INFO] 2020-06-03 16:05:52,138 Start a cns worker in 14803 from parent 14777
[INFO] 2020-06-03 16:05:52,138 Start a cns worker in 14805 from parent 14777
[INFO] 2020-06-03 16:05:52,139 Start a cns worker in 14807 from parent 14777
[INFO] 2020-06-03 16:05:52,140 Start a cns worker in 14809 from parent 14777
[INFO] 2020-06-03 16:05:52,140 Start a cns worker in 14811 from parent 14777
[INFO] 2020-06-03 16:05:52,141 Start a cns worker in 14813 from parent 14777
[INFO] 2020-06-03 16:05:52,142 Start a cns worker in 14815 from parent 14777
[INFO] 2020-06-03 16:05:52,143 Start a cns worker in 14817 from parent 14777
[INFO] 2020-06-03 16:05:52,143 Start a cns worker in 14820 from parent 14777
Traceback (most recent call last):
  File "/soft/ngs/NextDenovo/lib/nextcorrect.py", line 304, in <module>
    main(args)
  File "/soft/ngs/NextDenovo/lib/nextcorrect.py", line 226, in main
    worker, read_seq_data(args, corrected_seeds), chunksize=1):
  File "/usr/local/python/miniconda/envs/nextdenovo/lib/python2.7/multiprocessing/pool.py", line 673, in next
    raise value
OverflowError: cannot serialize a string larger than 2 GiB
Command exited with non-zero status 1
42484.83user 4473.74system 2:18:10elapsed 566%CPU (0avgtext+0avgdata 42352176maxresident)k
6227512inputs+864584outputs (43major+1645951614minor)pagefaults 0swaps

It seems to be caused by the 2 GiB serialization limit of the multiprocessing/pickle libraries under Python 2.7. Is there any workaround?
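For context, here is a minimal sketch of where the limit comes from (this is illustrative code, not an excerpt from nextcorrect.py, and build_consensus is a hypothetical stand-in): multiprocessing.Pool moves every argument and every result between the parent and the workers via pickle, and Python 2.7's cPickle refuses to serialize a single string larger than 2 GiB, which is exactly the error raised from pool.py in the traceback above. Pickle protocol 4 (Python 3.4+) was designed to handle very large objects.

```python
# Illustrative sketch, not NextDenovo code: every object that crosses a
# multiprocessing.Pool boundary is serialized with pickle, so a worker
# whose return value is a single >2 GiB string fails on Python 2.7 with
# "cannot serialize a string larger than 2 GiB".
import multiprocessing


def build_consensus(seed_id):
    # Placeholder for a corrected-read string; the return value is
    # pickled in the worker and unpickled in the parent process.
    return "A" * 1000  # small here; a >2**31-byte string would trigger the error on Python 2


if __name__ == "__main__":
    pool = multiprocessing.Pool(4)
    for seq in pool.imap(build_consensus, range(8), chunksize=1):
        print(len(seq))
    pool.close()
    pool.join()
```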

Thanks

moold commented 4 years ago

Hi, there are a few options. If you are familiar with Python, you can refer here, or add some exit code at line 143 in the nextcorrect.py file so that it stops once it has generated a lot of data, and then continue to run the remaining data manually over multiple passes. Alternatively, you can increase the value of seed_cutfiles (this requires re-running the whole pipeline), or wait for the next version, which is compatible with Python 3 (I will release it in a few weeks).
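For anyone hitting this before the Python 3 release, the following is a hedged, generic sketch of the usual way around this class of error, not the actual nextcorrect.py logic (the worker and build_consensus names are hypothetical): keep multi-GiB strings out of the pool entirely by having each worker write its output to its own file and return only the path, which the parent then concatenates.

```python
# Generic workaround sketch, not NextDenovo code: instead of returning a
# huge consensus string through the Pool (which pickles it), each worker
# writes its result to a temporary file and returns only the file path;
# the parent then merges the parts into the final output.
import multiprocessing
import os
import tempfile


def build_consensus(seed_id):
    return "ACGT" * 250  # placeholder for a potentially multi-GiB string


def worker(seed_id):
    seq = build_consensus(seed_id)
    fd, path = tempfile.mkstemp(prefix="cns_%d_" % seed_id, suffix=".fasta")
    with os.fdopen(fd, "w") as out:
        out.write(">seed_%d\n%s\n" % (seed_id, seq))
    return path  # a short path string is always safe to pickle


if __name__ == "__main__":
    pool = multiprocessing.Pool(4)
    with open("cns.fasta", "w") as final:
        for part in pool.imap(worker, range(8), chunksize=1):
            with open(part) as fh:
                final.write(fh.read())
            os.remove(part)
    pool.close()
    pool.join()
```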