HadrienG / InSilicoSeq

:rocket: A sequencing simulator
https://insilicoseq.readthedocs.io
MIT License

RuntimeError: The task could not be sent to the workers as it is too large for `send_bytes`. #119

Closed. novitch closed this issue 4 years ago.

novitch commented 4 years ago

Hi, I would like to use the software with multi-threading, but when I tried I ran into an issue. Here is the complete stdout:

INFO:iss.app:Starting iss generate
INFO:iss.app:Using kde ErrorModel
INFO:iss.app:Setting random seed to 110803
INFO:iss.util:Stitching input files together
INFO:iss.app:Using zero_inflated_lognormal abundance distribution
INFO:iss.app:Using 10 cpus for read generation
INFO:iss.app:Generating 1000000 reads
INFO:iss.app:Generating reads for record: GCA_000710275.1_ASM71027v1_genomic.fna
INFO:iss.app:Generating reads for record: GCA_001600775.1_JCM_11348_assembly_v001_genomic.fna
INFO:iss.app:Generating reads for record: GCA_001890705.1_Aspsy1_genomic.fna
INFO:iss.app:Generating reads for record: GCA_002551515.1_Malafurf_genomic.fna
INFO:iss.app:Generating reads for record: GCA_002901145.1_ASM290114v1_genomic.fna
INFO:iss.app:Generating reads for record: GCF_000001405.39_GRCh38.p13_genomic.fna
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/matalb01/virtual-envs/iss_36/lib/python3.6/site-packages/joblib/externals/loky/backend/queues.py", line 156, in _feed
    send_bytes(obj_)
  File "/opt/intel/intelpython3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
 File "/opt/intel/intelpython3/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/matalb01/virtual-envs/iss_36/bin/iss", line 10, in <module>
    sys.exit(main())
  File "/home/matalb01/virtual-envs/iss_36/lib/python3.6/site-packages/iss/app.py", line 510, in main
    args.func(args)
  File "/home/matalb01/virtual-envs/iss_36/lib/python3.6/site-packages/iss/app.py", line 226, in generate_reads
    args.gc_bias) for i in range(cpus))
  File "/home/matalb01/virtual-envs/iss_36/lib/python3.6/site-packages/joblib/parallel.py", line 934, in __call__
    self.retrieve()
  File "/home/matalb01/virtual-envs/iss_36/lib/python3.6/site-packages/joblib/parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/matalb01/virtual-envs/iss_36/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
    return future.result(timeout=timeout)
  File "/opt/intel/intelpython3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
 File "/opt/intel/intelpython3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
RuntimeError: The task could not be sent to the workers as it is too large for `send_bytes`.
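
For context, the struct.error at the root of this traceback comes from Python's multiprocessing connection layer, whose _send_bytes packs the length of every pickled payload into a signed 32-bit header. On the Python 3.6 used here, any task larger than 2**31 - 1 bytes (about 2 GiB) therefore cannot be dispatched to a worker. A minimal sketch reproducing the limit:

import struct

# multiprocessing.connection._send_bytes packs the payload size with
# "!i" (a signed 32-bit int), so pickled tasks over 2**31 - 1 bytes
# (~2 GiB) hit exactly the error shown in the traceback above:
too_big = 2**31  # one byte past the limit
try:
    struct.pack("!i", too_big)
except struct.error as err:
    print(err)  # 'i' format requires -2147483648 <= number <= 2147483647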
novitch commented 4 years ago

If you have a suggestion, I would like to test it. I am working on a Linux server with 1 TB of RAM and 120 CPUs. I tried between 10 and 120 CPUs and got the same error message. Thanks, Alban.

HadrienG commented 4 years ago

Hi!

Thanks for reporting this. Could you share the complete command you ran, as well as some details about your input genomes?

The error message indicates that some data object is too large to be passed around by the multiprocessing library. I haven't tested InSilicoSeq on large(ish) eukaryotes, so it is possible the human genome is too big for the default data type used by multiprocessing.

I'll test on my side, but in the meantime you can try again after removing the human genome from your input dataset.
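
If it helps, here is a minimal sketch for dropping the human genome from the concatenated input (assuming Biopython is available and that the human record id matches the one shown in the log above):

from Bio import SeqIO

# Record id as it appears in the log output above
human_id = "GCF_000001405.39_GRCh38.p13_genomic.fna"

# Keep every record except the human genome and write a new input file
records = (rec for rec in SeqIO.parse("genomes.fna", "fasta")
           if rec.id != human_id)
SeqIO.write(records, "genomes_no_human.fna", "fasta")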

Best, Hadrien

novitch commented 4 years ago

Hi Hadrien, The file is 3.1 GB. I'll try without it and let you know.

The complete command is

iss generate --seed 110803 --abundance zero_inflated_lognormal --cpus 10 --genomes ../genomes_db/genomes.fna --model hiseq --output simulation_1million_1

genomes.fna contains 114 genomes (human is the largest, and the only one in the gigabyte range).

novitch commented 4 years ago

OK, it seems to work if I do not use the human genome and do not use the full 120 threads; 60 threads work instead.

novitch commented 4 years ago

Hi, a little update: I thought I could deal with the issue by generating the reads for my community on one side and the reads for the human genome on the other. But with the human genome alone, the problem still persists. I can't work without human reads; do you think this will be achievable?

HadrienG commented 4 years ago

Unfortunately, I will not have time to fix this issue before mid-October.

novitch commented 4 years ago

OK, so I'll try mixing ART for the human reads and your software for the microbes.

Thanks, Alban.

HadrienG commented 4 years ago

Hi,

I started working on this. I could reproduce the bug when generating reads from a FASTA file containing all the human chromosomes concatenated together as one record.

Any reason you are concatenating instead of using --draft to generate an accurate number of reads from each record in the reference genome?
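
For instance (a sketch with a hypothetical file name, assuming the per-chromosome FASTA rather than the concatenated one):

iss generate --draft GRCh38.fna --model hiseq --cpus 10 --output human_reads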

EDIT: I have a fix on the mem branch. You can install from there with

pip install git+https://github.com/HadrienG/InSilicoSeq.git@mem

The fix is currently about 2 times slower than 1.4.x in preliminary tests. It will need to be optimised before I can merge and release an official bugfix.

novitch commented 4 years ago

Hi, I was looking to generate reads with specified abundance values. So if I understood correctly, I can't use both the --draft option and an abundance file.

Thanks for your quick response, I'll try the mem branch.
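
For reference, the abundance file in question is, as I understand the documentation, a two-column, tab-separated list of record ids and relative abundances summing to 1, passed via --abundance_file. A hypothetical example using two of the record ids from the log above:

GCA_000710275.1_ASM71027v1_genomic.fna	0.3
GCF_000001405.39_GRCh38.p13_genomic.fna	0.7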

HadrienG commented 4 years ago

I can't use both the --draft option and an abundance file.

Correct. This should be addressed within the month for release 1.5.0 (see #83).

I'll try the mem branch

Thanks. Don't hesitate to report any bug you might find 😄

HadrienG commented 4 years ago

The fix is implemented in 1.4.4.

novitch commented 4 years ago

Thanks, Hadrien. Great job on your software and quick releases :)