Chrisjrt / hafeZ

A tool for identifying active prophage elements through read mapping
GNU General Public License v3.0

median Z for each ROI multiprocessing error #7


wchow commented 1 year ago

Hi,

Cheers on creating hafeZ. I was trying to run this on my own dataset but encountered a Python multiprocessing error during the step "Calculating median Z for each roi".

For context, I'm running against a metagenome sample of roughly 38k contigs, mapping with around 70K PE Illumina reads. My runtime command is:

hafeZ.py -r1 Read1.fastq.gz -r2 Read2.fastq.gz -o $PWD -f allcontigs.fasta -t 40 -D db/hafeZ_db/ -T phrogs -M 5G

The output error is:

Process ForkPoolWorker-111:
Traceback (most recent call last):
  File "/scratch/conda/envs/hafez/lib/python3.10/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/scratch/conda/envs/hafez/lib/python3.10/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/scratch/conda/envs/hafez/lib/python3.10/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/scratch/conda/envs/hafez/lib/python3.10/multiprocessing/connection.py", line 409, in _send_bytes
    self._send(header)
  File "/scratch/conda/envs/hafez/lib/python3.10/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/conda/envs/hafez/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/scratch/conda/envs/hafez/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/scratch/conda/envs/hafez/lib/python3.10/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/scratch/conda/envs/hafez/lib/python3.10/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/scratch/conda/envs/hafez/lib/python3.10/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/scratch/conda/envs/hafez/lib/python3.10/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/scratch/conda/envs/hafez/lib/python3.10/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Any idea what the issue could be? Do I need to downsample?

Thanks again for your help.

Will

Chrisjrt commented 1 year ago

Hey,

Sorry for the delayed response.

Hmm... this is a new one 🤔... I think this is an error caused by a process taking up too much memory and then being killed by the OS.
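
If you want to confirm that, you can check the kernel log for OOM kills after a failed run. Something like this should work on a Linux host (assuming you can read the kernel ring buffer):

dmesg -T | grep -i -E 'out of memory|killed process'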

I'll say that hafeZ isn't well suited to running on metagenomic data, as it was designed for use on individual genomes, so I think it's freaking out because it found too many ROIs.

I'd recommend binning your metagenome into MAGs and then rerunning hafeZ on each MAG, as hafeZ's calculations also expect single genomes. For best results I'd also recommend mapping reads to each MAG ahead of time and using only the reads that mapped to that MAG as input for hafeZ; see the sketch below.
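
As a rough, untested sketch of that pre-mapping step (assuming minimap2 and samtools are installed; MAG.fasta and the intermediate file names are placeholders, and the hafeZ options are copied from your original command):

minimap2 -ax sr MAG.fasta Read1.fastq.gz Read2.fastq.gz \
  | samtools view -b -f 2 -o MAG.mapped.bam -        # keep only properly paired, mapped reads
samtools sort -n -o MAG.namesorted.bam MAG.mapped.bam   # name-sort so mates stay together
samtools fastq -1 MAG_R1.fastq.gz -2 MAG_R2.fastq.gz \
  -0 /dev/null -s /dev/null -n MAG.namesorted.bam
hafeZ.py -r1 MAG_R1.fastq.gz -r2 MAG_R2.fastq.gz -o out_MAG -f MAG.fasta -t 40 -D db/hafeZ_db/ -T phrogs -M 5G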

'True' metagenome functionality is something I'd like to add in the future, but it won't be coming for a wee while.

Hope that helps!

wchow commented 1 year ago

Hi @Chrisjrt ,

Thanks for the information. I figured it might be me trying to throw the kitchen sink at the tool. I'll have a bit of a think as well. Out of curiosity, are there any cases where a drop in coverage from the baseline can indicate an insertion event (like if it is a low/rare event)? Thanks!