EichlerLab / pav

Phased assembly variant caller

OSError in "call_inv_batch" rule in long runs with multiple genomes #28

Closed svenwillger closed 1 year ago

svenwillger commented 1 year ago

Hi,

I'm trying to run PAV on 9 assembled mouse genomes (it works flawlessly for me when running PAV on 3 genomes). It starts normally and runs for several hours, but at some point the pipeline crashes in the "call_inv_batch" rule, in the middle of that rule's run, with an error message stating "Function is not implemented".

I'm using 32 cores and >300GB memory (and I already tried adding more of each, to no avail).

Here is a snippet from the log file "invcall#.log":

Scanning for inversions in flagged region: chr2:105969344-105972199 (flagged region record id = chr2-105969344-RGN-2855)
Scanning region: chr2:105967344-105974199
Found no inverted k-mer states after 1 expansion(s)
Scanning for inversions in flagged region: chr2:113010852-113012402 (flagged region record id = chr2-113010852-RGN-1550)
Scanning region: chr2:113008852-113014402
Found no inverted k-mer states after 1 expansion(s)
Scanning for inversions in flagged region: chr2:122606103-122606795 (flagged region record id = chr2-122606103-RGN-692)
Scanning region: chr2:122604103-122608795
Received return code 1 from scripts/density.py for region chr2:122604103-122608795:
Traceback (most recent call last):
  File "/opt/pav/scripts/density.py", line 549, in <module>
    get_smoothed_density()
  File "/opt/pav/scripts/density.py", line 230, in get_smoothed_density
    pool = mp.Pool(threads, initializer=init_process)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 191, in __init__
    self._setup_queues()
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 346, in _setup_queues
    self._inqueue = self._ctx.SimpleQueue()
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/context.py", line 113, in SimpleQueue
    return SimpleQueue(ctx=self.get_context())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/queues.py", line 341, in __init__
    self._rlock = ctx.Lock()
                  ^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/context.py", line 68, in Lock
    return Lock(ctx=self.get_context())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/synchronize.py", line 162, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/usr/local/lib/python3.11/multiprocessing/synchronize.py", line 57, in __init__
    sl = self._semlock = _multiprocessing.SemLock(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 38] Function not implemented
paudano commented 1 year ago

Python multiprocessing isn't supported on systems without /dev/shm, which is a small in-memory file system, and it looks like this error comes up if it's not present. If you run mount | grep /dev/shm on the machine where PAV is running, you should see something like "tmpfs on /dev/shm type tmpfs (rw)". If not, that's the source of the error. I assume there were no inversions to resolve in the samples that completed successfully, and it's trying to detect an inversion in one of the new samples.
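The /dev/shm dependency can also be probed from Python directly: creating any multiprocessing lock allocates a POSIX semaphore, which is the exact call that fails with Errno 38 in the traceback above. A minimal check might look like this (the helper name is illustrative, not part of PAV):

```python
import errno
import multiprocessing as mp

def semaphores_available():
    """Return True if POSIX semaphores (backed by /dev/shm on Linux)
    can be created; multiprocessing.Pool requires them."""
    try:
        # Creating a Lock allocates a SemLock, the same call that fails
        # with OSError [Errno 38] "Function not implemented" in the PAV log.
        mp.Lock()
        return True
    except OSError as e:
        if e.errno == errno.ENOSYS:
            return False
        raise
```

If this returns False on a compute node, any code path that constructs a Pool on that node will fail the same way the "call_inv_batch" rule did.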

I created an update (not yet pushed) that will not try to use multiprocessing if the number of threads for inversion detection is less than 2. You'll have to pull the update, make a few parameter changes in config.json to set inversion threads to 1 (will run slower), and run from where it crashed (no need to start over). Would you like to test it? Let me know if you are running a native PAV install (i.e., running rundist or Snakemake directly from a git clone) or if you are running on Docker or Singularity.
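In outline, the workaround behaves like the sketch below (names are illustrative, not PAV's actual code): when fewer than 2 threads are requested, regions are processed in a plain loop, so no SemLock is ever created.

```python
import multiprocessing as mp

def smooth_regions(regions, worker, threads):
    """Process regions with a pool only when one is actually needed.

    Illustrative sketch of the workaround: with threads < 2, the
    multiprocessing machinery (and its /dev/shm-backed semaphores)
    is bypassed entirely.
    """
    if threads < 2:
        return [worker(region) for region in regions]
    with mp.Pool(threads) as pool:
        return pool.map(worker, regions)
```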

svenwillger commented 1 year ago

Thanks for the quick reply. The output of mount | grep /dev/shm is "tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)". I'm running a native PAV install with Snakemake because I'm on a cluster that doesn't allow connections to the outside world, and the version I'm using is v2.2.3. I'm happy to test the new version even if it takes a little bit longer.

paudano commented 1 year ago

I cannot imagine why it's failing on your cluster. Everything I can find about this error is related to multiprocessing on AWS Lambda. Are you sure that there's not one node with a full or missing /dev/shm that might be causing this?

I just pushed v2.2.4.1 with a workaround. If you set "inv_threads" and "inv_threads_lg" to "1" in config.json, it won't try to parallelize inversion detection with the Python multiprocessing library at all. This is likely to make PAV run very slowly, since inversion resolution is already a large part of the PAV runtime (I'm planning to replace it with a better method eventually).
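For reference, the config.json fragment for this workaround might look like the following (any other options you already have stay in place; the quoted "1" follows the wording above, and I'm not certain whether PAV also accepts an unquoted integer here):

```json
{
    "inv_threads": "1",
    "inv_threads_lg": "1"
}
```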

As a side note, you can pull the PAV Singularity image and save it as a SIF file, then run PAV from that. All dependencies will be included in that image, so it might simplify things later on.

paudano commented 1 year ago

Also, nothing changed between v2.2.3 and v2.2.4 unless you were using the LRA aligner, so the version change shouldn't affect anything. Let me know if it's a concern or a problem. I'm going to re-release v2.2.4.1 as v2.2.5 (the last number is only set for dev builds) once it has been tested.

paudano commented 1 year ago

Were you able to complete PAV runs with the update?

svenwillger commented 1 year ago

I was able to complete PAV with 8 samples (with 1 restart), even with v2.2.3. Over the weekend I tried to run it with more samples (18), but the same error occurred again. I was hoping to run it as far as possible with multiple threads and then use the update when the error appeared. Unfortunately, after the 2nd restart the Snakemake workflow started all over again because some temp files were removed. I have now switched completely to the updated version with just 1 thread, but it'll take some time. I'll let you know how it goes.

paudano commented 1 year ago

If it helps, add --nt to the rundist or Snakemake command to retain temporary files until things complete.
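For example (the job count is illustrative; --nt is Snakemake's shorthand for --notemp):

```shell
# Keep files marked temp() until the whole workflow finishes,
# so a crashed run can resume without redoing upstream steps.
snakemake -j 32 --nt
```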

svenwillger commented 1 year ago

I ran PAV separately for each sample now, and that worked without problems even with 12 or more cores. However, I noticed that some samples were "tricky": even with 12 cores and >100GB memory they needed more than 24h, and the resulting vcf.gz files are >1GB. Those samples might have clogged up the pipeline and caused the multiprocessing issue.

paudano commented 1 year ago

It's possible that assembly quality can cause this if it leads to a large number of indels; check whether the indel count is unusually high for those samples (SNV and SV counts may also be elevated).

If it is assembly quality and you can identify loci in the assembly where there are problems such as collapses (e.g. with the Flagger pipeline from the HPRC), then you can feed PAV a BED file of misassemblies in assembly coordinates, and it will ignore any variants in those loci. That would improve the runtime, remove false variants, and prevent PAV from dropping true variants inside false joins that lead to large false deletions (PAV removes small variants inside deletions on the same haplotype). The config option is tig_filter_pattern, and it works just like asm_pattern with {asm_name} and {hap} wildcards inserted in it. Not sure if it's useful for you, but it's there if you need it.
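As a hypothetical example (the BED path here is made up; the {asm_name} and {hap} wildcards are filled in per sample and haplotype, as with asm_pattern):

```json
{
    "tig_filter_pattern": "flagger/{asm_name}_{hap}_misassemblies.bed"
}
```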

I'm still not sure why the OSError was thrown; that's very strange. Perhaps memory limits were reached and it denied creating a file in /dev/shm?

Whatever it was, thanks for reporting this because it still led to a workaround for systems where /dev/shm is not available.