bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Issue with mutect PON causing symbolic link error #3387

Open waemm opened 3 years ago

waemm commented 3 years ago

Hi everyone,

I have hit this error several times when running bcbio with a panel of normals (PON) for mutect2. It looks as if too many instances are trying to access this file at once. I'm not sure what is causing it. Does anyone have suggestions for how to prevent it? If I rerun bcbio, it continues without any issues.

The error:

[12:apply]: OSError: [Errno 40] Too many levels of symbolic links: '/shared/pipeline-user/run_data/Exome_data_/Tumor_only_neo_batch4_wrapper_run/tumor_only_neobatch4/tumor_only_neobatch4_samples/work/mutect2/panels/pon_v2.vcf.gz.tbi'
[40:apply]: OSError: [Errno 40] Too many levels of symbolic links: '/shared/pipeline-user/run_data/Exome_data_/Tumor_only_neo_batch4_wrapper_run/tumor_only_neobatch4/tumor_only_neobatch4_samples/work/mutect2/panels/pon_v2.vcf.gz.tbi'

Version info

Your sample configuration file:

# Tumor run template
---
details:
- algorithm:
    aligner: bwa
    mark_duplicates: true
    recalibrate: true
    remove_lcr: true
    variantcaller: [mutect2]
    svcaller: [cnvkit]
    background:
      variant: /shared/mapping_bias/pon_v2.vcf.gz
      cnv_reference:
        cnvkit: /shared/analysis_data/batch1-cnvkit-background.cnn
    variant_regions: /shared/capture_regions_flanked/cleaned-Exome_v1_hg38_Targets_Standard.100flank-merged.bed.gz
    sv_regions: /shared/capture_regions_flanked/svregions-cleaned-Exome_v1_hg38_Targets_Standard.100flank.bed.gz
    vcfanno: [somatic]
    tools_off:
    - bcftools
    - snpeff
    - viral
    - gemini
    - samtools
    - peddy
    - contamination
    - multiqc
  analysis: variant2
  genome_build: hg38
  description: neobatch4
fc_date: '2020-12-03'
fc_name: 'tumor_only'
upload:
  dir: ../final

Observed behavior

Error message or bcbio output:

Sending a shutdown signal to the controller and engines.
2020-12-04 11:15:20.252 [IPClusterStop] Using existing profile dir: '/shared/pipeline-user/run_data/Exome_data_/Tumor_only_neo_batch4_wrapper_run/tumor_only_neobatch4/tumor_only_neobatch4_samples/work/log/ipython'
2020-12-04 11:15:20.256 [IPClusterStop] Stopping cluster [pid=2909] with [signal=<Signals.SIGINT: 2>]
2020-12-04 11:15:20.728 [IPClusterStop] Using existing profile dir: '/shared/pipeline-user/run_data/Exome_data_/Tumor_only_neo_batch4_wrapper_run/tumor_only_neobatch4/tumor_only_neobatch4_samples/work/log/ipython'
2020-12-04 11:15:20.731 [IPClusterStop] Stopping cluster [pid=2909] with [signal=<Signals.SIGINT: 2>]
Traceback (most recent call last):
  File "/shared/pipeline-user/tools/local/bin/bcbio_nextgen.py", line 4, in <module>
    __import__('pkg_resources').run_script('bcbio-nextgen==1.2.3', 'bcbio_nextgen.py')
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/pkg_resources/__init__.py", line 665, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/bcbio_nextgen-1.2.3-py3.6.egg/EGG-INFO/scripts/bcbio_nextgen.py", line 245, in <module>
    main(**kwargs)
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/bcbio_nextgen-1.2.3-py3.6.egg/EGG-INFO/scripts/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/bcbio_nextgen-1.2.3-py3.6.egg/bcbio/pipeline/main.py", line 58, in run_main
    fc_dir, run_info_yaml)
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/bcbio_nextgen-1.2.3-py3.6.egg/bcbio/pipeline/main.py", line 91, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/bcbio_nextgen-1.2.3-py3.6.egg/bcbio/pipeline/main.py", line 154, in variant2pipeline
    samples = genotype.parallel_variantcall_region(samples, run_parallel)
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/bcbio_nextgen-1.2.3-py3.6.egg/bcbio/variation/genotype.py", line 208, in parallel_variantcall_region
    "vrn_file", ["region", "sam_ref", "config"]))
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/bcbio_nextgen-1.2.3-py3.6.egg/bcbio/distributed/split.py", line 35, in grouped_parallel_split_combine
    final_output = parallel_fn(parallel_name, split_args)
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/bcbio_nextgen-1.2.3-py3.6.egg/bcbio/distributed/ipython.py", line 137, in run
    for data in view.map_sync(fn, items, track=False):
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/ipyparallel/client/view.py", line 344, in map_sync
    return self.map(f,*sequences,**kwargs)
  File "<decorator-gen-140>", line 2, in map
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/ipyparallel/client/view.py", line 52, in sync_results
    ret = f(self, *args, **kwargs)
  File "<decorator-gen-139>", line 2, in map
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/ipyparallel/client/view.py", line 37, in save_ids
    ret = f(self, *args, **kwargs)
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/ipyparallel/client/view.py", line 1114, in map
    return pf.map(*sequences)
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/ipyparallel/client/remotefunction.py", line 299, in map
    return self(*sequences, __ipp_mapping=True)
  File "<decorator-gen-122>", line 2, in __call__
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/ipyparallel/client/remotefunction.py", line 80, in sync_view_results
    return f(self, *args, **kwargs)
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/ipyparallel/client/remotefunction.py", line 285, in __call__
    return r.get()
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/ipyparallel/client/asyncresult.py", line 169, in get
    raise self.exception()
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/ipyparallel/client/asyncresult.py", line 228, in _resolve_result
    results = error.collect_exceptions(results, self._fname)
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/ipyparallel/error.py", line 233, in collect_exceptions
    raise e
  File "/shared/pipeline-user/bcbio/anaconda/lib/python3.6/site-packages/ipyparallel/error.py", line 231, in collect_exceptions
    raise CompositeError(msg, elist)
ipyparallel.error.CompositeError: one or more exceptions from call to method: variantcall_sample
[12:apply]: OSError: [Errno 40] Too many levels of symbolic links: '/shared/pipeline-user/run_data/Exome_data_/Tumor_only_neo_batch4_wrapper_run/tumor_only_neobatch4/tumor_only_neobatch4_samples/work/mutect2/panels/pon_v2.vcf.gz.tbi'
[40:apply]: OSError: [Errno 40] Too many levels of symbolic links: '/shared/pipeline-user/run_data/Exome_data_/Tumor_only_neo_batch4_wrapper_run/tumor_only_neobatch4/tumor_only_neobatch4_samples/work/mutect2/panels/pon_v2.vcf.gz.tbi'
naumenko-sa commented 3 years ago

Hi @waemm! Thanks for reporting!

I don't think we are symlinking in https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/distributed/split.py#L18.

The usual advice is to inspect the directory with:

cd /suspected/directory
find -L ./ -mindepth 15

(this lists entries that are at least 15 levels deep once symlinks are followed, which usually indicates circular symlinks).
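The same check can be scripted. The following is a minimal sketch, not part of bcbio: the function name `find_symlink_loops` is made up for illustration, and it assumes a POSIX system where resolving a circular link fails with `ELOOP` (errno 40 on Linux, the code in the traceback).

```python
import errno
import os


def find_symlink_loops(root):
    """Return the symlinks under root whose resolution fails with ELOOP."""
    loops = []
    # followlinks=False keeps the walk itself from descending into a cycle.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            try:
                os.stat(path)  # follows the whole link chain
            except OSError as exc:
                # Dangling links raise ENOENT and are deliberately ignored.
                if exc.errno == errno.ELOOP:
                    loops.append(path)
    return sorted(loops)
```

Calling `find_symlink_loops()` on the `work/mutect2` directory right after a failed run, before rerunning, would show whether the `.tbi` entry is really a circular link at that moment.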

Have you tried to investigate the directory? /shared/pipeline-user/run_data/Exome_data_/Tumor_only_neo_batch4_wrapper_run/tumor_only_neobatch4/tumor_only_neobatch4_samples/work/mutect2/

What is the real path? Are there any symlinks involved?
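To answer both questions for a single file, the link chain can be followed one hop at a time. This is a rough sketch under stated assumptions: `trace_links` is a hypothetical helper, it only inspects the final path component (symlinked parent directories are not traced), and `max_hops=40` is an arbitrary cap rather than the kernel's actual limit.

```python
import os


def trace_links(path, max_hops=40):
    """Follow a symlink chain hop by hop.

    Returns (chain, status), where status is "resolved", "missing" or "loop".
    """
    current = os.path.abspath(path)
    chain = [current]
    seen = {current}
    for _ in range(max_hops):
        if not os.path.islink(current):
            status = "resolved" if os.path.exists(current) else "missing"
            return chain, status
        target = os.readlink(current)
        # A relative target is interpreted relative to the link's directory.
        current = os.path.normpath(
            os.path.join(os.path.dirname(current), target))
        chain.append(current)
        if current in seen:
            return chain, "loop"
        seen.add(current)
    return chain, "loop"
```

A chain of length one with status `"resolved"` means the path is a plain file. A `"loop"` status shows which hop points back to an earlier one.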

You suspected that too many processes were accessing the file. How many samples are you processing? What is your parallel configuration, i.e., how many worker jobs are created?

We had a somewhat related issue: https://github.com/bcbio/bcbio-nextgen/issues/3167 @gis-nlsim, have you discovered anything useful since then?

Sergey

gis-nlsim commented 3 years ago

Sorry, been caught up with other projects so I haven’t been trying to install bcbio. Will try it again at the end of this month.

waemm commented 3 years ago

Hi @naumenko-sa, thanks for your reply! I am not sure what is causing this. It has only happened since I started including a PON for mutect2. About your questions:

  1. This is the real path. It looks like bcbio copies the original files (which were not symlinks either).
  2. We currently use an SGE cluster with a single mounted drive (NVMe) for all reads and writes. bcbio generally runs on a 10-node cluster (16 cores per node); assuming 1 core per process, that is 160 processes accessing the PON file. We occasionally use 20 nodes, but I don't believe the error is any less frequent with 10 nodes.
  3. We're only processing a handful of samples, mostly 5-10.

I am asking because the error concerns the index file both in my case and in @gis-nlsim's. Could it be related to how the process reads this file, or to the OS not allowing enough concurrent connections to it? It is a really strange error, since nothing is being symlinked. I have seen this error associated with gzip before (a completely unrelated issue in different software). I'm not sure who I could ask or who might know what is going on here.

I'm not sure whether it makes a difference that we're using a single mounted drive across the whole cluster. This has never been an issue, but I bring it up just in case.

naumenko-sa commented 3 years ago

Hi @waemm !

All clusters use one shared file system or another, so that should not be an issue. Have you tried reducing the number of reading processes to test the high-load hypothesis, i.e. starting bcbio with 1, 2, or 5 nodes (16, 32, 80 cores)? Does it pass?
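A much cheaper proxy for the high-load hypothesis is to hammer the index file with concurrent `stat` calls and count `ELOOP` failures. The sketch below makes no claim about how bcbio accesses the file. `stat_under_load` is a hypothetical helper, and it runs threads on a single node, so it cannot reproduce cross-node effects on a shared mount.

```python
import errno
import os
from concurrent.futures import ThreadPoolExecutor


def stat_under_load(path, n_workers, n_calls=1000):
    """Stat path n_calls times from n_workers threads; return the ELOOP count."""
    def probe(_):
        try:
            os.stat(path)
            return 0
        except OSError as exc:
            if exc.errno == errno.ELOOP:
                return 1
            raise  # any other error is unexpected here, so surface it

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(probe, range(n_calls)))
```

If this returns zero for a healthy file even with a large `n_workers`, plain concurrent reads are an unlikely cause, and attention shifts to whatever creates the entry under `work/mutect2/panels`.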

I'm not sure if it is related, but increasing the memory of the controller job with -r conmem=4 sometimes helps: https://bcbio-nextgen.readthedocs.io/en/latest/contents/parallel.html#ipython-parallel

Sergey

naumenko-sa commented 3 years ago

Closing for now. Please feel free to re-open if there is new evidence that warrants further investigation.

dauss75 commented 2 years ago

I have the same problem as waemm; it seems we have no solution for this yet.

naumenko-sa commented 1 year ago

Sorry, I missed it. I can reopen. @dauss75, could you please describe it again so that I can create a reproducible issue? SN

asee-imagia commented 1 year ago

Hi @naumenko-sa !

I'm running into a similar issue. As above, it is caused by a PON file provided to mutect2: OSError: [Errno 40] Too many levels of symbolic links: '/home/pipeline/bcbio/project/work/mutect2/panels/1000g.hg38.vcf.gz.tbi'

Here is a summary of our setup:

  1. We have a custom docker image in which bcbio v1.2.9 is installed. It lives at /home/pipeline/bcbio/ inside the container, and the bcbio work directory is at /home/pipeline/bcbio/project/work/.
  2. This container is launched as a Nextflow process, either locally (we use AWS EC2 instances), or via AWS Batch.
  3. When we run the pipeline locally on an EC2 instance, it completes without any error. The error is only thrown when the pipeline is run through AWS Batch.
  4. From within the container, before launching bcbio, we run cp -L 1000g.hg38.vcf.gz.tbi /home/pipeline/bcbio/1000g.hg38.vcf.gz.tbi, which copies the PON index file from the Nextflow staging area to a place where bcbio will find it (and similarly for other resources).
  5. We originally used mv, and then tried cp -L to see if it fixed the problem, to no avail.
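Whether `cp -L` really left a regular file behind can be verified before bcbio starts. The sketch below is illustrative only: `check_staged` is a hypothetical helper, and it uses `os.lstat` so that it inspects each entry itself rather than whatever the entry points to.

```python
import os
import stat


def check_staged(paths):
    """Classify each path as "regular", "symlink", "missing" or "other"."""
    report = {}
    for path in paths:
        try:
            mode = os.lstat(path).st_mode  # lstat does not follow symlinks
        except FileNotFoundError:
            report[path] = "missing"
            continue
        if stat.S_ISLNK(mode):
            report[path] = "symlink"
        elif stat.S_ISREG(mode):
            report[path] = "regular"
        else:
            report[path] = "other"
    return report
```

Running it on both the staged copy under `/home/pipeline/bcbio/` and the path named in the error message (under `work/mutect2/panels/`) would show which of the two is actually a symlink on AWS Batch.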

I've attached the full Nextflow logs, which contain the bcbio logs, for the run that failed with the symlink error on AWS Batch: bcbio.awsbatch.failure.log. I can also share the logs for the same run, which terminates successfully when it is run locally.

Thank you!