EichlerLab / smrtsv2

Structural variant caller
MIT License
53 stars 6 forks source link

Detect step using local python installation #17

Closed RSherman15 closed 5 years ago

RSherman15 commented 5 years ago

I'm getting the following error on the detect step, which seems to be due to the use of my machine's installed python rather than smrtsv2s conda environments. However, I'm not sure how to force it to use the correct python install, if that is indeed the issue.

rule detect_gaps_merge_batches:
    input: detect/gaps/batch/0_ins.bed, detect/gaps/batch/1_ins.bed, detect/gaps/batch/2_ins.bed, detect/gaps/batch/3_ins.bed, detect/gaps/batch/4_ins.bed, detect/gaps/batch/5_ins.bed, detect/gaps/batch/6_ins.bed, detect/gaps/batch/7_ins.bed, detect/gaps/batch/8_ins.bed, detect/gaps/batch/9_ins.bed, detect/gaps/batch/10_ins.bed, detect/gaps/batch/11_ins.bed, detect/gaps/batch/12_ins.bed, detect/gaps/batch/13_ins.bed, detect/gaps/batch/14_ins.bed, detect/gaps/batch/15_ins.bed, detect/gaps/batch/16_ins.bed, detect/gaps/batch/17_ins.bed, detect/gaps/batch/18_ins.bed, detect/gaps/batch/19_ins.bed
    output: detect/gaps/gaps_no_cov_ins.bed
    jobid: 40
    wildcards: svtype=ins

Traceback (most recent call last):
  File "/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/scripts/detect/PrintGapSupport.py", line 4, in <module>
    import numpy as np
  File "/home-4/rsherma8@jhu.edu/.local/lib/python2.7/site-packages/numpy/__init__.py", line 142, in <module>
    from . import add_newdocs
  File "/home-4/rsherma8@jhu.edu/.local/lib/python2.7/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/home-4/rsherma8@jhu.edu/.local/lib/python2.7/site-packages/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/home-4/rsherma8@jhu.edu/.local/lib/python2.7/site-packages/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/home-4/rsherma8@jhu.edu/.local/lib/python2.7/site-packages/numpy/core/__init__.py", line 26, in <module>
    raise ImportError(msg)
ImportError:
Importing the multiarray numpy extension module failed.  Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control).  Otherwise reinstall numpy.

Original error was: /home-4/rsherma8@jhu.edu/.local/lib/python2.7/site-packages/numpy/core/multiarray.so: undefined symbol: PyUnicodeUCS2_FromObject
paudano commented 5 years ago

When you run SMRT-SV, are you running the the smrtsv shell script, or smrtsv.py? The shell script sets up the environment for SMRT-SV and runs the pipeline. If you are running smrtsv.py directly, it will use your external environment.

Originally, README.md had this wrong, but I have since fixed that.

Let me know if that resolves this issue.

RSherman15 commented 5 years ago

I am using smrtsv.py -- hadn't noticed that changed in the README. However, I just tried running with smrtsv and got the same error.

paudano commented 5 years ago

I see. It’s finding user installed packages in your home directory. Looks like python -s will do ignore those. I will need to make this change to every python script call in the pipeline.

paudano commented 5 years ago

Made changes so SMRT-SV run python2 and python3 with -s to ignore site-packages in home directories.

Thanks for reporting this.

RSherman15 commented 5 years ago

The fix you made solved this for the detect step, but I'm now seeing the same issue in the assemble step. Is it possible there's a python call you missed in the latest commit (I couldn't find one, but maybe I missed something)?

Here's the piece of the log file with the error in the assemble stage:

[Thu Mar  7 02:33:56 2019]
rule assemble_polish:
    input: region/6-32740000-60000/asm/contigs.fasta, region/6-32740000-60000/asm/contigs.fasta.fai, region/6-32740000-60000/asm/contig_aligned_reads.bam, region/6-32740000-60000/asm/contig_aligned_reads.bam.pbi
    output: region/6-32740000-60000/asm/contigs_polished.fasta
    jobid: 471
    wildcards: region_id=6-32740000-60000
    resources: threads=24

/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/python3/lib/python3.6/site-packages/snakemake/__init__.py:1559: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  export_cwl=args.export_cwl)
ESC[33mJob counts:
        count   jobs
        1       assemble_polish
        1ESC[0m
Traceback (most recent call last):
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/bin/variantCaller", line 3, in <module>
    from GenomicConsensus.main import main
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/GenomicConsensus/main.py", line 10, in <module>
    import pysam
  File "/home-4/rsherma8@jhu.edu/.local/lib/python2.7/site-packages/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *
ImportError: /home-4/rsherma8@jhu.edu/.local/lib/python2.7/site-packages/pysam/libchtslib.so: undefined symbol: PyUnicodeUCS2_FromStringAndSize
paudano commented 5 years ago

I was calling variantCaller directly, which is the PacBio program that polishes assemblies (it is the quiver and arrow polishing step). I needed to make a change so that it explicitly calls python from within the PacBio environment.

To run this, first cd into dep (in the SMRT-SV install directory) and run: ln -sf ../conda/build/envs/pacbio/bin/python2 bin/pb_python2

Note: The build system will do this for new installs, I am just having you run it so you don't have to rebuild dep.

After you make make that link, pull down the latest commit and run it. Let me know how that turns out.

RSherman15 commented 5 years ago

This caused a different issue instead:

CalledProcessError in line 250 of /home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/rules/assemble_group.snakefile:
Command ' set -euo pipefail;  pb_python2 -s variantCaller --referenceFilename region/18-11720000-60000/asm/contigs.fasta region/18-11720000-60000/asm/contig_aligned_reads.bam -o region/18-11720000-60000/asm/contigs_polished.fasta -j 24 --algorithm=arrow; ' returned non-zero exit status 2.
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/rules/assemble_group.snakefile", line 250, in __rule_assemble_polish
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/python3/lib/python3.6/concurrent/futures/thread.py", line 55, in runESC[0m
ESC[31mExiting because a job execution failed. Look above for error message
paudano commented 5 years ago

I pushed a change. Can you try it? I wasn't calling variantCaller correctly.

RSherman15 commented 5 years ago

Still an error:

 count   jobs
        1       assemble_polish
        1ESC[0m
'FRAMERATEHZ'
Traceback (most recent call last):
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcommand/cli/core.py", line 138, in _pacbio_main_runner
    return_code = exe_main_func(*args, **kwargs)
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/GenomicConsensus/main.py", line 340, in args_runner
    return tr.main()
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/GenomicConsensus/main.py", line 258, in main
    with AlignmentSet(options.inputFilename) as peekFile:
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 2746, in __init__
    super(AlignmentSet, self).__init__(*files, **kwargs)
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 1994, in __init__
    super(ReadSet, self).__init__(*files, **kwargs)
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 461, in __init__
    self.updateCounts()
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 2574, in updateCount
s
    self.assertIndexed()
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 2404, in assertIndex
ed
    self._assertIndexed((IndexedBamReader, CmpH5Reader))
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 1951, in _assertInde
xed
    self._openFiles()
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 2075, in _openFiles
    resource = IndexedBamReader(location)
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/align/BamIO.py", line 374, in __init__
    super(IndexedBamReader, self).__init__(fname, referenceFastaFname)
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/align/BamIO.py", line 183, in __init__
    self._loadReadGroupInfo()
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/align/BamIO.py", line 95, in _loadReadGroupInfo
    rgFrameRate = ds["FRAMERATEHZ"]
KeyError: 'FRAMERATEHZ'
[ERROR] 'FRAMERATEHZ'
Traceback (most recent call last):
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcommand/cli/core.py", line 138, in _pacbio_main_runner
    return_code = exe_main_func(*args, **kwargs)
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/GenomicConsensus/main.py", line 340, in args_runner
    return tr.main()
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/GenomicConsensus/main.py", line 258, in main
    with AlignmentSet(options.inputFilename) as peekFile:
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 2746, in __init__
    super(AlignmentSet, self).__init__(*files, **kwargs)
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 1994, in __init__
    super(ReadSet, self).__init__(*files, **kwargs)
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 461, in __init__
    self.updateCounts()
 File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 2574, in updateCounts
    self.assertIndexed()
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 2404, in assertIndexed
    self._assertIndexed((IndexedBamReader, CmpH5Reader))
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 1951, in _assertIndexed
    self._openFiles()
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/dataset/DataSetIO.py", line 2075, in _openFiles
    resource = IndexedBamReader(location)
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/align/BamIO.py", line 374, in __init__
    super(IndexedBamReader, self).__init__(fname, referenceFastaFname)
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/align/BamIO.py", line 183, in __init__
    self._loadReadGroupInfo()
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/pacbio/lib/python2.7/site-packages/pbcore/io/align/BamIO.py", line 95, in _loadReadGroupInfo
    rgFrameRate = ds["FRAMERATEHZ"]
KeyError: 'FRAMERATEHZ'
ESC[32m[Fri Mar  8 03:37:59 2019]ESC[0m
ESC[31mError in rule assemble_polish:ESC[0m
ESC[31m    jobid: 0ESC[0m
ESC[31m    output: region/14-74700000-60000/asm/contigs_polished.fastaESC[0m
ESC[31mESC[0m
ESC[31mRuleException:
CalledProcessError in line 250 of /home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/rules/assemble_group.snakefile:
Command ' set -euo pipefail;  pb_python2 -s /home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/bin/variantCaller --referenceFilename region/14-74700000-60000/asm/contigs.fasta region/14-74700000-60000/asm/contig_aligned_reads.bam -o region/14-74700000-60000/asm/contigs_polished.fasta -j 24 --algorithm=arrow; ' returned non-zero exit status 2.
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/rules/assemble_group.snakefile", line 250, in __rule_assemble_polish
  File "/home-net/home-4/rsherma8@jhu.edu/bin/packages/smrtsv2/dep/conda/build/envs/python3/lib/python3.6/concurrent/futures/thread.py", line 55, in runESC[0m
ESC[31mExiting because a job execution failed. Look above for error messageESC[0m
paudano commented 5 years ago

genomicConsensus can be really frustrating. At least it's running, and I don't see any sign of site-packages from your home directory in the output, so that fix seems to have worked.

Questions:

I found two issues with this error: https://github.com/PacificBiosciences/Bioinformatics-Training/issues/42 https://github.com/PacificBiosciences/kineticsTools/issues/58

From the first issue, it looks like running from BAM files might be necessary. The second one was fixed by pbindex, but the pipeline already does that.

RSherman15 commented 5 years ago

I'm using RS II reads, in bax.h5 format for the HG002 Genome in a Bottle data (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/hdf5).

Are you saying I'll need to rerun the whole pipeline with bams?

paudano commented 5 years ago

First try running with --asm-polish for the run or assemble command. See if that makes genomicConsensus happy. If not, then you might need to run from BAM files to appease it. If that doesn't work, then I am out of ideas. It might be a question for PacBio.

paudano commented 5 years ago

Does running with --asm-polish quiver make any difference?

RSherman15 commented 5 years ago

No, it didn't help -- same error. I still need to try re-running with bams.

paudano commented 5 years ago

We are going to try replicating this problem with the same data. I'll let you know if we can get it running from BAMs. If so, we will modify the pipeline to fail on BAX input.

RSherman15 commented 5 years ago

Any update on this?

wharvey31 commented 5 years ago

Hi @RSherman15. I have taken over this issue, and am currently attempting to replicate your error. I am currently in the detect step, and will hope to have an answer in the coming days.

RSherman15 commented 5 years ago

Assemblies now seem to be completing, after re-running the whole thing on subread.bam files from bax2bam --allowUnrecognizedChemistryTriple. I had to use --asm-polish quiver, it still crashed with Arrow. It is still in the assemble step, so not sure it will make it to the end and through the genotyping step (which is my real goal) but for now this appears solved. It did waste a lot of time and compute resources to re-run from scratch though; I'd suggest clarifying in the README which starting file formats are allowed for which chemistries.

paudano commented 5 years ago

We had no idea that there would be such a difference running the same data from BAX or from BAM files. In any case, it requires subreads BAM files now and rejects everything else.