EichlerLab / smrtsv2

Structural variant caller
MIT License
53 stars 6 forks

Can't build conda deps. #54

osowiecki closed this issue 3 years ago

osowiecki commented 4 years ago

```
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: / Found conflicts! Looking for incompatible packages.
failed

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

Your python: python=3.6.2

If python is on the left-most side of the chain, that's the version you've
asked for. When python appears to the right, that indicates that the thing
on the left is somehow not available for the python version you are
constrained to. Note that conda will not change your python version to a
different minor version unless you explicitly specify that.

The following specifications were found to be incompatible with each other:

Package python_abi conflicts for:
boost=1.70.0 -> python[version='>=3.6,<3.7.0a0'] -> pip -> setuptools -> certifi[version='>=2016.09'] -> python_abi=2.7[build=*_cp27mu]
freebayes=1.3.1 -> python[version='>=2.7,<2.8.0a0'] -> pip -> setuptools -> certifi[version='>=2016.09'] -> python_abi=2.7[build=*_cp27mu]
python=3.6.2 -> pip -> setuptools -> python_abi[version='3.6|3.6.*',build='*_pypy36_pp73|*_cp36m']
freebayes=1.3.1 -> python[version='>=2.7,<2.8.0a0'] -> pip -> setuptools -> python_abi[version='3.6.*|3.7.*',build='*_cp37m|*_cp36m']
freebayes=1.3.1 -> python[version='>=2.7,<2.8.0a0'] -> python_abi==3.6[build=*_pypy36_pp73]
```

```
Makefile:16: recipe for target 'build/install_flags/env_tools_install' failed
make[1]: *** [build/install_flags/env_tools_install] Error 1
make[1]: Leaving directory '/media/2/CORN/smrtsv2/dep/conda'
Makefile:87: recipe for target 'install_flags/dep_conda_build' failed
make: *** [install_flags/dep_conda_build] Error 2
```

Can you list a working set of packages, or provide a Docker image of this application?

osowiecki commented 4 years ago

Removed all version numbers from the install_*.sh files and set the conda version to latest. Everything installed. I will report back if something crashes during the analysis.

gkaur commented 4 years ago

I'm facing the same issue, and I did as you suggested: removed all the version numbers from the install*.sh files.

I was wondering what the developers suggest regarding changing the dependency versions. Would this result in any changes to the results?

osowiecki commented 4 years ago

Here are my current package lists.

PACKAGES_PYTHON2

https://pastebin.com/BmJGw0g8

PACKAGES_PACBIO

https://pastebin.com/AZWwtZzR

PACKAGES_PYTHON3

https://pastebin.com/MdW29B0a

PACKAGES_TOOLS

https://pastebin.com/S8gfCkp8

gkaur commented 4 years ago

Thanks @osowiecki. Most of the versions are the same for me; the few exceptions are Python packages with slightly different version numbers.

Were you able to run this tool on your data successfully?

osowiecki commented 4 years ago

Still running (currently at 50% of the assembly step after 24 h). I had to move to a stronger machine because:

  1. "canu iteration count too high, stopping pipeline (most likely a problem in the grid-based computes)". Canu needed more resources: 40 threads and 189 GB of RAM weren't enough, but it ran on a stronger machine.

  2. "samtools merge: fail to open "assemble/group/gr-4-24000000-1000000/contig.bam": Too many open files". Increase your open-file limit on Linux and it will run.

  3. The Arrow polisher crashed on my PacBio RS II data due to an incompatible chemistry. I switched to Quiver and it runs.
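For item 2, besides `ulimit -n`, the soft limit can also be raised from inside a Python wrapper before the pipeline spawns samtools. A minimal stdlib-only sketch (the 4096 target is just an example value, not something SMRT-SV sets itself):

```python
import resource

# "Too many open files" means the process hit its soft RLIMIT_NOFILE.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit to at least 4096, but never above the hard limit
# (RLIM_INFINITY means the hard limit imposes no cap).
if hard == resource.RLIM_INFINITY:
    new_soft = max(soft, 4096)
else:
    new_soft = min(max(soft, 4096), hard)

resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```

Child processes such as samtools inherit the raised limit, so doing this once in the launcher is enough.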

osowiecki commented 4 years ago

```
samtools view -h assemble/local_assemblies.bam | python3 -s /media/raid/smrtsv2/scripts/call/TilingPath.py /dev/stdin > call/tiling_contigs.tab

Traceback (most recent call last):
  File "/media/raid/smrtsv2/scripts/call/TilingPath.py", line 82, in <module>
    ovpMidPoint = intvs.search(midPoint)
AttributeError: 'IntervalTree' object has no attribute 'search'
```

EDIT: `intervaltree=2.1.0` is required by this older version.

EDIT2: `scikit-learn=0.20.2`, the latest version in conda, is incompatible.
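The `'IntervalTree' object has no attribute 'search'` error is the intervaltree 3.x API change: `search()` was split into `at()` for point queries and `overlap()` for range queries. Instead of pinning 2.1.0, a small compatibility wrapper would also work. A sketch with stand-in classes (the `OldTree`/`NewTree` stubs are hypothetical, for illustration only; `at` and `search` are the real intervaltree method names):

```python
def search_compat(tree, point):
    """Point query that works with both intervaltree 2.x and 3.x."""
    if hasattr(tree, "at"):        # intervaltree >= 3.0
        return tree.at(point)
    return tree.search(point)      # intervaltree 2.x

# Hypothetical stand-ins mimicking the two APIs:
class OldTree:
    def search(self, point):
        return {("2.x-result", point)}

class NewTree:
    def at(self, point):
        return {("3.x-result", point)}
```

In TilingPath.py, `intvs.search(midPoint)` would then become `search_compat(intvs, midPoint)`.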

gkaur commented 4 years ago

By changing the miniconda version I was able to resolve the dependency issue. See here: https://github.com/EichlerLab/smrtsv2/issues/49#issuecomment-615439675

osowiecki commented 4 years ago

> By changing the miniconda version I was able to resolve the dependency issue. See here: #49 (comment)

Newer pandas also fails. I'll try your solution.

```
RuleException:
KeyError in line 242 of /media/raid/smrtsv2/rules/genotype.snakefile:
'0'
  File "/media/raid/smrtsv2/rules/genotype.snakefile", line 242, in __rule_gt_vcf_get_sample_column
  File "/media/raid/smrtsv2/smrtsvlib/genotype.py", line 65, in get_sample_column
  File "/media/raid/smrtsv2/dep/conda/build/envs/python3/lib/python3.6/site-packages/pandas/core/series.py", line 3848, in apply
  File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
  File "/media/raid/smrtsv2/smrtsvlib/genotype.py", line 65, in <lambda>
  File "/media/raid/smrtsv2/dep/conda/build/envs/python3/lib/python3.6/concurrent/futures/thread.py", line 55, in run
Exiting because a job execution failed. Look above for error message
```

paudano commented 4 years ago

It's looking for a SEX column in the sample manifest file. Make sure the file is set up correctly (double-check the "Sample table" section of GENOTYPE.md).

Thanks for helping to resolve the dependency issues. The version pinning accounts for too many issues on GitHub, so I am probably going to remove it and hope for the best.

Also, SMRT-SV has not been updated to use pbgcpp instead of Arrow, and pbgcpp will have the latest chemistries. The syntax should be similar, though.

osowiecki commented 4 years ago

What do you think about this? A warning to ignore, or a serious error?

```
[Tue Apr 21 13:34:40 2020]
rule gt_call_sample_insert_delta:
    input: altref/ref.fasta, sv_calls/sv_calls.bed, samples/s79757-500bp/alignments.cram
    output: samples/s79757-500bp/temp/insert_delta.tab, samples/s79757-500bp/insert_size_stats.tab
    jobid: 24
    wildcards: sample=s79757-500bp
```

```
/media/raid/smrtsv2/scripts/genotype/GetInsertSizeDelta.py:204: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  sv_rec['N_INSERT'] = len(insert_array)

/media/raid/smrtsv2/dep/conda/build/envs/python3/lib/python3.6/site-packages/pandas/core/series.py:915: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.loc[key] = value

/media/raid/smrtsv2/scripts/genotype/GetInsertSizeDelta.py:214: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  sv_rec['INSERT_LOWER'] = 0

/media/raid/smrtsv2/scripts/genotype/GetInsertSizeDelta.py:215: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  sv_rec['INSERT_UPPER'] = 0

/media/raid/smrtsv2/scripts/genotype/GetInsertSizeDelta.py:210: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  sv_rec['INSERT_LOWER'] = sum(insert_array < -z_limit) / sv_rec['N_INSERT']

/media/raid/smrtsv2/scripts/genotype/GetInsertSizeDelta.py:211: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  sv_rec['INSERT_UPPER'] = sum(insert_array > z_limit) / sv_rec['N_INSERT']
```
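These are warnings rather than errors: the script assigns into a row slice (`sv_rec`) taken from a DataFrame, and pandas cannot tell whether the write is meant to propagate back to the parent. When the slice is only used locally, taking an explicit copy silences the warning without changing the numbers. A minimal sketch (column name borrowed from the log above; this is not a patch from the developers):

```python
import pandas as pd

df = pd.DataFrame({"N_INSERT": [0], "INSERT_LOWER": [0.0]})

# sv_rec = df.iloc[0]          # view-like slice of df: assigning into it
# sv_rec["N_INSERT"] = 5       # can raise SettingWithCopyWarning

sv_rec = df.iloc[0].copy()     # explicit copy: no warning
sv_rec["N_INSERT"] = 5         # modifies the copy, not df
```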

osowiecki commented 4 years ago

> It's looking for a SEX column in the sample manifest file. Make sure the file is set up correctly (double-check the "Sample table" section of GENOTYPE.md).

```
SAMPLE        SEX  DATA
s79757-500bp  U    /media/raid/bam/s79757-500bp.bam
s79757-400bp  U    /media/raid/bam/s79757-400bp.bam
s79757-11kb   U    /media/raid/bam/s79757-11kb.bam
s79757-8kb    U    /media/raid/bam/s79757-8kb.bam
```

These are my original Illumina BAM files. I don't know what else I can change in the tab file.

The script fails exactly here in genotype.py :

# Set genotype (GT), genotype quality (GQ), and genotype likelihood (GL)
df_gt['CLASS'] = df_gt.apply(
    lambda row: str(np.argmax(row[['HOM_REF', 'HET', 'HOM_ALT']])) if row['CALLABLE'] else 'NO_CALL',
    axis=1
)

df_gt is fine before that step. I assume something is wrong with the package versions. Investigating.
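The `KeyError: '0'` fits a known pandas behavior change rather than bad input: older pandas returned the index label for `np.argmax` on a Series (e.g. `'HET'`), while newer versions return the position (e.g. `1`), so `str(np.argmax(...))` yields `'0'`/`'1'`/`'2'` and the downstream genotype lookup fails. On newer pandas, `idxmax()` returns the label directly. A sketch of the difference (an illustration, not a patch from the developers):

```python
import pandas as pd

row = pd.Series({"HOM_REF": 0.1, "HET": 0.7, "HOM_ALT": 0.2})

# On recent pandas, row.argmax() (what np.argmax dispatches to)
# returns the position 1, not the label 'HET'.
label = row.idxmax()   # returns the index label of the maximum
```

This would explain why pinning the older pandas version makes the genotyping step pass.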

[EDIT]

With this set of packages, it genotyped my samples without problems. How about that?

python2

```
conda install -y \
    numpy=1.15.4 \
    scipy=1.1.0 \
    pandas=0.23.4 \
    pysam=0.15.1 \
    biopython=1.72 \
    intervaltree=2.1.0 \
    networkx=2.2 \
    pybedtools=0.8.0
```

python3

```
conda install -y \
    numpy=1.18.1 \
    scipy=1.4.1 \
    pandas=0.20.3 \
    pysam \
    snakemake \
    biopython \
    ipython \
    drmaa \
    scikit-learn=0.19.0 \
    intervaltree=2.1.0
```

gkaur commented 4 years ago

Has anyone used this on a human genome sample? I am running the assemble step on human PacBio data and it is taking really long. Is there a way to distribute the jobs in the assemble step?

Currently running `smrtsv2-2.0.2/smrtsv assemble --asm-cpu 68 --asm-mem 10`.

In the end, I would like to run this tool on a large number of samples. Any ideas to speed up the processing would be great!

osowiecki commented 4 years ago

> Has anyone used this on a human genome sample? I am running the assemble step on human PacBio data and it is taking really long. Is there a way to distribute the jobs in the assemble step?
>
> Currently running `smrtsv2-2.0.2/smrtsv assemble --asm-cpu 68 --asm-mem 10`.
>
> In the end, I would like to run this tool on a large number of samples. Any ideas to speed up the processing would be great!

Use `--asm-parallel`:

```
ulimit -n 4096
./smrtsv --tempdir /media/raid/SMRT/temp assemble --asm-cpu 15 --asm-parallel 8 --asm-polish quiver
```

gkaur commented 4 years ago

I am using version 2.0.2 and get this error: `smrtsv.py: error: unrecognized arguments: --asm-parallel`

Which version are you using?

osowiecki commented 4 years ago

> I am using version 2.0.2 and get this error: `smrtsv.py: error: unrecognized arguments: --asm-parallel`
>
> Which version are you using?

I'm using the latest one as of today.

`./smrtsv assemble -h`

`--asm-parallel` works only in the "assemble" step. Start that particular step manually and then use "call" and "genotype".

gkaur commented 4 years ago

I am using the latest too. Maybe you cloned the git repository?

This is what I see:

```
smrtsv2-2.0.2/smrtsv assemble -h
usage: smrtsv.py assemble [-h]
                          [--asm-alignment-parameters ASM_ALIGNMENT_PARAMETERS]
                          [--mapping-quality MAPPING_QUALITY]
                          [--asm-cpu ASM_CPU] [--asm-mem ASM_MEM]
                          [--asm-polish ASM_POLISH]
                          [--asm-group-rt ASM_GROUP_RT] [--asm-rt ASM_RT]

optional arguments:
  -h, --help            show this help message and exit
  --asm-alignment-parameters ASM_ALIGNMENT_PARAMETERS
                        BLASR parameters to use to align local assemblies.
  --mapping-quality MAPPING_QUALITY
                        Minimum mapping quality of raw reads. Used by "detect"
                        to filter reads while finding gaps and hardstops. Used
                        by "assemble" to filter reads with low mapping quality
                        before the assembly step.
  --asm-cpu ASM_CPU     Number of CPUs to use for assembly steps.
  --asm-mem ASM_MEM     Multiply this amount of memory by the number of cores
                        for the amount of memory allocated to assembly steps.
  --asm-polish ASM_POLISH
                        Assembly polishing method (arrow|quiver). "arrow"
                        should work on all PacBio data, but "quiver" will only
                        work on RS II input.
  --asm-group-rt ASM_GROUP_RT
                        Set maximum runtime for an assembly group. Assemblies
                        are grouped by region, and multiple assemblies are
                        done in one grouped job. This is the maximum runtime
                        for the whole group.
  --asm-rt ASM_RT       Set maximum runtime for an assembly region. This
                        should be a valid argument for the Linux "timeout"
                        command.
```

osowiecki commented 4 years ago

I'm using the cloned repository: `git clone https://github.com/EichlerLab/smrtsv2`

```
optional arguments:
  -h, --help            show this help message and exit
  --asm-alignment-parameters ASM_ALIGNMENT_PARAMETERS
                        BLASR parameters to use to align local assemblies.
  --mapping-quality MAPPING_QUALITY
                        Minimum mapping quality of raw reads. Used by "detect"
                        to filter reads while finding gaps and hardstops. Used
                        by "assemble" to filter reads with low mapping quality
                        before the assembly step.
  --asm-cpu ASM_CPU     Number of CPUs to use for assembly steps.
  --asm-mem ASM_MEM     Multiply this amount of memory by the number of cores
                        for the amount of memory allocated to assembly steps.
                        If multiple simultaneous assemblies are run, then this
                        is multiplied again by that factor (see
                        --asm-parallel).
  --asm-polish ASM_POLISH
                        Assembly polishing method (arrow|quiver). "arrow"
                        should work on all PacBio data, but "quiver" will only
                        work on RS II input.
  --asm-group-rt ASM_GROUP_RT
                        Set maximum runtime for an assembly group. Assemblies
                        are grouped by region, and multiple assemblies are
                        done in one grouped job. This is the maximum runtime
                        for the whole group.
  --asm-rt ASM_RT       Set maximum runtime for an assembly region. This
                        should be a valid argument for the Linux "timeout"
                        command.
  --asm-parallel ASM_PARALLEL
                        Number of simultaneous assemblies to run. The actual
                        thread count will be this times --asm-cpu
```

gkaur commented 4 years ago

Thanks! I have got this working.

What is your experience with restarting the steps? Do you know if the assemble step restarts well if the job gets killed in the middle?

paudano commented 4 years ago

SMRT-SV batches assemblies into megabase-sized regions. Incomplete batches will be re-run if the pipeline is restarted.

The whole pipeline can be distributed with DRMAA. We used it on an SGE cluster.

```
SMRTSV_DIR=/path/to/SMRT-SV
FOFN_FILE=/path/to/pacbio-bam/sample.fofn
REF_FA=/path/to/hg38.no_alt.fa

${SMRTSV_DIR}/smrtsv.py --cluster-config ${SMRTSV_DIR}/cluster.eichler.json --drmaalib /path/to/libdrmaa.so.1.0 --distribute run --batches 20 --runjobs "25,20,200,10" --threads 8 ${REF_FA} ${FOFN_FILE}
```

You'll have to adjust cluster.eichler.json for your cluster. For ours, it multiplies the memory by the number of cores (e.g. 4 cores and 2 GB is 8 GB total). You'll also have to adjust the cluster parameters (--cluster_params), which the values from cluster.eichler.json are dropped into for each rule. If those two things together produce parameter strings that your cluster accepts, then it should work.

Yes, we have run many human samples through it. On most samples, it takes more than a week with 1,000 cores.

SMRT-SV is a useful tool, but I wouldn't run it without doing PBSV first (Sniffles second). Because it relies on squashed assemblies, it is going to miss about 40% of your heterozygous SVs.