bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

mirge 2.0 error #2379

Closed mshadbolt closed 5 years ago

mshadbolt commented 6 years ago

Hi, I'm trying to run the bcbio small RNA-seq pipeline with miRge and seqbuster, but I run into the following error on the dev version:

[2018-04-25T17:01Z] ['gff', '--sps', 'hsa', '--hairpin', '~/software/bcbio-nextgen/data/genomes/Hsapiens/hg38-noalt/srnaseq/hairpin.fa', '--gtf', '~/software/bcbio-nextgen/data/genomes/Hsapiens/hg38-noalt/srnaseq/mirbase.gff3', '--format', 'seqbuster', '-o', '~/work/bcbiotx/tmp0UFt0o', '~/work/mirbase/m05898_s_1_CGTGAT_trimmed/m05898_s_1_CGTGAT_trimmed.mirna']
Traceback (most recent call last):
  File "~/software/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 241, in <module>
    main(**kwargs)
  File "~/software/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 43, in run_main
    fc_dir, run_info_yaml)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 87, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 323, in smallrnaseqpipeline
    samples = run_parallel("srna_annotation", samples)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1)(joblib.delayed(fn)(x) for x in items):
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 52, in wrapper
    return apply(f, *args, **kwargs)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 151, in srna_annotation
    return srna.sample_annotation(*args)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/srna/sample.py", line 123, in sample_annotation
    data['config'])
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/srna/sample.py", line 261, in _mirtop
    os.path.join(out_dir, out_fn))
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/shutil.py", line 316, in move
    copy2(src, real_dst)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/shutil.py", line 144, in copy2
    copyfile(src, dst)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/shutil.py", line 96, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: '~/work/bcbiotx/tmp0UFt0o/m05898_s_1_CGTGAT_trimmed.gff'
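
The IOError at the bottom means `shutil.move` was handed a source `.gff` path that mirtop never actually wrote, so the failure surfaces far from its cause. A defensive wrapper (a hypothetical sketch, not bcbio's actual code) would make that failure mode explicit:

```python
import os
import shutil


def safe_move(src, dst):
    """Move src to dst, failing early with a clear message if src is missing.

    A missing src usually means the upstream tool (here, mirtop) wrote its
    output under a different name than the calling code expected.
    """
    if not os.path.exists(src):
        raise IOError("Expected output %s was not created; the upstream tool "
                      "may have changed its output naming." % src)
    shutil.move(src, dst)
```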

My yaml config looks like this:

resources:
  # default options, used if other items below are not present
  # avoids needing to configure/adjust for every program
  default:
    memory: 3.6G  # 3.6*32 ~= 115G
    cores: 32
    jvm_opts: ["-Xms800m", "-Xmx3600m"]
  gatk:
    jvm_opts: ["-Xms800m", "-Xmx3600m"]
  snpeff:
    jvm_opts: ["-Xms800m", "-Xmx3600m"]
  qualimap:
    memory: 4g
  express:
    memory: 8g
  dexseq:
    memory: 10g
  macs2:
    memory: 8g
  seqcluster:
    memory: 8g
  mirge:
    options: ["-lib ~/software/miRge2/miRge.Libs"]
details:
  - analysis: smallRNA-seq
    algorithm:
      trim_reads: false
      aligner: star
      expression_caller: [seqbuster, mirge]
      species: hsa
    genome_build: hg38-noalt
upload:
  dir: ../final

I then tried the stable version to see if it behaved the same, and I get the exact same error.

Let me know if you need any further info or want me to try anything else.

lpantano commented 6 years ago

Hi,

Sorry about this. I updated mirtop without updating the code in bcbio, but it should be fixed in the latest bcbio devel.

Thanks for trying.


mshadbolt commented 6 years ago

Great, thanks. I will try it out and let you know how it goes.

mshadbolt commented 6 years ago

OK, I think it got a bit further, but now it is complaining about not finding libs:

[2018-04-25T22:27Z] ['gff', '--sps', 'hsa', '--hairpin', '~/software/bcbio-nextgen/data/genomes/Hsapiens/hg38-noalt/srnaseq/hairpin.fa', '--gtf', '~/software/bcbio-nextgen/data/genomes/Hsapiens/hg38-noalt/srnaseq/mirbase.gff3', '--format', 'seqbuster', '-o', '~/work/bcbiotx/tmptDj1z0', '~/work/mirbase/m06174_s_2_CGTGAT_trimmed/m06174_s_2_CGTGAT_trimmed.mirna']
[2018-04-25T22:27Z] Looking for mirdeep2 database for m06174_s_2_CGTGAT_trimmed
[2018-04-25T22:27Z] Resource requests: seqcluster; memory: 8.00; cores: 1
[2018-04-25T22:27Z] Configuring 1 jobs to run, using 1 cores each with 8.00g of memory reserved for each job
[2018-04-25T22:27Z] Timing: cluster
[2018-04-25T22:27Z] multiprocessing: seqcluster_cluster
Traceback (most recent call last):
  File "~/software/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 241, in <module>
    main(**kwargs)
  File "~/software/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 45, in run_main
    fc_dir, run_info_yaml)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 89, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 336, in smallrnaseqpipeline
    samples = run_parallel("seqcluster_cluster", [samples])
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1)(joblib.delayed(fn)(x) for x in items):
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 52, in wrapper
    return apply(f, *args, **kwargs)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 159, in seqcluster_cluster
    return seqcluster.run_cluster(*args)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/srna/group.py", line 102, in run_cluster
    sample["mirge"] = mirge.run(data)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/srna/mirge.py", line 25, in run
    lib = _find_lib(sample)
  File "~/software/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/srna/mirge.py", line 73, in _find_lib
    if not libs:
NameError: global name 'libs' is not defined
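
The NameError is the classic pattern of a name bound only on some branches: if the miRge resources entry is missing, `libs` is never assigned before the `if not libs:` check runs. A hypothetical sketch of the fixed shape (not bcbio's actual `_find_lib`):

```python
def find_lib(config):
    """Look up the miRge library path from the resources configuration.

    Binding `libs` unconditionally before the final check turns the
    NameError into a clear configuration error when the `-lib` option
    was never provided.
    """
    libs = None  # bind up front so the later check cannot raise NameError
    options = config.get("resources", {}).get("mirge", {}).get("options", [])
    for opt in options:
        if opt.strip().startswith("-lib"):
            libs = opt.split()[-1]
    if not libs:
        raise ValueError("miRge requires the -lib option under "
                         "resources: mirge: options in the YAML config.")
    return libs
```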
lpantano commented 6 years ago

Hi,

I pushed a fix for this today, but the short story is that you need to set up the lib parameter that is plugged into miRge.

Right now, miRge has to be installed manually. I am still working on this; it will take a while because it has very restricted dependency versions, so I am trying to update that in the source package.

You also need to download the lib library and set it up as explained here:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#smallrna-seq

Let me know if you find more issues.

Thanks


mshadbolt commented 6 years ago

Hi Lorena

Yes, I realised later that I hadn't set the `-lib` path correctly in my YAML file. I then ran into issues installing miRge: initially I installed it locally, but I usually need to unset my local library to get bcbio to run. I was able to work around it by creating a virtualenv with miRge and its package dependencies installed, then running bcbio within that. I managed to get miRge to run successfully, BUT it only ran on one of my samples instead of all 3 that I had included in the YAML file.

Perhaps the problem is in the file at ~/work/mirge/sample_file.txt? This file only contains the path to one of my samples, but I believe it should be how you specify multiple samples to miRge?

It ran the other parts of the pipeline, STAR and seqbuster, on all three samples.
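
For what it's worth, the one-sample symptom looks consistent with the sample file being rewritten per sample rather than accumulated. A hypothetical single-pass writer (I'm assuming one input path per line, matching my sample_file.txt; this is not bcbio's actual implementation):

```python
def write_mirge_sample_file(fastq_files, out_fn):
    """Write all sample paths to the miRge sample file in one pass.

    Opening the file once and writing every path avoids the bug pattern
    where a per-sample loop truncates the file on each iteration, leaving
    only the last sample behind.
    """
    with open(out_fn, "w") as out_handle:
        for fn in fastq_files:
            out_handle.write(fn + "\n")
    return out_fn
```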

Thanks again for your help :)

lpantano commented 6 years ago

Hi,

Oh, sorry. Yes, you are right: there was a bug there. I just pushed a fix for that (hopefully).

Sorry for the painful miRge installation; I am trying to fix that as well.

Thanks!

mshadbolt commented 6 years ago

Awesome, thanks. I am running it now on a bigger set of samples, so I will let you know if I run into any more issues.

mshadbolt commented 6 years ago

Hi again. Not sure if you can help with this one or if it is more a problem with miRge, but I found that I ran out of memory when trying to run miRge on a large number of samples. I have set up the bcbio system resource config settings so that they stay within my system's limits, but perhaps because miRge is installed separately it doesn't pay attention to these settings? I couldn't find any settings in their documentation to control memory usage. I guess it's because it processes everything as one batch and holds it all in memory instead of saving intermediate files to disk.

FYI, I was trying to run it with 372 samples, but it failed when it reached 100. I was trying to keep it within 64GB as it's a shared system; in total we have 128GB.

I might try running standalone miRge in batches, or maybe one sample at a time and then merge the results at the end. It might not be an issue for people who have lots of memory or not many samples, but I thought it would be good to let you know so you're aware of its memory-hogging ways ;)
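
The batching workaround could be as simple as a chunking helper (a sketch; the batch size is just a guess at what fits in 64GB, since miRge failed somewhere around 100 samples):

```python
def chunk_samples(samples, batch_size):
    """Split samples into fixed-size batches so each standalone miRge run
    only holds batch_size samples' worth of reads in memory at once."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
```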

lpantano commented 6 years ago

Hi,

The only thing I can do is add miRge to the tools for which you can set memory. If you were already running in local mode with 64GB, then I cannot do much more, and I would suggest opening an issue with them. Maybe they are interested in debugging this.

I pushed the fix, and now if you set up mirge in the bcbio_system.yaml file with the memory you want, it should allocate that to that specific job. Previously it was 8G, following the seqcluster recommendation when running in IPython mode.
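
For example, following the same convention as the other tools in your resources section (the memory value here is just an illustration, pick what fits your system):

```yaml
resources:
  mirge:
    memory: 16g   # per-job reservation, replacing the previous 8G default
    options: ["-lib ~/software/miRge2/miRge.Libs"]
```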

I will try to reproduce it as well; I think I have one project with a similar number of samples.

Thanks!

roryk commented 5 years ago

Thanks. It looks like Lorena fixed this issue and there hasn't been any further activity, so closing.