Data installation fails at GGD srnaseq

jefarrar commented 8 years ago

Creating manifest of installed packages in /opt/bcbio/manifest
Third party tools upgrade complete.
Installing additional tools
Upgrading bcbio-nextgen data files
Initialized empty Git repository in /opt/tmpbcbio-install/cloudbiolinux/.git/
remote: Counting objects: 12300, done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 12300 (delta 0), reused 0 (delta 0), pack-reused 12294
Receiving objects: 100% (12300/12300), 7.95 MiB | 3.22 MiB/s, done.
Resolving deltas: 100% (7026/7026), done.
Setting up virtual machine
INFO: <cloudbio.flavor.Flavor instance at 0x7f68f182b170>
INFO: <cloudbio.flavor.Flavor instance at 0x7f68f182b170>
INFO: This is a ngs_pipeline_minimal flavor
INFO: This is a ngs_pipeline_minimal flavor
INFO: Distribution auto
INFO: Distribution auto
INFO: Get local environment
INFO: Get local environment
INFO: CentOS setup
INFO: CentOS setup
WARN [distribution.py(216)]: NixPkgs are currently not supported for centos
WARN [distribution.py(216)]: NixPkgs are currently not supported for centos
DBG [distribution.py]: NixPkgs: Ignored
DBG [distribution.py]: NixPkgs: Ignored
[localhost] local: echo $HOME
[localhost] local: uname -m
INFO: Now, testing connection to host...
INFO: Now, testing connection to host...
INFO: Connection to host appears to work!
INFO: Connection to host appears to work!
DBG [utils.py]: Expand paths
DBG [utils.py]: Expand paths
INFO: List of genomes to get (from the config file at '{'install_liftover': False, 'annotation_groups': {'rnaseq': ['transcripts', 'RADAR'], 'smallrna': ['mirbase'], 'variation': ['problem_regions', 'GA4GH_problem_regions', 'MIG', 'prioritize', 'dbsnp', 'hapmap', '1000g_omni_snps', '1000g_snps', 'mills_indels', '1000g_indels', 'clinvar', 'cosmic', 'ancestral', 'qsignature']}, 'genome_indexes': ['rtg'], 'genomes': [{'annotations': ['mirbase', 'GA4GH_problem_regions', 'MIG', 'prioritize', 'dbsnp', 'hapmap', '1000g_omni_snps', '1000g_snps', 'mills_indels', 'cosmic', 'ancestral', 'qsignature', 'transcripts', 'RADAR'], 'validation': ['giab-NA12878', 'dream-syn3', 'dream-syn4'], 'name': 'Human (GRCh37)', 'dbkey': 'GRCh37', 'annotations_available': ['battenberg', 'dbnsfp']}], 'install_uniref': False}'): Human (GRCh37)
INFO: List of genomes to get (from the config file at '{'install_liftover': False, 'annotation_groups': {'rnaseq': ['transcripts', 'RADAR'], 'smallrna': ['mirbase'], 'variation': ['problem_regions', 'GA4GH_problem_regions', 'MIG', 'prioritize', 'dbsnp', 'hapmap', '1000g_omni_snps', '1000g_snps', 'mills_indels', '1000g_indels', 'clinvar', 'cosmic', 'ancestral', 'qsignature']}, 'genome_indexes': ['rtg'], 'genomes': [{'annotations': ['mirbase', 'GA4GH_problem_regions', 'MIG', 'prioritize', 'dbsnp', 'hapmap', '1000g_omni_snps', '1000g_snps', 'mills_indels', 'cosmic', 'ancestral', 'qsignature', 'transcripts', 'RADAR'], 'validation': ['giab-NA12878', 'dream-syn3', 'dream-syn4'], 'name': 'Human (GRCh37)', 'dbkey': 'GRCh37', 'annotations_available': ['battenberg', 'dbnsfp']}], 'install_uniref': False}'): Human (GRCh37)
Running GGD recipe: srnaseq
Traceback (most recent call last):
  File "/opt/bcbio/bin/bcbio_nextgen.py", line 4, in <module>
    __import__('pkg_resources').run_script('bcbio-nextgen==0.9.6', 'bcbio_nextgen.py')
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/setuptools-20.2.2-py2.7.egg/pkg_resources/__init__.py", line 726, in run_script
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/setuptools-20.2.2-py2.7.egg/pkg_resources/__init__.py", line 1484, in run_script
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.9.6-py2.7.egg-info/scripts/bcbio_nextgen.py", line 207, in <module>
    install.upgrade_bcbio(kwargs["args"])
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/bcbio/install.py", line 89, in upgrade_bcbio
    upgrade_bcbio_data(args, REMOTES)
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/bcbio/install.py", line 257, in upgrade_bcbio_data
    cbl_deploy.deploy(s)
  File "/opt/tmpbcbio-install/cloudbiolinux/cloudbio/deploy/__init__.py", line 65, in deploy
    _setup_vm(options, vm_launcher, actions)
  File "/opt/tmpbcbio-install/cloudbiolinux/cloudbio/deploy/__init__.py", line 110, in _setup_vm
    configure_instance(options, actions)
  File "/opt/tmpbcbio-install/cloudbiolinux/cloudbio/deploy/__init__.py", line 268, in configure_instance
    setup_biodata(options)
  File "/opt/tmpbcbio-install/cloudbiolinux/cloudbio/deploy/__init__.py", line 250, in setup_biodata
    install_proc(options["genomes"], ["ggd", "s3", "raw"])
  File "/opt/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 345, in install_data
    _prep_genomes(env, genomes, genome_indexes, ready_approaches)
  File "/opt/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 474, in _prep_genomes
    retrieve_fn(env, manager, gid, idx)
  File "/opt/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 796, in _install_with_ggd
    ggd.install_recipe(env.cwd, recipe_file)
  File "/opt/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 31, in install_recipe
    _move_files(tmpdir, base_dir, recipe["recipe"]["full"]["recipe_outfiles"])
  File "/opt/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 74, in _move_files
    (out_file, tmp_dir))
AssertionError: Did not find expected output file srnaseq/Summary_Counts.all_predictions.txt in /opt/bcbio/genomes/Hsapiens/GRCh37/txtmp
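
For readers following the traceback: the final assertion fires because a file the recipe lists as an expected output is absent from the temporary directory. The sketch below illustrates that kind of check in isolation. It is not the cloudbiolinux code, and `find_missing_outputs` and its arguments are hypothetical names chosen for this example.

```python
import os


def find_missing_outputs(tmp_dir, expected_outfiles):
    """Return the expected relative paths that do not exist under tmp_dir."""
    return [
        rel_path
        for rel_path in expected_outfiles
        if not os.path.exists(os.path.join(tmp_dir, rel_path))
    ]
```

Called with the txtmp directory and `["srnaseq/Summary_Counts.all_predictions.txt"]`, it would return that path whenever the file is absent, which is the situation the AssertionError reports.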

lpantano commented 8 years ago

Hi @jefarrar

Sorry about the issue. I am trying to replicate it. Meanwhile, can you look inside /opt/bcbio/genomes/Hsapiens/GRCh37/txtmp and see if there is any file with a name similar to that one? Maybe there was a problem downloading the file.

I'll let you know if I can reproduce the problem.

jefarrar commented 8 years ago

you guys are fast!

There is a summary counts zipped file with a similar name, but it (along with another .zip archive) looks empty:

ls -l /opt/bcbio/genomes/Hsapiens/hg19/txtmp
total 8
-rw-rw-r-- 1 xxx xxx 2711 Mar 10 19:19 ggd-run.sh
drwxrwxr-x 2 xxx xxx 4096 Mar 10 19:19 srnaseq

ls -l /opt/bcbio/genomes/Hsapiens/hg19/txtmp/srnaseq
total 150876
-rw-rw-r-- 1 xxx xxx 1207144 Mar 10 19:16 hairpin.fa.gz
-rw-rw-r-- 1 xxx xxx 519390 Mar 10 19:15 hsa.gff3
-rw-rw-r-- 1 xxx xxx 590888 Mar 10 19:16 mature.fa.gz
-rw-rw-r-- 1 xxx xxx 0 Mar 10 19:16 miR_Family_Info.txt.zip
-rw-rw-rw- 1 xxx xxx 363739 Jun 25 2014 mirna_mature.txt.gz
-rw-rw-r-- 1 xxx xxx 2640341 Mar 10 19:16 miRNA.str.gz
-rw-rw-r-- 1 xxx xxx 5541832 Mar 6 08:57 refGene.txt.gz
-rw-rw-r-- 1 xxx xxx 143401637 Apr 27 2009 rmsk.txt.gz
-rw-rw-r-- 1 xxx xxx 0 Mar 10 19:16 Summary_Counts.txt.zip
-rw-rw-r-- 1 xxx xxx 16473 Dec 21 06:01 tRNAs.txt.gz
-rw-rw-r-- 1 xxx xxx 19933 Oct 3 2010 wgRna.txt.gz
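
The zero-byte archives in the listing above can also be spotted programmatically. This is a minimal sketch using only the Python standard library; `empty_files` is a hypothetical helper name, not part of bcbio or cloudbiolinux.

```python
from pathlib import Path


def empty_files(directory):
    """Return sorted names of zero-byte regular files directly inside directory."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.stat().st_size == 0
    )
```

Run against the srnaseq folder shown above, it would be expected to report the two .zip archives whose size column is 0.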

This is on a new CentOS install. I upgraded data on an established Ubuntu install a few minutes ago, and this doesn't seem to be a problem there:

Running GGD recipe: srnaseq
Running GGD recipe: prioritize
--2016-03-10 18:24:58-- https://s3.amazonaws.com/biodata/coverage/prioritize/prioritize-cancer-hg19-20160215.tar.gz

lpantano commented 8 years ago

It seems the connection dropped. I would remove that zip file, restart, and cross fingers :) It is not failing for me on my computer.

Let me know if restarting helps.

jefarrar commented 8 years ago

Pretty sure that transient connection issues weren't my problem; I've been wrestling with this for the past couple of days.

My first thought after finding the empty zip archives was that this might be a proxy issue in front of the new install. But I was able to manually download the two .zip targets from targetscan.org from behind the proxy without issues. However, the issue persisted even when I manually placed and unpacked these in the txtmp/srnaseq folder. In any event, I think I've managed to get around it by copying newly updated srnaseq folders in from another system.
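
When placing archives by hand as described above, it can help to confirm that each one is a readable zip before unpacking it. The following is a minimal sketch with the standard `zipfile` module; `unpack_if_valid` is a hypothetical name, and nothing here reflects how the GGD recipe itself unpacks files.

```python
import zipfile


def unpack_if_valid(zip_path, dest_dir):
    """Extract zip_path into dest_dir; return member names, or None if unusable."""
    if not zipfile.is_zipfile(zip_path):
        return None  # empty, truncated, or not a zip archive at all
    with zipfile.ZipFile(zip_path) as archive:
        if archive.testzip() is not None:
            return None  # at least one member failed its integrity check
        archive.extractall(dest_dir)
        return archive.namelist()
```

A zero-byte file such as the Summary_Counts.txt.zip in the listing would make this return None instead of silently producing no output.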

thanks!

kyzhao commented 7 years ago

It seems the problem is still there for GGD srnaseq. The invalid targetscan version 70 data is causing it. I manually downloaded the version 71 file Summary_Counts.all_predictions.txt into the srnaseq folder, and then it was OK. I'm not sure which website hosts the ggd.sh code, but updating that code should fix the issue.

chapmanb commented 7 years ago

Thanks for the report and sorry about the download issues. Could you provide the error message you're seeing? I ran an update and the srnaseq recipe worked cleanly for me. Manually checking the targetscan files, they do seem to be present:

wget http://www.targetscan.org/vert_70/vert_70_data_download/Summary_Counts.all_predictions.txt.zip

Is it possible this was a transient error and re-running fixes the issue?
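
If the failures are transient, a retry wrapper that also rejects empty downloads is one way to work around them outside of bcbio. The sketch below is an assumption-laden illustration with the standard library only; `retry` and `fetch_nonempty` are hypothetical helpers, and the attempt count and delay are arbitrary.

```python
import os
import time
import urllib.request


def retry(action, attempts=3, delay_seconds=0.0):
    """Call action() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:  # urllib's URLError is a subclass of OSError
            if attempt == attempts:
                raise
            time.sleep(delay_seconds * attempt)  # wait a little longer each time


def fetch_nonempty(url, dest):
    """Download url to dest, treating a zero-byte result as a failure."""
    urllib.request.urlretrieve(url, dest)
    if os.path.getsize(dest) == 0:
        raise OSError("empty download: %s" % dest)
    return dest
```

For example, `retry(lambda: fetch_nonempty(url, "Summary_Counts.all_predictions.txt.zip"), attempts=5, delay_seconds=30)` with `url` set to the targetscan address in the wget command above.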

I'll wait to update to targetscan 71 until @lpantano has a chance to validate those files work correctly with the pipeline.

Thanks for helping to debug.

kyzhao commented 7 years ago

Thanks for the fast response, Brad!

I got stuck at the GGD step several times during installation: subprocess.CalledProcessError: Command '['bash', '~/local/bcbio/share/bcbio-nextgen/genomes/Hsapiens/GRCh37/txtmp/ggd-run.sh']' returned non-zero exit status 2, with an empty Summary_Counts.all_predictions.txt. It seems to be a web stability issue. I tried it again just now and version 70 worked as well, though painfully slowly.

This big file, Summary_Counts.all_predictions.txt from targetscan, seems to cause network issues for many people. (Probably some problem with the targetscan host website, I would say.)

Following the installation guide, I was using "bcbio_nextgen.py upgrade --tooldir=~/local/bcbio --isolate --genomes GRCh37 --aligners bwa --data" to continue the installation whenever it quit during a large data file download due to network issues.

Is it possible to separate tool installation and data installation into two clean parts, and to provide an easier way to update genomic data for each major component? (Maybe there is already an easy way to install data only, and I just don't know it.)
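
For context on the error quoted above: `subprocess.CalledProcessError` is what Python's `subprocess.check_call` raises when the command it runs exits with a non-zero status. A minimal sketch of catching it and surfacing the exit status follows; `run_and_report` is a hypothetical name and this is not how bcbio invokes ggd-run.sh.

```python
import subprocess


def run_and_report(command):
    """Run command; return 0 on success or the non-zero exit status on failure."""
    try:
        subprocess.check_call(command)
        return 0
    except subprocess.CalledProcessError as error:
        print("command failed with exit status %d: %s" % (error.returncode, error.cmd))
        return error.returncode
```

The returned status only says that the script failed; the cause (here, apparently an empty download) still has to be read from the script's own output.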

Thanks!

chapmanb commented 7 years ago

Thanks for the feedback, I'm glad to hear that it worked in the end. I agree that the download here is pretty slow due to server speed. We'll look at caching a version in S3 to avoid this and any downtime at targetscan or mirbase.

For upgrades, you can definitely do tools and data separately. You'd want to leave the `--tooldir` and `--isolate` arguments out of your command, and then it will skip right to the data installation. In general neither argument should be needed after a successful install since bcbio caches them for future use. You can then upgrade just tools with `--tools` or only data with `--data`. Hope this helps.
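
To make the split concrete, here is a small sketch that assembles the two kinds of upgrade command from the flags mentioned in this thread (`--tools`, `--data`, `--genomes`, `--aligners`). `upgrade_command` is a hypothetical helper; it only builds an argument list and does not cover any other options the real command may accept.

```python
def upgrade_command(tools=False, data=False, genomes=(), aligners=()):
    """Build an argv list for `bcbio_nextgen.py upgrade` from flags quoted in this thread."""
    command = ["bcbio_nextgen.py", "upgrade"]
    if tools:
        command.append("--tools")
    if data:
        command.append("--data")
    for genome in genomes:
        command += ["--genomes", genome]
    for aligner in aligners:
        command += ["--aligners", aligner]
    return command
```

A data-only upgrade matching the earlier command, minus `--tooldir` and `--isolate`, would be `upgrade_command(data=True, genomes=["GRCh37"], aligners=["bwa"])`, and a tools-only one `upgrade_command(tools=True)`.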

lpantano commented 7 years ago

Sorry about this. I updated to the new version, and I will work on creating a tar file in S3 to avoid this problem in the future, as @chapmanb suggested.

Cheers

kyzhao commented 7 years ago

Thanks a lot for the advice and fast update!

bcbio / bcbio-nextgen

Data installation fails at GGD srnaseq #1267