CAMI-challenge / CAMISIM

CAMISIM: Simulating metagenomes and microbial communities
https://data.cami-challenge.org/participate
Apache License 2.0

Issues running CAMISIM #44

Closed: ParkvilleData closed this issue 5 years ago

ParkvilleData commented 5 years ago

Hi, I have followed the updated documentation, but I am having trouble running CAMISIM.

Running the Docker command says the mini.biom file is not available. I tried entering the Docker container using bash, but that also gives me an error. I also tried running "python metagenomesimulation.py configuration/metagenome_simulation", but it's unclear where that configuration is. I thought it might be the config file, which I have edited to suit, but that doesn't work. Would I be able to get some help getting this to run, please?

bshaban@6300d-111439-l:~/camisim$ sudo docker run -it -v "/path/to/input/directory:/input:rw" -v "/path/to/output/directory:/output:rw" cami/camisim:latest metagenome_from_profile.py -p /input/mini.biom -o /output
NCBI database not present yet (first time used?)
Downloading taxdump.tar.gz from NCBI FTP site...
Done. Parsing...
Loading node names...
2044492 names loaded.
249789 synonyms loaded.
Loading nodes...
2044492 nodes loaded.
Linking nodes...
Tree is loaded.
Updating database: /root/.etetoolkit/taxa.sqlite ...
2044000 generating entries...
Uploading to /root/.etetoolkit/taxa.sqlite

Inserting synonyms:      245000
Inserting taxid merges:  50000
Inserting taxids:       2040000
2019-01-16 23:37:51 WARNING: [root] Max strains per OTU not set, using default (3)
2019-01-16 23:37:51 WARNING: [root] Mu and sigma have not been set, using defaults (1,2)
Traceback (most recent call last):
  File "metagenome_from_profile.py", line 87, in <module>
    config = GG.generate_input(args) # total number of genomes and path to updated config
  File "/usr/local/bin/scripts/get_genomes.py", line 283, in generate_input
    tax_profile = read_taxonomic_profile(args.profile, config, args.samples)
  File "/usr/local/bin/scripts/get_genomes.py", line 26, in read_taxonomic_profile
    table = biom.load_table(biom_profile)
  File "/usr/local/lib/python2.7/dist-packages/biom/parse.py", line 652, in load_table
    with biom_open(f) as fp:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/biom/util.py", line 443, in biom_open
    if os.path.getsize(fp) == 0:
  File "/usr/lib/python2.7/genericpath.py", line 57, in getsize
    return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/input/mini.biom'

AlphaSquad commented 5 years ago

Hi, thank you for your comment. The Docker container is not extensively tested, so there might still be some bugs. In this case you are missing a biom file, which you would have to put in the correct path; the "mini.biom" is just an example and is not provided along with CAMISIM. Additionally, you have to change the path you are mounting.

"/path/to/input/directory:/input:rw"

mounts the path before the : to the path after it within the Docker container, and your input is probably not located at /path/to/input/directory locally. I would kindly advise checking the Docker manual for details on how to run Docker containers and mount volumes.
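As a minimal sketch (assuming your biom profile lives in ~/camisim_input on the host and you want the results in ~/camisim_output; both paths and my_profile.biom are placeholders you need to adapt):

```bash
# Mount the host directory that actually contains your biom file as /input
# and an existing host directory for the results as /output.
sudo docker run -it \
  -v "$HOME/camisim_input:/input:rw" \
  -v "$HOME/camisim_output:/output:rw" \
  cami/camisim:latest \
  metagenome_from_profile.py -p /input/my_profile.biom -o /output
```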

When running CAMISIM directly, outside of a Docker container, you will also need a biom profile if you want to run ./metagenome_from_profile.py. If you don't have a profile, or you want to run CAMISIM without one (using ./metagenomesimulation.py), you will need a set of genomes and their taxonomic classification, and you need to set the corresponding files in your config file. The default configuration is not stored under

configuration/metagenome_simulation

but under

defaults/default_config.ini

which you need to adapt to your needs; a rough sketch of that workflow is given below. For details on how to do this, please refer to the wiki.
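Roughly (my_config.ini, my_metadata.tsv and my_genome_to_id.tsv are placeholder names; please double-check the key names against the comments in defaults/default_config.ini):

```bash
# Work on a copy so the shipped defaults stay untouched
cp defaults/default_config.ini my_config.ini

# In my_config.ini, point (at least) these keys at your own files:
#   metadata=/path/to/my_metadata.tsv
#   id_to_genome_file=/path/to/my_genome_to_id.tsv
#   ncbi_taxdump=tools/ncbi-taxonomy_20170222.tar.gz

# Then run the de novo simulation with that config
python metagenomesimulation.py my_config.ini
```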

ParkvilleData commented 5 years ago

Thanks for the reply!

Yep, I thought as much. That's what I initially did, but the pipeline broke down much earlier than when I used your generic command. When I ran with my input locations, the pipeline broke before it even downloaded the taxonomic files from NCBI.

I'll run it again and post what I did to get it to work.

Thanks!

ParkvilleData commented 5 years ago

OK, I have double-checked.

I am using the following command.

python metagenomesimulation.py default_config.ini

In the config, the following genome classification files are being used.

metadata=documentation/CAMI2015_metadata_final.tsv
id_to_genome_file=documentation/CAMI2015_paths.tsv
ncbi_taxdump=tools/ncbi-taxonomy_20170222.tar.gz

A local version of samtools is set. All dependencies are installed.

The mode parameter wasn't set, so I set it to replicates.

I get the following error:

bshaban@6300d-111439-l:~/camisim$ python metagenomesimulation.py default_config.ini
Traceback (most recent call last):
  File "metagenomesimulation.py", line 14, in <module>
    from scripts.argumenthandler import ArgumentHandler
  File "/home/unimelb.edu.au/bshaban/camisim/scripts/argumenthandler.py", line 9, in <module>
    import numpy.random as np_random
  File "/home/unimelb.edu.au/bshaban/.local/lib/python2.7/site-packages/numpy/__init__.py", line 142, in <module>
    from . import core
  File "/home/unimelb.edu.au/bshaban/.local/lib/python2.7/site-packages/numpy/core/__init__.py", line 59, in <module>
    from . import numeric
  File "/home/unimelb.edu.au/bshaban/.local/lib/python2.7/site-packages/numpy/core/numeric.py", line 3093, in <module>
    from . import fromnumeric
  File "/home/unimelb.edu.au/bshaban/.local/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 17, in <module>
    from . import _methods
  File "/home/unimelb.edu.au/bshaban/.local/lib/python2.7/site-packages/numpy/core/_methods.py", line 158, in <module>
    _NDARRAY_ARRAY_FUNCTION = mu.ndarray.__array_function__
AttributeError: type object 'numpy.ndarray' has no attribute '__array_function__'

This is the same error I was receiving yesterday. I've double-checked the config file and it seems to have everything it needs. I've also checked the dependencies and they're installed.

I am using Python 2.7.15 and the minimum is Python 2.7.10, correct? It seems to be a Python error (I'm not well versed in Python); should I use a newer version? Maybe Python 3?

bshaban@6300d-111439-l:~/camisim$ python
Python 2.7.15rc1 (default, Nov 12 2018, 14:31:15)

Thank you very much for your help, Bobbie.

AlphaSquad commented 5 years ago

The Python version should not be a problem; to me it looks more like it could be your numpy version. The pipeline progresses further in the Docker container because all of the dependencies are automatically installed there. Could you please check your numpy version?

$ python
Python 2.7.12 |Continuum Analytics, Inc.| (default, Jul 2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import numpy
>>> numpy.version.version
'1.13.0'

As you can see, my Python version here is 2.7.12 and it is running. If you have numpy < 1.13.0, make sure to install the dependencies, e.g. using

pip install -r requirements.txt
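If the requirements file does not give you a suitable numpy, you can also check and pin the version by hand (1.13.0 here is just the version shown in my session above, not a hard requirement):

```bash
# Show the currently installed numpy version
python -c "import numpy; print(numpy.version.version)"
# Install the version shown in the session above, if needed
pip install "numpy==1.13.0"
```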

In addition, I assume you probably do not have the genomes from the CAMI 1 challenge downloaded? If you don't, then the paths in the CAMI2015_paths.tsv file do not point to anything. After you have made sure that you have a correct version of numpy, could you post the output of

./metagenome_from_profile.py -p defaults/mini.biom -o out

here?

ParkvilleData commented 5 years ago

Hi, thanks for the reply!

I had numpy 1.16, which resulted in the error I posted originally. I removed numpy and used

"pip install -r requirements.txt"

to install numpy 1.13.0.

This ended up going further through the process.

The error I received this time was:

bshaban@6300d-111439-l:~/camisim$ python ./metagenome_from_profile.py -p defaults/mini.biom -o out
NCBI database not present yet (first time used?)
Downloading taxdump.tar.gz from NCBI FTP site...
Done. Parsing...
Loading node names...
2045235 names loaded.
249975 synonyms loaded.
Loading nodes...
2045235 nodes loaded.
Linking nodes...
Tree is loaded.
Updating database: /dir/.etetoolkit/taxa.sqlite ...
2045000 generating entries...
Uploading to /dir/.etetoolkit/taxa.sqlite

Inserting synonyms:      245000
Inserting taxid merges:  50000
Inserting taxids:       2045000
2019-01-21 13:37:15 WARNING: [root] Max strains per OTU not set, using default (3)
2019-01-21 13:37:15 WARNING: [root] Mu and sigma have not been set, using defaults (1,2)
2019-01-21 13:37:16 WARNING: [root] Some OTUs could not be mapped
ERROR: <type 'NoneType'>

Indeed, that was the case. I added the locations of the genomes to the genome_path.csv and received the same "ERROR: <type 'NoneType'>" as in the metagenome_from_profile.py output.

I also tried the same command with a biom file I created with QIIME, and that gave the following error:

bshaban@6300d-111439-l:~/camisim$ python ./metagenome_from_profile.py -p Rcombined.noempty.all.4.40.2013_08_greengenes_97_otus.with_euks_L6.biom -o out
2019-01-21 13:55:05 WARNING: [root] Max strains per OTU not set, using default (3)
2019-01-21 13:55:05 WARNING: [root] Mu and sigma have not been set, using defaults (1,2)
Traceback (most recent call last):
  File "./metagenome_from_profile.py", line 87, in <module>
    config = GG.generate_input(args) # total number of genomes and path to updated config
  File "/home/unimelb.edu.au/bshaban/camisim/scripts/get_genomes.py", line 283, in generate_input
    tax_profile = read_taxonomic_profile(args.profile, config, args.samples)
  File "/home/unimelb.edu.au/bshaban/camisim/scripts/get_genomes.py", line 42, in read_taxonomic_profile
    lineage = table.metadata(otu,axis="observation")["taxonomy"]
TypeError: 'NoneType' object has no attribute '__getitem__'
Exception AttributeError: "'NoneType' object has no attribute '_map_logfile_handler'" in <bound method LoggingWrapper.__del__ of <scripts.loggingwrapper.LoggingWrapper object at 0x7f670eb46090>> ignored

Thank you very much for your patience and help, it is much appreciated.

AlphaSquad commented 5 years ago

It seems that your own QIIME profile does not have a taxonomy attached to it; without that CAMISIM will unfortunately not be able to infer a metagenome profile.
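If you want to verify that yourself, here is a minimal sketch using the biom-format Python package (the file name is a placeholder; this is just a quick check outside of CAMISIM):

```python
# Check whether every observation (OTU) in the biom file carries a
# "taxonomy" entry in its metadata; CAMISIM reads exactly that field.
import biom

table = biom.load_table("my_profile.biom")  # placeholder path
for otu in table.ids(axis="observation"):
    md = table.metadata(otu, axis="observation")
    if md is None or "taxonomy" not in md:
        print("No taxonomy metadata for OTU %s" % otu)
```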

The other error is... odd. Could you retry and post the output of the same command with the --debug flag? This should yield the exact code position where the error occurs.

ParkvilleData commented 5 years ago

Hi, thanks for that. Here it is.

bshaban@6300d-111439-l:~/camisim$ ./metagenome_from_profile.py -p defaults/mini.biom -o out --debug
2019-01-22 10:44:09 INFO: [root] Using commands:
2019-01-22 10:44:09 INFO: [root] -profile: defaults/mini.biom
2019-01-22 10:44:09 INFO: [root] -tmp: None
2019-01-22 10:44:09 INFO: [root] -ncbi: tools/ncbi-taxonomy_20170222.tar.gz
2019-01-22 10:44:09 INFO: [root] -reference_genomes: tools/assembly_summary_complete_genomes.txt
2019-01-22 10:44:09 INFO: [root] -o: out
2019-01-22 10:44:09 INFO: [root] -no_replace: True
2019-01-22 10:44:09 INFO: [root] -seed: None
2019-01-22 10:44:09 INFO: [root] -additional_references: None
2019-01-22 10:44:09 INFO: [root] -samples: None
2019-01-22 10:44:09 INFO: [root] -debug: True
2019-01-22 10:44:09 INFO: [root] -config: defaults/default_config.ini
2019-01-22 10:44:09 WARNING: [root] Max strains per OTU not set, using default (3)
2019-01-22 10:44:09 WARNING: [root] Mu and sigma have not been set, using defaults (1,2)
2019-01-22 10:44:09 WARNING: [root] Some OTUs could not be mapped
2019-01-22 10:44:09 WARNING: [root] Rank order of OTU Genome3 too high, no matching genomes found
2019-01-22 10:44:09 WARNING: [root] Full lineage was [91347, 1236, 1224, 2], mapped from BIOM lineage [u'k__Bacteria', u'p__Proteobacteria', u'c__Gammaproteobacteria', u'o__Enterobacterales']
2019-01-22 10:44:09 INFO: [root] Downloading 4 genomes
ERROR: <type 'NoneType'>

I seem to have now got the Docker example running. I will see if I can run metagenomesimulation.py through Docker as well and post an update.

Thanks!

ParkvilleData commented 5 years ago

The Docker metagenome_from_profile command worked, and I will run it again soon. What I really need, though, is to run against a set of genomes, i.e. de novo.

The manual says there are three files needed to run the community design de novo, but I don't see much difference between files one and two. One of the files contains the genome paths and the second contains the metadata; what does the third contain? Would it be possible to put links to examples of all three in the documentation?

I run metagenomesimulation.py with my default_config.ini, which contains the appropriate paths. When I run the command metagenomesimulation.py default_config.ini, I get the same "ERROR: <type 'NoneType'>" error. I have tried running this with the debug parameter, but it doesn't give any further information.

Thanks for the help,

AlphaSquad commented 5 years ago

Peculiar. Could you try these two small things?

  1. Creating the out/ folder before running the command (if it is not present)
  2. If the folder is present, checking whether any genomes were downloaded into it

If the out folder is not present, that might explain why the Docker run works, since it automatically mounts the out/ folder. I have sometimes encountered the "ERROR: <type 'NoneType'>" at the end of the pipeline, and it didn't cause any damage. The File Formats page explains the two required files, metadata and genome_to_id; a rough sketch of both is below. I am not sure which third file you are referring to; all files other than the two aforementioned ones are optional. You will need to have a set of genomes downloaded to run CAMISIM de novo, though.
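Roughly, the two files look like this (from memory, so please double-check the exact column names against the File Formats page; genome IDs, paths and taxon IDs are placeholders):

```text
# genome_to_id.tsv -- tab separated, no header: genome identifier and path to its fasta file
Genome1	/path/to/genomes/Genome1.fa
Genome2	/path/to/genomes/Genome2.fa

# metadata.tsv -- tab separated, with header: genome identifier, OTU, NCBI taxon ID, novelty category
genome_ID	OTU	NCBI_ID	novelty_category
Genome1	1	562	known_strain
Genome2	2	1280	known_species
```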

ParkvilleData commented 5 years ago

Hi,

Re: the three files. The manual says the following:

The de novo community design needs three files to run:

A file containing, tab separated, a genome identifier and the path to the file of the genome.

A file containing, tab separated, a genome identifier and the path to the gene annotation of the genome. This one is used in case strains are simulated based on a genome.

A meta data file that contains, tab separated and with a header, genome identifier, novelty categorization, OTU assignment and a taxonomic classification.

Is there a third file that needs to be linked to a gff annotation for the genome?

This is where I am confused about the three files. I had a fasta file with the genomes I wanted to run the de novo analysis on. I split the fasta into separate fasta files, each containing one genome (roughly as sketched below), and updated the genome map, default_config.ini and metadata files accordingly, so I'm not really sure what I am doing wrong.
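Just to illustrate the splitting step (a sketch of the idea with placeholder file names, assuming each genome is a single fasta record):

```python
# Split a multi-fasta into one file per record so each genome gets its own .fa
import os
from Bio import SeqIO  # requires Biopython

if not os.path.isdir("genomes"):
    os.mkdir("genomes")

for record in SeqIO.parse("all_genomes.fasta", "fasta"):
    with open(os.path.join("genomes", record.id + ".fa"), "w") as out_handle:
        SeqIO.write(record, out_handle, "fasta")
```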

With points 1 & 2. The folder was present, I renamed it and then created a new out folder and ran the command again. The output of the out folder is as follows.

-rw-rw-r-- 1 bshaban bshaban   42 Jan 23 10:58 abundance0.tsv
-rw-rw-r-- 1 bshaban bshaban   65 Jan 23 10:58 abundance1.tsv
-rw-rw-r-- 1 bshaban bshaban  888 Jan 23 10:58 config.ini
-rw-rw-r-- 1 bshaban bshaban  157 Jan 23 10:58 genome_to_id.tsv
-rw-rw-r-- 1 bshaban bshaban  135 Jan 23 10:58 metadata.tsv

genomes:
total 15M
drwxrwxr-x 2 bshaban bshaban 4.0K Jan 23 10:58 .
drwxrwxr-x 3 bshaban bshaban 4.0K Jan 23 10:58 ..
-rw-rw-r-- 1 bshaban bshaban 5.1M Jan 23 10:58 GCA_000210475.1_ASM21047v1.fa
-rw-rw-r-- 1 bshaban bshaban 4.4M Jan 23 10:58 GCA_000800765.1_ASM80076v1.fa
-rw-rw-r-- 1 bshaban bshaban 4.8M Jan 23 10:58 GCA_001051135.1_ASM105113v1.fa
AlphaSquad commented 5 years ago

Hi, you do not need the second file, which would be the gff annotation of the genome. These files just prevent our genome evolver from introducing changes within predicted genes of your provided genomes. Only if you want to simulate your own strains, and these strains should not have evolved sequences within genes, does this file need to be provided. For starters you will only need the first and third file you described above. Note that the out folder has to be empty (I should state that somewhere in the manual), but since you re-created it, that shouldn't be a problem. So CAMISIM downloaded the genomes and created all needed files, but then crashed without comment? That has not occurred on any of the machines I have run CAMISIM on before. Could you please send me again:

  1. The exact output of CAMISIM with the --debug flag and an empty out/ folder at the start of the run
  2. Your system specs and OS + version
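Regarding the empty out folder mentioned above, something like this before the run should do (only if you do not need anything that is currently in out/, of course):

```bash
rm -rf out && mkdir out
```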
ParkvilleData commented 5 years ago

Hi,

CAMISIM seems to download the genomes; I haven't checked whether they're complete, but they look to be of a reasonable file size. The output is below.

bshaban@6300d-111439-l:~/camisim$ python ./metagenome_from_profile.py -p defaults/mini.biom -o out --debug
2019-01-24 09:40:47 INFO: [root] Using commands:
2019-01-24 09:40:47 INFO: [root] -profile: defaults/mini.biom
2019-01-24 09:40:47 INFO: [root] -tmp: None
2019-01-24 09:40:47 INFO: [root] -ncbi: tools/ncbi-taxonomy_20170222.tar.gz
2019-01-24 09:40:47 INFO: [root] -reference_genomes: tools/assembly_summary_complete_genomes.txt
2019-01-24 09:40:47 INFO: [root] -o: out
2019-01-24 09:40:47 INFO: [root] -no_replace: True
2019-01-24 09:40:47 INFO: [root] -seed: None
2019-01-24 09:40:47 INFO: [root] -additional_references: None
2019-01-24 09:40:47 INFO: [root] -samples: None
2019-01-24 09:40:47 INFO: [root] -debug: True
2019-01-24 09:40:47 INFO: [root] -config: defaults/default_config.ini
2019-01-24 09:40:47 WARNING: [root] Max strains per OTU not set, using default (3)
2019-01-24 09:40:47 WARNING: [root] Mu and sigma have not been set, using defaults (1,2)
2019-01-24 09:40:48 WARNING: [root] Some OTUs could not be mapped
2019-01-24 09:40:48 WARNING: [root] Rank order of OTU Genome3 too high, no matching genomes found
2019-01-24 09:40:48 WARNING: [root] Full lineage was [91347, 1236, 1224, 2], mapped from BIOM lineage [u'k__Bacteria', u'p__Proteobacteria', u'c__Gammaproteobacteria', u'o__Enterobacterales']
2019-01-24 09:40:48 INFO: [root] Downloading 3 genomes
ERROR: <type 'NoneType'>

I am using Ubuntu

bshaban@6300d-111439-l:~/camisim$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic

System specs

          description: System memory
          physical id: 0
          size: 23GiB
     *-cpu
          product: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
          vendor: Intel Corp.
          physical id: 1
          bus info: cpu@0
          size: 3721MHz
          capacity: 3900MHz
AlphaSquad commented 5 years ago

Hm. Could you go into default_config.ini and reduce the (sample) size to maybe 0.1 Gbp and see if that changes anything (a sketch of the relevant setting follows the log below)? Maybe you do not have enough RAM or hard drive space, even though CAMISIM should report that. It works fine on my Ubuntu system:

$ ./metagenome_from_profile.py -p defaults/mini.biom -o out --debug
2019-01-24 10:14:24 INFO: [root] Using commands:
2019-01-24 10:14:24 INFO: [root] -profile: defaults/mini.biom
2019-01-24 10:14:24 INFO: [root] -tmp: None
2019-01-24 10:14:24 INFO: [root] -ncbi: tools/ncbi-taxonomy_20170222.tar.gz
2019-01-24 10:14:24 INFO: [root] -reference_genomes: tools/assembly_summary_complete_genomes.txt
2019-01-24 10:14:24 INFO: [root] -o: out
2019-01-24 10:14:24 INFO: [root] -no_replace: True
2019-01-24 10:14:24 INFO: [root] -seed: None
2019-01-24 10:14:24 INFO: [root] -additional_references: None
2019-01-24 10:14:24 INFO: [root] -samples: None
2019-01-24 10:14:24 INFO: [root] -debug: True
2019-01-24 10:14:24 INFO: [root] -config: defaults/default_config.ini
2019-01-24 10:14:24 WARNING: [root] Max strains per OTU not set, using default (3)
2019-01-24 10:14:24 WARNING: [root] Mu and sigma have not been set, using defaults (1,2)
2019-01-24 10:14:24 WARNING: [root] Some OTUs could not be mapped
2019-01-24 10:14:24 WARNING: [root] Rank order of OTU Genome3 too high, no matching genomes found
2019-01-24 10:14:24 WARNING: [root] Full lineage was [91347, 1236, 1224, 2], mapped from BIOM lineage [u'k__Bacteria', u'p__Proteobacteria', u'c__Gammaproteobacteria', u'o__Enterobacterales']
2019-01-24 10:14:24 INFO: [root] Downloading 3 genomes
2019-01-24 10:14:33 INFO: [MetagenomeSimulationPipeline] Metagenome simulation starting
[...]
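For reference, the size change I mean is a one-line edit in default_config.ini; a sketch (I believe the key is simply called size and the value is the sample size in Gbp, but please check the comments in your copy of the config):

```ini
# in default_config.ini: sample size in Gbp
size=0.1
```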

Were you able to run the docker container by now?

ParkvilleData commented 5 years ago

Hi,

Yes, the Docker container worked and produced results after completing. Running with the sample size set to 0.1 still gives the same error. I have a 24 GB machine and have just got another one with 32 GB that I can try it on. Why would the Docker container work if I don't have enough RAM? I don't think it's that anyway; this is the output I get from using time -v:

Command being timed: "python ./metagenome_from_profile.py -p defaults/mini.biom -o out --debug"
        User time (seconds): 1.54
        System time (seconds): 0.87
        Percent of CPU this job got: 24%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.96
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 137268
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 100007
        Voluntary context switches: 824
        Involuntary context switches: 82
        Swaps: 0
        File system inputs: 0
        File system outputs: 72600
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

I've played around with the permissions and ran with sudo but that doesn't seem to do anything either. I'll keep looking into this today. Thanks for your help.

AlphaSquad commented 5 years ago

Yeah, you are right, it would not work in Docker either if it were a problem with RAM. At this point I am a little bit at a loss. I will try to build a Docker image with your system/software specs to see if I can reproduce this. The output from time -v is also strange, since it reports exit status 0, which to my knowledge means that Python terminated regularly, without an exception.

ParkvilleData commented 5 years ago

Hi, thanks for that, it's much appreciated! Would it be possible to add a bash entry point to the Docker container, so I can enter it and see an environment where everything is in the right place? That would be very helpful; something along the lines of the sketch below is what I mean.
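For example (assuming bash is available in the image and the entrypoint can be overridden, which I have not verified for this image):

```bash
# Drop into a shell inside the CAMISIM image instead of running the pipeline
sudo docker run -it --entrypoint /bin/bash cami/camisim:latest
```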

Thanks again for all your help.