CAMI-challenge / CAMISIM

CAMISIM: Simulating metagenomes and microbial communities
https://data.cami-challenge.org/participate
Apache License 2.0

Errors when running from_profile default #178

Closed CassandraHjo closed 10 months ago

CassandraHjo commented 11 months ago

I tried to run the default from_profile simulation using this command

python metagenome_from_profile.py -p defaults/mini.biom

but after a very short time of running I get this output:

2023-12-08 16:00:26 WARNING: [root] 1921510 taxid not found
2023-12-08 16:00:26 WARNING: [root] 1921552 taxid not found
2023-12-08 16:00:26 WARNING: [root] 1926494 taxid not found
2023-12-08 16:00:26 WARNING: [root] 1933220 taxid not found
2023-12-08 16:00:26 WARNING: [root] 1936080 taxid not found
2023-12-08 16:00:26 WARNING: [root] 1936081 taxid not found
2023-12-08 16:00:26 WARNING: [root] Some OTUs could not be mapped
2023-12-08 16:00:41 ERROR: [Community] Invalid digit, must be bigger than 1, but was 0
2023-12-08 16:00:41 ERROR: [MetagenomeSimulationPipeline] [community0] Has an invalid value!
2023-12-08 16:00:42 ERROR: [Validator 60814336567] Insufficient space! 12.49gb of 30.00gb available at '/tmp'
ERROR: name 'raw_input' is not defined

There are multiple errors here.

AlphaSquad commented 11 months ago

By default, metagenome_from_profile.py uses defaults/default_config.ini; you can change this with the -c option. My guess is that you edited default_config.ini, which is why you are seeing these errors. If you run metagenome_from_profile.py without parameters, it prints a help screen listing all parameters and their default values.

CassandraHjo commented 11 months ago

I tried to run it again with this command:

python metagenome_from_profile.py -p defaults/mini.biom -c defaults/default_config_v2.ini

where default_config_v2.ini is the file below:

[Main]
# maximum number of processes
max_processors=8

# 0: community design + read simulator,
# 1: read simulator only
phase=0

# ouput directory, where the output will be stored (will be overwritten if set in from_profile)
output_directory=/cluster/projects/nn9383k/cassandh/camisim_out

# temporary directory
temp_directory=/cluster/projects/nn9383k/cassandh/tmp

# gold standard assembly
gsa=True

# gold standard for all samples combined
pooled_gsa=True

# anonymize sequences?
anonymous=True

# compress data (levels 0-9, recommended is 1 the gain of higher levels is not too high)
compress=1

# id of dataset, used in foldernames and is prefix in anonymous sequences
dataset_id=RL

# Read Simulation settings, relevant also for from_profile
[ReadSimulator]
# which readsimulator to use:
#           Choice of 'art', 'wgsim', 'nanosim', 'pbsim'
type=art

# Samtools (http://www.htslib.org/) takes care of sam/bam files. Version 1.0 or higher required!
# file path to executable
samtools=/cluster/software/SAMtools/1.14-GCC-11.2.0/bin/samtools

# file path to read simulation executable
readsim=tools/art_illumina-2.3.6/art_illumina

profile=mbarc

# Directory containing error profiles (can be blank for wgsim)
error_profiles=tools/art_illumina-2.3.6/profiles/

#paired end read, insert size (not applicable for nanosim)
fragments_size_mean=270
fragment_size_standard_deviation=27

# Only relevant if not from_profile is run:
[CommunityDesign]
# specify the samples size in Giga base pairs
size=0.1

# how many different samples?
number_of_samples=2

# how many communities
num_communities=1

# directory containing the taxdump of ncbi, version from 22.02.2017 is shipped
# "nodes.dmp"
# "merged.dmp"
# "names.dmp"
ncbi_taxdump=tools/ncbi-taxonomy_20170222.tar.gz

# the strain simulator for de novo strain creation
strain_simulation_template=scripts/StrainSimulationWrapper/sgEvolver/simulation_dir/

# define communities: [community<integer>]
[community0]
# information about all included genomes:
# can be used for multiple samples
metadata=defaults/metadata.tsv
id_to_genome_file=defaults/genome_to_id.tsv

# how many genomes do you want to sample over all?
genomes_total=2
num_real_genomes=2

# how many genomes per species taxon
#   (species taxon will be replaced by OTU-cluster later on)
max_strains_per_otu=1
ratio=1

# which kind of different samples do you need?
#   replicates / timeseries_lognormal / timeseries_normal / differential
mode=differential

# Part: community design
# Set parameters of log-normal and normal distribution, number of samples
# sigma > 0; influences shape (higher sigma -> smaller peak and longer tail),
log_sigma=2

# mu (real number) is a parameter for the log-scale
log_mu=1

# do you want to see a distribution before you decide to use it? yes/no
view=no

But I get this error after about 30 seconds:

2023-12-09 12:29:16 INFO: [root] Patching NCBITaxa's base methods. For reason, see https://github.com/etetoolkit/ete/issues/469.

2023-12-09 12:29:17 INFO: [root] Patch finished.
2023-12-09 12:29:17 WARNING: [root] Max strains per OTU not set, using default (3)
2023-12-09 12:29:17 WARNING: [root] Mu and sigma have not been set, using defaults (1,2)
2023-12-09 12:29:17 WARNING: [root] 438753 taxid not found
2023-12-09 12:29:17 WARNING: [root] 198804 taxid not found
...
...
...
2023-12-09 12:29:24 WARNING: [root] 1926494 taxid not found
2023-12-09 12:29:24 WARNING: [root] 1933220 taxid not found
2023-12-09 12:29:24 WARNING: [root] 1936080 taxid not found
2023-12-09 12:29:24 WARNING: [root] 1936081 taxid not found
2023-12-09 12:29:25 WARNING: [root] Some OTUs could not be mapped
2023-12-09 12:29:28 ERROR: [Community] Invalid digit, must be bigger than 1, but was 0
2023-12-09 12:29:28 ERROR: [MetagenomeSimulationPipeline] [community0] Has an invalid value!
2023-12-09 12:29:28 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted

max_strains_per_otu=1

2023-12-09 12:29:28 ERROR: [Community] Invalid digit, must be bigger than 1, but was 0
2023-12-09 12:29:28 ERROR: [MetagenomeSimulationPipeline] [community0] Has an invalid value!

AlphaSquad commented 11 months ago

As I cannot reproduce this, it is hard to tell what is going wrong. Did you change anything in mini.biom (the taxid warnings suggest so) or in any CAMISIM code?

Regarding your questions: metagenome_from_profile.py writes its own config file, based on the input one, and leaves out the parameters it does not need. This is not a problem.

Your first error message also contained Insufficient space! 12.49gb of 30.00gb available at '/tmp' and ERROR: name 'raw_input' is not defined. Do these errors re-occur, too?

Also, CAMISIM should have written some preliminary data to /cluster/projects/nn9383k/cassandh/tmp, in particular the config file it is going to use, which should point towards the error.

CassandraHjo commented 11 months ago

I have not changed anything in mini.biom that I know of. mini.biom looks like this:

{"id": "minimal example","format" : "Biological Observation Matrix 1.0.0", "format_url": "http://biom-format.org", "generated_by": "Adrian Fritz", "date": "2017-08-07T15:45:45.454545","matrix_element_type":"float", "shape" :[3,2], "type": "OTU table", "matrix_type": "sparse", "data":[[0,0,0.3],[1,0,0],[2,0,0.6],[0,1,0.2],[1,1,0.3],[2,1,0.4]], "rows":[{"id":"Genome1", "metadata": {"taxonomy" : "k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli"}},{"id":"Genome2", "metadata": {"taxonomy":"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia mulleri"}},{"id":"Genome3", "metadata": {"taxonomy" : "k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales"}}],"columns":[{"id":"Sample1","metadata": null},{"id":"Sample2","metadata": null}]}
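To see which scientific names have to be resolved to NCBI IDs for a profile like this one, the taxonomy strings can be read straight from the BIOM JSON. The following is a minimal sketch, assuming a BIOM 1.0 JSON profile whose rows carry a `taxonomy` string as above; the helper name `lowest_rank_names` is hypothetical and this is not CAMISIM's own parsing code.

```python
import json


def lowest_rank_names(biom_text):
    """Map each row id to the lowest-rank name in its taxonomy string."""
    profile = json.loads(biom_text)
    names = {}
    for row in profile["rows"]:
        # Row metadata may be null in BIOM, so fall back to an empty dict.
        metadata = row.get("metadata") or {}
        taxonomy = metadata.get("taxonomy", "")
        # Ranks are separated by ';' and prefixed like 'g__' or 's__'.
        lowest = taxonomy.split(";")[-1]
        names[row["id"]] = lowest.split("__", 1)[-1].strip()
    return names
```

Calling `lowest_rank_names(open("defaults/mini.biom").read())` on the file above should yield the names Escherichia coli, Escherichia mulleri and Enterobacterales for Genome1, Genome2 and Genome3.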

The only thing I have changed in the code is the path to the samtools executable, but this works when running de novo mode, so I do not think that is the problem here.

The error message about insufficient space in the tmp directory does not re-occur after I changed the tmp dir.

I tried to run the simulation again with this command:

python metagenome_from_profile.py -p defaults/mini.biom -c defaults/default_config_v2.ini -o /cluster/projects/nn9383k/cassandh/camisim_out -tmp /cluster/projects/nn9383k/cassandh/tmp

and then this config file is created in the output folder (/camisim_out):

[Main]
max_processors = 8
phase = 0
output_directory = /cluster/projects/nn9383k/cassandh/camisim_out/
temp_directory = /cluster/projects/nn9383k/cassandh/tmp/
gsa = True
pooled_gsa = True
anonymous = True
compress = 1
dataset_id = RL
number_of_samples = 2
distribution_file_paths = /cluster/projects/nn9383k/cassandh/camisim_out/abundance0.tsv,/cluster/projects/nn9383k/cassandh/camisim_out/abundance1.tsv

[ReadSimulator]
type = art
samtools = /cluster/software/SAMtools/1.14-GCC-11.2.0/bin/samtools
readsim = tools/art_illumina-2.3.6/art_illumina
profile = mbarc
error_profiles = tools/art_illumina-2.3.6/profiles/
fragments_size_mean = 270
fragment_size_standard_deviation = 27

[CommunityDesign]
size = 0.1
number_of_samples = 2
ncbi_taxdump = tools/ncbi-taxonomy_20170222.tar.gz
strain_simulation_template = scripts/StrainSimulationWrapper/sgEvolver/simulation_dir/

[community0]
metadata = /cluster/projects/nn9383k/cassandh/camisim_out/metadata.tsv
id_to_genome_file = /cluster/projects/nn9383k/cassandh/camisim_out/genome_to_id.tsv
id_to_gff_file = 
genomes_total = 0
num_real_genomes = 0
max_strains_per_otu = 1
ratio = 1
mode = differential
log_sigma = 2
log_mu = 1
view = no

Do you have any idea why genomes_total = 0 and num_real_genomes = 0 in this created config file, even though I use these settings in my input config file?

genomes_total=24
num_real_genomes=24

(I assume genomes_total to be 24 because the metadata file in the default folder contains 24 genomes)

The output directory is not empty, but it only contains the created config file (displayed above), a genomes folder (which is empty) and a metadata.tsv file (which only contains this header: genome_ID OTU NCBI_ID novelty_category).

AlphaSquad commented 11 months ago

Okay, it seems that fetching/mapping the genomes failed. That is why the genomes folder and the metadata file are empty, and CAMISIM fails because genomes_total is set to 0. The question is why this happens, because I cannot see it happening on my machines. It is probably related to the taxid not found warnings and thus to ete3. I installed an older version of ete3 and also got the taxid not found warnings, but CAMISIM still simulated a data set. Still, as a first step I would recommend updating ete3 to 3.1.3 and then rerunning CAMISIM with the --debug option. Since we have ruled out the config file etc. as culprits, I think you can go back to using the default files, i.e.

metagenome_from_profile.py -p defaults/mini.biom --debug

Thanks for your patience.

Edit: Remembering your other issue, did you edit the file tools/assembly_summary_complete_genomes.txt to be empty? This is the default file CAMISIM uses, and if it is empty CAMISIM will not be able to find any genomes (when I use an empty file, I also get the Invalid digit error).
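A quick way to confirm that a run failed at the genome mapping step is to look at genomes_total in the config file that metagenome_from_profile.py writes to the output folder. Below is a minimal sketch using Python's standard configparser; the function name `empty_communities` is hypothetical, and the assumption that community sections are named `community<integer>` is taken from the config files shown in this thread.

```python
import configparser


def empty_communities(config_text):
    """Return the [community*] sections whose genomes_total is below 1."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    empty = []
    for section in parser.sections():
        # Matches 'community0', 'community1', ... but not 'CommunityDesign'.
        if section.startswith("community"):
            total = parser.getint(section, "genomes_total", fallback=0)
            if total < 1:
                empty.append(section)
    return empty
```

For the generated config shown earlier in this thread, the call would return `['community0']`, which matches the Invalid digit error reported for [community0].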

CassandraHjo commented 11 months ago

I have not edited assembly_summary_complete_genomes.txt.

I have created a conda environment with the newest version of ete, and tried to run the simulation like you said with this command (I have to change the tmp directory due to storage):

python metagenome_from_profile.py -p defaults/mini.biom -tmp /cluster/projects/nn9383k/cassandh/tmp --debug

This is the config.ini file in the output folder:

[Main]
max_processors = 8
phase = 0
output_directory = out/
temp_directory = /cluster/projects/nn9383k/cassandh/tmp/
gsa = True
pooled_gsa = True
anonymous = True
compress = 1
dataset_id = RL
number_of_samples = 2
distribution_file_paths = out/abundance0.tsv,out/abundance1.tsv

[ReadSimulator]
type = art
samtools = /cluster/software/SAMtools/1.14-GCC-11.2.0/bin/samtools
readsim = tools/art_illumina-2.3.6/art_illumina
profile = mbarc
error_profiles = tools/art_illumina-2.3.6/profiles/
base_profile_name = 
profile_read_length = 
fragments_size_mean = 270
fragment_size_standard_deviation = 27

[CommunityDesign]
size = 5
number_of_samples = 1
num_communities = 1
ncbi_taxdump = tools/ncbi-taxonomy_20170222.tar.gz
strain_simulation_template = scripts/StrainSimulationWrapper/sgEvolver/simulation_dir/

[community0]
metadata = out/metadata.tsv
id_to_genome_file = out/genome_to_id.tsv
genomes_total = 0
num_real_genomes = 0
max_strains_per_otu = 1
ratio = 1
mode = 
log_sigma = 2
log_mu = 1
view = no

The metadata.tsv file is still empty, as well as the genomes folder in the output directory.

The error messages are listed below:

2023-12-11 15:28:24 INFO: [root] Patching NCBITaxa's base methods. For reason, see https://github.com/etetoolkit/ete/issues/469.

2023-12-11 15:28:24 INFO: [root] Patch finished.
2023-12-11 15:28:24 INFO: [root] Using commands:
2023-12-11 15:28:24 INFO: [root] -profile: defaults/mini.biom
2023-12-11 15:28:24 INFO: [root] -samples: None
2023-12-11 15:28:24 INFO: [root] -o: out/
2023-12-11 15:28:24 INFO: [root] -tmp: /cluster/projects/nn9383k/cassandh/tmp
2023-12-11 15:28:24 INFO: [root] -reference_genomes: tools/assembly_summary_complete_genomes.txt
2023-12-11 15:28:24 INFO: [root] -additional_references: None
2023-12-11 15:28:24 INFO: [root] -config: defaults/default_config.ini
2023-12-11 15:28:24 INFO: [root] -ncbi: tools/ncbi-taxonomy_20170222.tar.gz
2023-12-11 15:28:24 INFO: [root] -no_replace: True
2023-12-11 15:28:24 INFO: [root] -fill_up: False
2023-12-11 15:28:24 INFO: [root] -community_only: False
2023-12-11 15:28:24 INFO: [root] -seed: None
2023-12-11 15:28:24 INFO: [root] -debug: True
2023-12-11 15:28:24 WARNING: [root] Max strains per OTU not set, using default (3)
2023-12-11 15:28:24 WARNING: [root] Mu and sigma have not been set, using defaults (1,2)
2023-12-11 15:28:24 WARNING: [root] 438753 taxid not found
2023-12-11 15:28:24 WARNING: [root] 198804 taxid not found
2023-12-11 15:28:24 WARNING: [root] 224915 taxid not found
2023-12-11 15:28:24 WARNING: [root] 107806 taxid not found
...
...
...
2023-12-11 15:28:34 WARNING: [root] 1933220 taxid not found
2023-12-11 15:28:34 WARNING: [root] 1936080 taxid not found
2023-12-11 15:28:34 WARNING: [root] 1936081 taxid not found
2023-12-11 15:28:34 WARNING: [root] Some OTUs could not be mapped
2023-12-11 15:28:34 WARNING: [root] No matching NCBI ID for otu Genome3, scientific name Enterobacterales
2023-12-11 15:28:34 WARNING: [root] No matching NCBI ID for otu Genome1, scientific name Escherichia coli
2023-12-11 15:28:34 WARNING: [root] No matching NCBI ID for otu Genome2, scientific name Escherichia mulleri
2023-12-11 15:28:34 INFO: [root] Downloading 0 genomes
2023-12-11 15:28:35 INFO: [root] Community design finished
2023-12-11 15:28:42 ERROR: [Community] Invalid digit, must be bigger than 1, but was 0
2023-12-11 15:28:42 ERROR: [MetagenomeSimulationPipeline] [community0] Has an invalid value!
2023-12-11 15:28:42 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted

AlphaSquad commented 10 months ago

The key problem in this output is these lines:

2023-12-11 15:28:34 WARNING: [root] No matching NCBI ID for otu Genome3, scientific name Enterobacterales
2023-12-11 15:28:34 WARNING: [root] No matching NCBI ID for otu Genome1, scientific name Escherichia coli
2023-12-11 15:28:34 WARNING: [root] No matching NCBI ID for otu Genome2, scientific name Escherichia mulleri

These warnings mean that CAMISIM was not able to get NCBI IDs for these scientific names, although it definitely should have found them for the first two. You can test your ete3 installation by starting Python and running this code interactively:

from ete3 import NCBITaxa

ncbi = NCBITaxa()
name = "Enterobacterales"
ncbi.get_name_translator([name])

The result should be {'Enterobacterales': [91347]}. If it is not, then your ete3 installation is still faulty.

CassandraHjo commented 10 months ago

The problem was the NCBI database downloaded by the ete3 package. When running the code above I only received an empty dictionary ( {} ) as output. The problem was solved by deleting the local database downloaded by ete3 and running ncbi = NCBITaxa() again. More information can be found here.
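For anyone who wants to check the local database before deleting it, here is a minimal sketch that queries the SQLite file directly with the standard library. It rests on two assumptions that are not confirmed in this thread: that ete3 stores its database at ~/.etetoolkit/taxa.sqlite, and that scientific names are kept in a `species` table with `taxid` and `spname` columns. The function name `taxids_for_name` is hypothetical.

```python
import os
import sqlite3

# Assumption: default location of the taxonomy database downloaded by ete3.
DEFAULT_DB = os.path.join(os.path.expanduser("~"), ".etetoolkit", "taxa.sqlite")


def taxids_for_name(name, db_path=DEFAULT_DB):
    """Return the taxids stored for a scientific name, or [] if unavailable."""
    if not os.path.exists(db_path):
        return []
    connection = sqlite3.connect(db_path)
    try:
        # Assumption: names live in table 'species', columns 'taxid'/'spname'.
        rows = connection.execute(
            "SELECT taxid FROM species WHERE spname = ?", (name,)
        ).fetchall()
    except sqlite3.DatabaseError:
        # Missing table or corrupted file: treat the database as unusable.
        rows = []
    finally:
        connection.close()
    return [row[0] for row in rows]
```

If `taxids_for_name("Enterobacterales")` returns an empty list, the database is missing, empty or damaged, and deleting the file and calling NCBITaxa() again, as described above, is the fix that worked here.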

The default scripts for the from_profile design is now running.