CAMI-challenge / CAMISIM

CAMISIM: Simulating metagenomes and microbial communities
https://data.cami-challenge.org/participate
Apache License 2.0

How to use simulated strains? #100

Closed SR-Martin closed 2 years ago

SR-Martin commented 3 years ago

I'd like to use CAMISIM to simulate a metagenome with a lot of strain level variation, but I am having some trouble. The documentation states "Artificial strains evolved from real genomes are added to the community genome collection until the difference between genomes total and num real genomes has been reached." This suggests e.g. setting num_real_genomes=5 and genomes_total=10 (and max_strains_per_otu > 1) to include 5 strains simulated from the real genomes.

In this case, if my metadata contains 10 real genomes, then these all appear in the resulting metagenome, and there are no simulated strains. If there are fewer than 10 genomes in the metadata then I get the following error:

ERROR: [DefaultLogging] Not enough data to draw.
ERROR: [MetagenomeSimulationPipeline] Not enough data to draw. in line 83

Are there some extra parameters that need to be set? Or maybe I have misunderstood how CAMISIM works? Please help!
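
To make the arithmetic from the quoted documentation concrete, here is a minimal Python sketch, not CAMISIM code. It reads the two parameters from a config fragment and computes how many artificial strains would be expected under that reading of the documentation. The section name `community0` and the helper name `planned_simulated_strains` are assumptions chosen for illustration.

```python
import configparser

# Example values from the question above; the section name is an assumption.
EXAMPLE_CONFIG = """
[community0]
genomes_total=10
num_real_genomes=5
max_strains_per_otu=2
"""


def planned_simulated_strains(config_text, section="community0"):
    """Return genomes_total - num_real_genomes for one community section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    total = parser.getint(section, "genomes_total")
    real = parser.getint(section, "num_real_genomes")
    if real > total:
        raise ValueError("num_real_genomes must not exceed genomes_total")
    return total - real


if __name__ == "__main__":
    # With the values above, 10 - 5 = 5 simulated strains are expected.
    print(planned_simulated_strains(EXAMPLE_CONFIG))
```

This only restates the documented rule; it says nothing about how CAMISIM draws the real genomes internally.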

AlphaSquad commented 3 years ago

As far as I can see what you did is supposed to work. I could reproduce this behaviour and am investigating.

AlphaSquad commented 3 years ago

Hey, I think I found the source of this bug and fixed it. Can you test whether it works for you? Note that for better results with simulated strains you should provide a .gff file containing the coding sequences of the genomes to be simulated such that these aren't evolved.

SR-Martin commented 3 years ago

Hi Adrian,

Thanks for looking at this so quickly. I'll get the new version and test it asap.

SR-Martin commented 3 years ago

This seems to have fixed the problem; however, I am encountering another error. It's not clear to me yet whether this is a problem with the installation (I am using it on an HPC, which makes things a bit more complex) or a bug in the software. Here is the output anyway:

2020-12-17 12:50:29 WARNING: [Validator 10177412903] No gff file (gene annotation) was given. Simulating strains without such a file can break genes.
2020-12-17 12:50:29 INFO: [Validator 10177412903] Simulating strain evolution of 'Genome12.0'
2020-12-17 12:50:29 INFO: [Validator 10177412903] Simulating strain evolution of 'Genome14.0'
Failure in sgEvolver at /usr/local/bin/scripts/StrainSimulationWrapper/sgEvolver/simujobrun.pl line 49.
Failure in sgEvolver at /usr/local/bin/scripts/StrainSimulationWrapper/sgEvolver/simujobrun.pl line 49.
Task failed with return code: 255, task: cd /tmp/tmp3nv212/Genome12.0.strains; /usr/local/bin/scripts/StrainSimulationWrapper/sgEvolver/simujobrun.pl defaults/genomes/GCA_000231385.3_ASM23138v3.fa /tmp/tmp3nv212/tmpizoVzO 5633002653896701005 >> /tmp/tmp3nv212/Genome12.0.strains/GCA_000231385.3_ASM23138v3.fa.sim.log
Task failed with return code: 255, task: cd /tmp/tmp3nv212/Genome14.0.strains; /usr/local/bin/scripts/StrainSimulationWrapper/sgEvolver/simujobrun.pl defaults/genomes/GCA_000006785.2_ASM678v2.fa /tmp/tmp3nv212/tmpizoVzO 6180714611745142895 >> /tmp/tmp3nv212/Genome14.0.strains/GCA_000006785.2_ASM678v2.fa.sim.log
2020-12-17 12:50:30 ERROR: [Validator 10177412903] Task failed with return code: 255, task: cd /tmp/tmp3nv212/Genome12.0.strains; /usr/local/bin/scripts/StrainSimulationWrapper/sgEvolver/simujobrun.pl defaults/genomes/GCA_000231385.3_ASM23138v3.fa /tmp/tmp3nv212/tmpizoVzO 5633002653896701005 >> /tmp/tmp3nv212/Genome12.0.strains/GCA_000231385.3_ASM23138v3.fa.sim.log

2020-12-17 12:50:30 ERROR: [Validator 10177412903] Task failed with return code: 255, task: cd /tmp/tmp3nv212/Genome14.0.strains; /usr/local/bin/scripts/StrainSimulationWrapper/sgEvolver/simujobrun.pl defaults/genomes/GCA_000006785.2_ASM678v2.fa /tmp/tmp3nv212/tmpizoVzO 6180714611745142895 >> /tmp/tmp3nv212/Genome14.0.strains/GCA_000006785.2_ASM678v2.fa.sim.log

2020-12-17 12:50:30 ERROR: [Validator 10177412903] Simulation of strains failed.

I'll look into this and see if I can find the cause of the failure.

AlphaSquad commented 3 years ago

Ah yes, you will need to use absolute paths in the genome_to_id.tsv file for this to work.
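
For anyone hitting the same error, a minimal Python sketch of that fix could look like the following. It assumes that genome_to_id.tsv has two tab-separated columns (genome ID, then path) and that relative paths are meant to be resolved against the directory CAMISIM was started from. The function name `absolutize_mapping` and the `/opt/CAMISIM` directory are hypothetical.

```python
import os


def absolutize_mapping(src_tsv, dst_tsv, base_dir):
    """Copy a two-column TSV, making the path column absolute.

    Relative paths are resolved against base_dir (an assumption about
    where they were meant to point); absolute paths are kept unchanged.
    """
    rows = []
    with open(src_tsv) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip empty lines
            genome_id, path = line.split("\t", 1)
            if not os.path.isabs(path):
                path = os.path.normpath(os.path.join(base_dir, path))
            rows.append(f"{genome_id}\t{path}")
    with open(dst_tsv, "w") as handle:
        handle.write("\n".join(rows) + "\n")
    return rows


if __name__ == "__main__":
    # Hypothetical locations; adjust to your own installation.
    for row in absolutize_mapping("genome_to_id.tsv",
                                  "genome_to_id.abs.tsv",
                                  "/opt/CAMISIM"):
        print(row)
```

The result is written to a separate file so that the original mapping stays untouched until the output has been checked.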

SR-Martin commented 3 years ago

Great, this seems to be working now. Thanks for your help!

yazhinia commented 2 years ago

Hello, I am seeing a similar error when simulating strain evolution:

Failure in sgEvolver at /path/CAMISIM/scripts/StrainSimulationWrapper/sgEvolver/simujobrun.pl line 49.
2022-07-13 15:08:02 ERROR: [MetagenomeSimulationPipeline] [Errno 2] No such file or directory: '/tmp/tmpxayr4vkn/Genome17.0.strains/Taxon014.fasta' -> '/tmp/tmpxayr4vkn/Genome17.0.strains/simulated_Genome17.0.Taxon014.fna' in line 83
2022-07-13 15:08:02 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted

I followed the solutions suggested in this issue as well as in issue #132, but they didn't resolve the problem. Could you provide some hints on how to fix this error? Thank you.

AlphaSquad commented 2 years ago

Hi, just looking at this error it seems most likely that the sgEvolver itself failed (line 49 is the call to sgEvolver). Could you post your complete log (and your config file) if you still have it available? Unfortunately it is hard to tell what is going wrong just from this message alone.

yazhinia commented 2 years ago

The config.ini file:

seed=632741178
phase=0
max_processors=8
dataset_id=RL
output_directory=/home/users/yazhini.a01/software/CAMISIM
temp_directory=/tmp
gsa=True
pooled_gsa=True
anonymous=True
compress=1

[ReadSimulator]
readsim=/home/users/yazhini.a01/software/CAMISIM/tools/art_illumina-2.3.6/art_illumina
error_profiles=/home/users/yazhini.a01/software/CAMISIM/tools/art_illumina-2.3.6/profiles
samtools=/home/users/yazhini.a01/software/CAMISIM/tools/samtools-1.3/samtools
profile=mbarc
base_profile_name=
profile_read_length=
size=0.1
type=art
fragments_size_mean=270
fragment_size_standard_deviation=27

[CommunityDesign]
distribution_file_paths=
ncbi_taxdump=/home/users/yazhini.a01/software/CAMISIM/tools/ncbi-taxonomy_20170222.tar.gz
strain_simulation_template=/home/users/yazhini.a01/software/CAMISIM/scripts/StrainSimulationWrapper/sgEvolver/simulation_dir
number_of_samples=20

[community0]
metadata=/home/users/yazhini.a01/software/CAMISIM/defaults/metadata.tsv
id_to_genome_file=/home/users/yazhini.a01/software/CAMISIM/defaults/genome_to_id.tsv
id_to_gff_file=/home/users/yazhini.a01/software/CAMISIM/defaults/genome_to_gff.tsv
genomes_total=5
num_real_genomes=3
max_strains_per_otu=2
ratio=1
mode=differential
log_mu=1
log_sigma=2
gauss_mu=1
gauss_sigma=1
view=False

The terminal output is given below:

2022-07-13 18:03:06 INFO: [MetagenomeSimulationPipeline] Metagenome simulation starting
2022-07-13 18:03:06 INFO: [MetagenomeSimulationPipeline] Validating Genomes
2022-07-13 18:03:06 INFO: [MetadataReader] Reading file: '/home/users/yazhini.a01/software/CAMISIM/defaults/genome_to_id.tsv'
2022-07-13 18:03:24 INFO: [MetagenomeSimulationPipeline] Design Communities
2022-07-13 18:03:24 INFO: [CommunityDesign] Drawing strains.
2022-07-13 18:03:24 INFO: [MetadataReader 1918002902] Reading file: '/home/users/yazhini.a01/software/CAMISIM/defaults/metadata.tsv'
2022-07-13 18:03:24 INFO: [MetadataReader 9013202836] Reading file: '/home/users/yazhini.a01/software/CAMISIM/defaults/genome_to_gff.tsv'
2022-07-13 18:03:24 INFO: [MetadataReader 46447426715] Reading file: '/home/users/yazhini.a01/software/CAMISIM/defaults/genome_to_id.tsv'
2022-07-13 18:03:24 INFO: [CommunityDesign] Validating raw sequence files!
2022-07-13 18:03:27 INFO: [Validator 31395689975] Simulating strain evolution of 'Genome17.0'
Failure in sgEvolver at /home/mpg01/MBPC/yazhini.a01/software/CAMISIM/scripts/StrainSimulationWrapper/sgEvolver/simujobrun.pl line 49.
2022-07-13 18:03:27 ERROR: [MetagenomeSimulationPipeline] [Errno 2] No such file or directory: '/tmp/tmp7f3v_qmh/Genome17.0.strains/Taxon014.fasta' -> '/tmp/tmp7f3v_qmh/Genome17.0.strains/simulated_Genome17.0.Taxon014.fna' in line 83
2022-07-13 18:03:27 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted

AlphaSquad commented 2 years ago

Okay, thank you, it looks fine in principle. I am investigating this. In the meantime, could you run CAMISIM with the -debug flag and pipe the log to a file to see if anything more shows up?

yazhinia commented 2 years ago

Thank you. With the -debug flag, some files are written to the tmp folder. Here are some details from sgEvolver.err:

Unhandled gnException: Exception FileNotOpened thrown from Unknown() in gnFileSource.cpp 67
Called by Unknown()
Exited with code 65280

and from GCA_000242255.3_ASM24225v3.fa.sim.log

Executing /home/mpg01/MBPC/yazhini.a01/software/CAMISIM/scripts/StrainSimulationWrapper/sgEvolver/sgEvolver --stop-codon-bias=0.98 --ancestral-gff=defaults/gffs/Genome17.0-GCF_000242255.2_genomic.gff --accessory-gff=defaults/gffs/Genome17.0-GCF_000242255.2_genomic.gff --indel-size=1 --indel-freq=0.05 --small-ht-freq=0.05 --small-ht-size=200 --large-ht-freq=0.005 --inversion-freq=0.005 --large-ht-min=10000 --large-ht-max=60000 --random-seed=358327309434766444 --inversion-size=50000 template.tree defaults/genomes/GCA_000242255.3_ASM24225v3.fa defaults/genomes/GCA_000242255.3_ASM24225v3.fa evolved.dat evolved_seqs.fas >sgEvolver.out 2>sgEvolver.err

AlphaSquad commented 2 years ago

Could you make sure that all the files referenced in this command are present and that it runs in a vacuum? I.e., the genome file seems to be one of the genomes shipped with CAMISIM by default, while the gff file comes from you?

yazhinia commented 2 years ago

Yes, I obtained the .gff files from NCBI (one for each of the default genomes shipped with CAMISIM), since you had suggested giving them as additional input for strain simulation. So how should I provide the .gff files then? Also, sorry, I don't understand the statement "it runs in a vacuum".

AlphaSquad commented 2 years ago

The line you sent from the log is the call to sgEvolver that CAMISIM creates internally. It should be possible to run this command without CAMISIM, i.e. running

/home/mpg01/MBPC/yazhini.a01/software/CAMISIM/scripts/StrainSimulationWrapper/sgEvolver/sgEvolver --stop-codon-bias=0.98 --ancestral-gff=defaults/gffs/Genome17.0-GCF_000242255.2_genomic.gff --accessory-gff=defaults/gffs/Genome17.0-GCF_000242255.2_genomic.gff --indel-size=1 --indel-freq=0.05 --small-ht-freq=0.05 --small-ht-size=200 --large-ht-freq=0.005 --inversion-freq=0.005 --large-ht-min=10000 --large-ht-max=60000 --random-seed=358327309434766444 --inversion-size=50000 template.tree defaults/genomes/GCA_000242255.3_ASM24225v3.fa defaults/genomes/GCA_000242255.3_ASM24225v3.fa evolved.dat evolved_seqs.fas >sgEvolver.out 2>sgEvolver.err

in your console/bash from the CAMISIM directory should work if all the files are present (it is strange, though, that template.tree, which is in the scripts/StrainSimulationWrapper/sgEvolver/simulation_dir folder, does not have a prefix). This makes me think that you should probably use absolute paths for all your genomes in the genome_to_id.tsv and genome_to_gff.tsv files. If it does not run, then the problem is in sgEvolver or in one of the files provided to this command. If it does run, then the problem lies within CAMISIM.
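
The "are all the files present" check can also be scripted. The Python sketch below is not part of CAMISIM. It pulls the file arguments out of a logged sgEvolver command and reports which of them do not exist relative to a chosen working directory. It assumes that options ending in `-gff` carry input paths, that the first three positional arguments are inputs, and that the remaining two are outputs; the function name `missing_inputs` is hypothetical.

```python
import os
import shlex


def missing_inputs(command, workdir):
    """List input files named in an sgEvolver command that are absent.

    Assumptions: '--...-gff=' options hold input paths, the first three
    positional arguments are inputs, and the rest are output files.
    """
    # Drop shell redirections such as '>sgEvolver.out' and '2>sgEvolver.err'.
    tokens = [t for t in shlex.split(command) if ">" not in t]
    candidates = []
    positionals = []
    for token in tokens[1:]:  # tokens[0] is the executable itself
        if token.startswith("--"):
            key, _, value = token.partition("=")
            if key.endswith("-gff") and value:
                candidates.append(value)
        else:
            positionals.append(token)
    candidates.extend(positionals[:3])
    unique = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    # os.path.join leaves absolute paths unchanged.
    return [p for p in unique
            if not os.path.isfile(os.path.join(workdir, p))]


if __name__ == "__main__":
    cmd = ("sgEvolver --ancestral-gff=defaults/gffs/example.gff "
           "--indel-size=1 template.tree defaults/genomes/example.fa "
           "defaults/genomes/example.fa evolved.dat evolved_seqs.fas "
           ">sgEvolver.out 2>sgEvolver.err")
    print(missing_inputs(cmd, os.getcwd()))
```

Running it once with the CAMISIM directory and once with the temporary strain directory as `workdir` shows which relative paths only resolve from one of the two locations.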

yazhinia commented 2 years ago

Thank you for the pointer. So the absolute paths have to be given in genome_to_id.tsv and genome_to_gff.tsv. It works normally now. It turns out to be the same solution you mentioned before, but I only understood it now. Thanks very much.

AlphaSquad commented 2 years ago

Yes, sorry, that should be documented better (since it works with relative paths as long as no strains are simulated). I will add something to the documentation and hope that this kind of error disappears in our 2.0 version, which is coming soon™