CAMI-challenge / CAMISIM

CAMISIM: Simulating metagenomes and microbial communities
https://data.cami-challenge.org/participate
Apache License 2.0

Error in strain evolution #142

Closed YiJessePi closed 1 year ago

YiJessePi commented 1 year ago

First, I would like to thank you for this great tool and its maintenance! Second, I would love to better understand the differences in how mutations are simulated for real genomes and simulated genomes. Are real genomes subjected only to sequencing error profiles (ART+MBARC), while simulated genomes additionally undergo strain evolution with sgEvolver? Is there any other process in the simulation that introduces mutations into the simulated reads?

Now for the error itself: I'm using metagenomesimulation.py and also want simulated genomes in the output.

The command (debug mode):

python /tools/CAMISIM/metagenomesimulation.py "config.ini" --debug

The error:

2022-09-21 12:48:28 INFO: [MetadataReader 84621696902] Reading file: '50_genomes.txt'
2022-09-21 12:48:28 INFO: [CommunityDesign] Validating raw sequence files!
2022-09-21 12:48:28 WARNING: [Validator 6183491518] No gff file (gene annotation) was given. Simulating strains without such a file can break genes.
2022-09-21 12:48:28 INFO: [Validator 6183491518] Simulating strain evolution of 'genome-11'
2022-09-21 12:48:31 DEBUG: [MetagenomeSimulationPipeline] Traceback (most recent call last):
  File "/labs/tools/CAMISIM/metagenomesimulation.py", line 83, in run_pipeline
    genome_id_to_path_map, list_of_file_paths_distributions = self._design_community()
  File "/labs/tools/CAMISIM/metagenomesimulation.py", line 261, in _design_community
    directory_in_template=directory_simulation_template)
  File "/labs/tools/CAMISIM/scripts/ComunityDesign/communitydesign.py", line 549, in design_samples
    directory_in_template=directory_in_template)
  File "/labs/tools/CAMISIM/scripts/ComunityDesign/communitydesign.py", line 345, in design_community
    genome_id_to_file_path_gff=genome_id_to_file_path_gff)
  File "/labs/tools/CAMISIM/scripts/StrainSimulationWrapper/strainsimulationwrapper.py", line 433, in simulate_strains
    self._pick_random_strains(meta_table, genome_id_to_amounts, genome_id_to_file_path_genome)
  File "/labs/tools/CAMISIM/scripts/StrainSimulationWrapper/strainsimulationwrapper.py", line 530, in _pick_random_strains
    os.rename(source, destination)
FileNotFoundError: [Errno 2] No such file or directory: 'test3/tmp/tmpjgs_wdyu/genome-11.strains/Taxon015.fasta' -> 'test3/tmp/tmpjgs_wdyu/genome-11.strains/simulated_genome-11.Taxon015.fna'
2022-09-21 12:48:31 ERROR: [MetagenomeSimulationPipeline] [Errno 2] No such file or directory: 'test3/tmp/tmpjgs_wdyu/genome-11.strains/Taxon015.fasta' -> 'test3/tmp/tmpjgs_wdyu/genome-11.strains/simulated_genome-11.Taxon015.fna' in line 83

The file is indeed missing, although I'm fairly sure sgEvolver finished: other Taxon*.fasta files do exist. I'm not sure why strainsimulationwrapper.py tries to rename a file that does not exist. Note that I do not have gff files for these genomes, which is what triggers the warning. Thanks!
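To narrow down which strain files are actually present, something like the following could be run against the `genome-11.strains` directory. This is only a diagnostic sketch, not part of CAMISIM: the helper names `list_taxon_fastas` and `missing_taxa` are hypothetical, and the `TaxonNNN.fasta` naming pattern is assumed from the file names in the traceback.

```python
import re
from pathlib import Path


def list_taxon_fastas(strain_dir):
    """Return the sorted Taxon numbers that have a TaxonNNN.fasta file in strain_dir.

    The 'Taxon<digits>.fasta' pattern is an assumption based on the file
    names that appear in the error message.
    """
    pattern = re.compile(r"^Taxon(\d+)\.fasta$")
    numbers = []
    for path in Path(strain_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


def missing_taxa(strain_dir, expected_numbers):
    """Return the expected Taxon numbers that have no fasta file in strain_dir."""
    present = set(list_taxon_fastas(strain_dir))
    return sorted(n for n in expected_numbers if n not in present)
```

For example, `missing_taxa("test3/tmp/tmpjgs_wdyu/genome-11.strains", [15])` would return `[15]` if only Taxon015.fasta is absent. Comparing the full list of present numbers against the one in the error may show whether the missing file is an isolated gap or lies beyond the range sgEvolver produced.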

AlphaSquad commented 1 year ago

Thank you for the kind words! If CAMISIM is set up to simulate strains, it does that first. The read simulation that follows is independent of whether strains were simulated or not, so all errors introduced into the reads at that stage are supposed to be technical in nature (i.e. sequencing errors). The nature of these errors is determined by the simulator and the error profile (e.g. ART+MBARC, as you mentioned).

About the error: unfortunately, I am not sure what the problem might be. Could you provide me with the config file/options you used so I can test this on my end?

YiJessePi commented 1 year ago

Thank you. Of course! The config file looks like this:

`
[Main]
seed=11111
phase=0
max_processors=8
dataset_id=RL
output_directory=/labs/outdir
temp_directory=/labs/outdir/tmp
gsa=True
pooled_gsa=True
anonymous=False
compress=0

[ReadSimulator]
readsim=/labs/tools/CAMISIM/tools/art_illumina-2.3.6/art_illumina
error_profiles=/labs/tools/CAMISIM/tools/art_illumina-2.3.6/profiles
samtools=/labs/tools/CAMISIM/tools/samtools-1.3/samtools
profile=mbarc
base_profile_name=
profile_read_length=
size=0.1
type=art
fragments_size_mean=270
fragment_size_standard_deviation=27

[CommunityDesign]
distribution_file_paths=
ncbi_taxdump=/labs/tools/CAMISIM/tools/NCBI
strain_simulation_template=/labs/tools/CAMISIM/scripts/StrainSimulationWrapper/sgEvolver/simulation_dir
number_of_samples=1

[community0]
metadata=/labs/sample12/metadata_50_genomes.txt
id_to_genome_file=/labs/sample12/50_genomes.txt
id_to_gff_file=
genomes_total=10
num_real_genomes=1
max_strains_per_otu=1
ratio=1
mode=differential
log_mu=1
log_sigma=2
gauss_mu=1
gauss_sigma=1
view=False
`

To note: several files were generated under tmp/tmp/genome.strains, namely multiple TaxonXXX.fasta files, evolved.dat, and evolved_seqs.fas. In this case, however, Taxon015 specifically was not created, and it seems that strainsimulationwrapper.py tries to rename it. Thank you for your support!
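As a quick sanity check on the community settings, the config can be read with Python's standard `configparser`. The sketch below assumes that `genomes_total` minus `num_real_genomes` is the number of genomes left to strain simulation; that reading of the two keys is an assumption here, not something confirmed in this thread, and `simulated_genome_counts` is a hypothetical helper name.

```python
import configparser


def simulated_genome_counts(config_text):
    """Return {section: genomes_total - num_real_genomes} for each community section.

    Assumption: sections named 'community<N>' describe communities, and the
    difference of the two keys is how many genomes must be simulated.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    counts = {}
    for section in parser.sections():
        # Section names are case-sensitive, so '[CommunityDesign]' is skipped here.
        if not section.startswith("community"):
            continue
        total = parser.getint(section, "genomes_total")
        real = parser.getint(section, "num_real_genomes")
        counts[section] = total - real
    return counts
```

With `genomes_total=10` and `num_real_genomes=1`, this gives 9 for `community0`, so under the stated assumption most of the community in this run would depend on the strain simulation step succeeding.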

AlphaSquad commented 1 year ago

Hey, sorry for not having responded in a while. I have tried hard to reproduce this problem, but for me (using your config file, but with the genomes from defaults/genomes/) the strain simulation and the subsequent CAMISIM run did in fact work. It is a little confusing that some genomes were simulated: there are multiple possible problems, but most of them cause the strain simulation to fail entirely (e.g. relative paths in the id_to_genome_file). If you still have the temporary files, is Taxon015 also missing from evolved_seqs.fas? Did you try running the command (stored in command_lines.txt in the temporary directory) manually? Thanks
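Both checks suggested here can be scripted. The following is a minimal sketch under two assumptions that should be verified against the actual files: that evolved_seqs.fas is a FASTA-style file whose header lines start with `>` and contain the taxon name, and that the id_to_genome_file is tab-separated with the genome ID in the first column and the path in the second. All function names are hypothetical.

```python
import os


def fasta_headers(path):
    """Return the header lines (without the leading '>') of a FASTA-style file."""
    with open(path) as handle:
        return [line[1:].strip() for line in handle if line.startswith(">")]


def has_taxon(path, taxon_name):
    """True if any header in the file contains taxon_name, e.g. 'Taxon015'."""
    return any(taxon_name in header for header in fasta_headers(path))


def relative_genome_paths(id_to_genome_file):
    """Return (genome_id, path) pairs whose path is not absolute.

    Assumes one 'genome_id<TAB>path' entry per line; lines without a tab
    yield an empty path and are reported as well.
    """
    relative = []
    with open(id_to_genome_file) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            genome_id, _, genome_path = line.partition("\t")
            if not os.path.isabs(genome_path):
                relative.append((genome_id, genome_path))
    return relative
```

If `has_taxon(".../evolved_seqs.fas", "Taxon015")` is false, the sequence was presumably never produced by sgEvolver; if it is true, the problem more likely lies in how the per-taxon files are split out afterwards. An empty result from `relative_genome_paths` would rule out the relative-path cause mentioned above.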