CAMI-challenge / CAMISIM

CAMISIM: Simulating metagenomes and microbial communities
https://data.cami-challenge.org/participate
Apache License 2.0

How to speed up the microbiome simulation process using CamiSim #163

Closed YunyunGao374 closed 1 year ago

YunyunGao374 commented 1 year ago

Hello,

I'm looking for ways to speed up microbiome simulation with CAMISIM. I have a host genome and 20 bacterial genomes, totaling around 1.1 GB. My goal is to create a simulated microbiome population of approximately 6 Gb. The simulation has now been running for 190 hours, which is clearly very slow.

I tried to improve the speed by adjusting the max_processors parameter, using values such as 32, 64, and 128, but this made no noticeable difference. I also tried running python ~/camisim/CAMISIM-master/metagenomesimulation.py test.ini --thread 20 to use multiple threads, but got an error saying that the --thread 20 option is not supported.

Given that I'm new to data analysis, I'm reaching out for assistance. Could you provide any guidance or suggestions to help me address these challenges? Thanks.

AlphaSquad commented 1 year ago

Hi, thank you for your interest in CAMISIM. CAMISIM can be somewhat slow, but 190 hours for a single sample of 6 GB seems excessive. Could you attach the config file you were using so we can find out more? Unfortunately, CAMISIM in its current state has some parts which are not properly parallelised, so adding more cores and threads will not help much, particularly because most of the work is actually file I/O, which is limited more by the speed of your hard drive(s) than by your processor(s). A new version of CAMISIM that uses Nextflow, and thereby parallelises more efficiently, will hopefully be out this fall.

YunyunGao374 commented 1 year ago

Yes. I checked the tmp folder, and some files are still being created, which means the run is still in progress.

Sorry I cannot attach the config file directly, but here is the information.

[Main]
seed=632741178
phase=0
max_processors=32
dataset_id=OTU
output_directory=/public/home/liuyongxin/gyy/human/out1
temp_directory=/public/home/liuyongxin/gyy/human/tmp1
gsa=True
pooled_gsa=True
anonymous=True
compress=1

[ReadSimulator]
readsim=~/db/soft/camisim/CAMISIM-master/tools/art_illumina-2.3.6/art_illumina
error_profiles=~/db/soft/camisim/CAMISIM-master/tools/art_illumina-2.3.6/profiles
samtools=~/db/soft/camisim/CAMISIM-master/tools/samtools-1.3/samtools
profile=mbarc
size=5.4
type=art
fragments_size_mean=270
fragment_size_standard_deviation=27

[CommunityDesign]
distribution_file_paths=out/abundance0.tsv,out/abundance1.tsv,out/abundance2.tsv,out/abundance3.tsv,out/abundance4.tsv,out/abundance5.tsv,out/abundance6.tsv,out/abundance7.tsv,out/abundance8.tsv,out/abundance9.tsv
ncbi_taxdump=~/db/soft/camisim/CAMISIM-master/tools/ncbi-taxonomy_20170222.tar.gz
strain_simulation_template=~/db/soft/camisim/CAMISIM-master/scripts/StrainSimulationWrapper/sgEvolver/simulation_dir
number_of_samples=5

[community0]
metadata=/public/home/liuyongxin/gyy/human/metadata1.tsv
id_to_genome_file=/public/home/liuyongxin/gyy/human/genome_to_id1.tsv
id_to_gff_file=
genomes_total=25
num_real_genomes=25
max_strains_per_otu=1
ratio=1
mode=differential
log_mu=1
log_sigma=2
gauss_mu=1
gauss_sigma=1
view=False
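
As a quick sanity check on how much data these settings request, the two relevant keys can be read with Python's standard configparser and multiplied. This is only a rough sketch: it assumes that size in [ReadSimulator] is the per-sample output in Gbp and that number_of_samples in [CommunityDesign] is the number of samples generated. The function name estimate_total_gbp and the trimmed-down CONFIG_TEXT are made up for illustration and are not part of CAMISIM.

```python
import configparser

# Only the two keys needed for the estimate, copied from the config above.
CONFIG_TEXT = """
[ReadSimulator]
size=5.4

[CommunityDesign]
number_of_samples=5
"""


def estimate_total_gbp(config_text):
    """Return per-sample size times number of samples (assumed to be in Gbp)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    size_per_sample = parser.getfloat("ReadSimulator", "size")
    n_samples = parser.getint("CommunityDesign", "number_of_samples")
    return size_per_sample * n_samples


if __name__ == "__main__":
    print(f"{estimate_total_gbp(CONFIG_TEXT):.1f} Gbp requested in total")
```

With the values posted here the request is several times larger than a single 6 Gb sample, which matters when judging the runtime.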

AlphaSquad commented 1 year ago

Note that you are creating 5 samples with 5.4 Gb each, so 27 Gb of data in total. 190 hours still seems a lot, but it is not entirely impossible that CAMISIM is actually running that long. I have noticed a significant slowdown of CAMISIM when one of the input genomes is big. That seems to be the case for your host genome (~1 GB?), so I suspect it is the "culprit". If I remember correctly, that slowdown is unfortunately caused by the read simulation (using ART), and there is not much we can do about it. I suggest you let it run a little longer and see if CAMISIM finishes in another few days. Alternatively, you could split your host genome into, e.g., 10 smaller parts and see whether that speeds things up.
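
One possible way to do the split, sketched in plain Python with no CAMISIM code involved: distribute the FASTA records of the host genome (e.g. its chromosomes or scaffolds) over N smaller files, assigning the longest records first to whichever part is currently smallest. The names split_fasta and read_fasta and the host_part{i}.fasta file names are invented for this example. It assumes the host genome is a multi-record FASTA file and that it fits in memory, and it does not cut individual sequences. Each resulting part would presumably need its own entry in the id_to_genome_file, metadata and abundance files, which this sketch does not handle.

```python
import heapq
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(path, n_parts, out_dir):
    """Spread the records of one FASTA file over up to n_parts files.

    Records are placed longest-first into the part with the fewest bases
    so far, which keeps the parts roughly balanced in size.
    """
    records = sorted(read_fasta(path), key=lambda r: len(r[1]), reverse=True)
    n_parts = min(n_parts, len(records))
    heap = [(0, i) for i in range(n_parts)]  # (bases so far, part index)
    parts = [[] for _ in range(n_parts)]
    for header, seq in records:
        total, idx = heapq.heappop(heap)
        parts[idx].append((header, seq))
        heapq.heappush(heap, (total + len(seq), idx))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for idx, part in enumerate(parts):
        out_path = out_dir / f"host_part{idx}.fasta"  # hypothetical naming
        with open(out_path, "w") as out:
            for header, seq in part:
                out.write(f">{header}\n")
                # Wrap sequence lines at 60 characters.
                for start in range(0, len(seq), 60):
                    out.write(seq[start:start + 60] + "\n")
        paths.append(out_path)
    return paths
```

For example, split_fasta("host.fasta", 10, "host_parts") would write up to ten files into host_parts/, where host.fasta stands in for the real host genome path. If the host genome is a single very long sequence, this approach would not help, and the sequence itself would have to be cut instead.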

YunyunGao374 commented 1 year ago

Yes, I wanted five replicate populations. Thanks so much for your suggestion. I will try splitting the big genome into several smaller parts and see how it goes. Thanks again.