Can I model abundance distributions of functional genes?

CAMI-challenge / CAMISIM

CAMISIM: Simulating metagenomes and microbial communities

https://data.cami-challenge.org/participate

Apache License 2.0


Can I model abundance distributions of functional genes? #94

Closed xuechunxu closed 3 years ago

xuechunxu commented 4 years ago

Hello,

Can I model abundance distributions of functional genes and simulate the corresponding shotgun metagenome datasets?

Many thanks!

chunxu

AlphaSquad commented 4 years ago

Hey, unfortunately I am not entirely sure what you are planning to do. In my understanding, shotgun metagenomics entails complete genomes being simulated, but it sounds like you only want to simulate functional genes?

xuechunxu commented 4 years ago

Hey, I want to simulate metagenome datasets in which I know the relative abundance of functional genes. For example, the relative abundance of gene A would be known in the CAMISIM output. That means using functional genes instead of complete genomes, but I don't think that can work.

xuechunxu commented 4 years ago

Another question: what is the meaning of "seed" in the defaults/mini_config.ini file? And can I set the size of the simulated reads?

AlphaSquad commented 4 years ago

If you only want to simulate reads from these functional genes you would need to use these as your "genomes".
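One way to prepare gene sequences as stand-in "genomes" is to write each gene to its own FASTA file and keep a tab-separated list of IDs and file paths. The sketch below is only an illustration: the function name split_fasta, the example sequences, and the ID-to-path layout of the mapping are assumptions, so check the CAMISIM documentation for the exact input files it expects.

```python
import tempfile
from pathlib import Path

def split_fasta(fasta_text, out_dir):
    """Write each FASTA record to its own file; return {record_id: path}."""
    out_dir = Path(out_dir)
    records = {}
    current_id = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record ID.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)

    paths = {}
    for record_id, seq_lines in records.items():
        path = out_dir / f"{record_id}.fasta"
        path.write_text(f">{record_id}\n" + "\n".join(seq_lines) + "\n")
        paths[record_id] = path
    return paths

# Hypothetical gene sequences, for illustration only.
genes = ">geneA some description\nATGGCC\nTTAA\n>geneB\nATGCCCGGG\n"
out_dir = tempfile.mkdtemp()
gene_files = split_fasta(genes, out_dir)

# One "genome" per gene: ID, tab, path (assumed layout).
mapping = "".join(f"{gene_id}\t{path}\n" for gene_id, path in gene_files.items())
print(mapping)
```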

A seed is used to ensure reproducibility: since the read simulators work with randomness, setting the same seed for the random number generators ensures that the output is the same for two runs.
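The principle can be illustrated in plain Python. This is not CAMISIM's own code: the function draw_read_starts and its numbers are hypothetical and only show that generators created with the same seed produce the same draws.

```python
import random

def draw_read_starts(seed, genome_length=1000, n_reads=5):
    """Draw pseudo-random read start positions from a seeded generator."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.randrange(genome_length) for _ in range(n_reads)]

run_a = draw_read_starts(seed=42)
run_b = draw_read_starts(seed=42)

# Same seed, same sequence of draws.
print(run_a == run_b)
```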

The size of the reads is controlled with the fragments_size_mean and fragment_size_standard_deviation parameters in the config file. The size parameter describes the size per sample (in Gigabases).

xuechunxu commented 4 years ago

Sorry, I didn't make it clear about the size of the reads. I mean the size of the file of simulated reads, that is the file anonymous_reads.fq.gz.

AlphaSquad commented 4 years ago

Then the size parameter will be the controlling factor. The size of the read file will be roughly number_of_samples * size (in GB). Since the file itself is compressed afterwards, the actual size might be a little less.
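As a rough back-of-the-envelope sketch of that estimate: the function name estimate_output is hypothetical, and the read length of 150 bases is an assumption that depends on the read simulator profile in use. The compressed file size on disk is not modelled here.

```python
def estimate_output(number_of_samples, size_gbp, read_length=150):
    """Estimate total simulated bases and reads across all samples."""
    total_bases = number_of_samples * size_gbp * 1e9  # size is per sample, in gigabases
    total_reads = int(total_bases // read_length)     # assumed fixed read length
    return total_bases, total_reads

total_bases, total_reads = estimate_output(number_of_samples=5, size_gbp=0.1)
print(f"{total_bases / 1e9:.2f} Gbp in total, about {total_reads:,} reads")
```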

xuechunxu commented 4 years ago

I used the same data and parameters to run CAMISIM, but the result is different every time.

Cyanobacteria.zip is the file I used. I ran python metagenomesimulation.py defaults/mini_config.ini with number_of_samples=5 and the same relative abundance of genomes for each sample. I expected the five simulated samples to contain the same reads, but they are different.

AlphaSquad commented 4 years ago

If you want to set your abundance distributions manually, you need to add a parameter to the config: distribution_file_paths in the CommunityDesign section. This parameter points to your abundance files, which have to be tab-separated files with genome_ID and abundance. See also here. If you have done this and set the seed parameter, then two subsequent runs of CAMISIM will be the same. The reads of the individual samples will still differ, though, because CAMISIM is designed so that two different samples never receive the exact same reads from the same genomes. If you want identical samples, you would have to start two different CAMISIM runs with the same seed.
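A minimal sketch of preparing such files with Python's standard library follows. The gene IDs, abundance values, file names, and seed value are hypothetical. Listing several files as a comma-separated value and placing seed in a Main section are assumptions here; compare them against your own copy of defaults/mini_config.ini.

```python
import configparser
import csv
import io
import tempfile
from pathlib import Path

def write_abundance_file(path, abundances):
    """Write one genome_ID<TAB>abundance row per entry."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for genome_id, abundance in abundances.items():
            writer.writerow([genome_id, abundance])

# Hypothetical relative abundances, identical for every sample.
abundances = {"geneA": 0.6, "geneB": 0.3, "geneC": 0.1}

out_dir = Path(tempfile.mkdtemp())
paths = []
for sample in range(2):
    path = out_dir / f"abundance{sample}.tsv"
    write_abundance_file(path, abundances)
    paths.append(str(path))

config = configparser.ConfigParser()
config["Main"] = {"seed": "12345"}  # assumed section; arbitrary seed value
config["CommunityDesign"] = {
    "distribution_file_paths": ",".join(paths),  # assumed comma-separated list
}

# Render the fragment as text so it can be merged into an existing config by hand.
rendered = io.StringIO()
config.write(rendered)
print(rendered.getvalue())
```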