CAMI-challenge / CAMISIM

CAMISIM: Simulating metagenomes and microbial communities
https://data.cami-challenge.org/participate
Apache License 2.0
167 stars 36 forks source link

Guidance on abundance.tsv file and output #109

Open NienkeMekkes opened 3 years ago

NienkeMekkes commented 3 years ago

Dear authors,

I remember asking this question before, but when looking back, I still am having some last doubts I was hoping you could help me with.

This is a summary of organism, abundance.tsv, and genome size Salmonella bongori | 0.10 | 4487548 Salmonella enterica | 0.4 | 5028552 escherichia coli | 0.5 | 4643559

When I use CAMISIM to simulate my sample, I use the readmappings.tsv output file with grep -c to check my abundance. In this output file, I have exactly 10% of reads belonging o S bongori, 40% to S enterica, and 50 to e.coli

Is this expected behaviour? I read that the simulation is supposed to take genome length into consideration, so I expected that the amount of reads belonging to S.enterica would be inflated, since the genome is bigger. Why is this not the case?

AlphaSquad commented 3 years ago

Hm, that is strange, are you sure that these values fit exactly? The difference between S. bongori and S. enterica in genome size is only ~10%, so instead of 10% and 40% I would expect CAMISIM to simulated something like 9% and 41% of reads for both: If the coverage is ratio 4:1 between the two genomes and the genome size ratio is 1.1:1, then we expect a read ratio of 4.4:1 in reads between the two - which translates to ~40.75%:9.25% (here neglecting the third genome which is not exactly halfway between the two Salmonella)

NienkeMekkes commented 3 years ago

Hi and thank you for your quick reply. Your calculations are also what I expected to happen.

When I use grep -c on my readmappings.tsv file, I find this file has a total of 1333338 lines, with the counts of lines and percentage: bongori | 133334 | 0.10000015 enterica | 666666 | 0.4000021 e.coli | 533338 | 0.49999775

So while not exact, my readmappings file does fit closer to the abundance I put in the abundance.tsv file, then what I would expact from making these caclulations myself. Is there a better way to check the fraction of reads simulated besides grep -c on the readmappings.tsv?

AlphaSquad commented 3 years ago

I think grep -c on the readmappings file should be fine. I tried the same on a few of my datasets and the results were as I expected it (i.e. the larger genomes having far more reads than the smaller ones, even with lower abundance). Which read simulator did you use?

NienkeMekkes commented 3 years ago

wgsim, so error free simulation. Could this cause the behaviour?

AlphaSquad commented 3 years ago

I will have to look into it, but I have the strong suspicion that this is a bug in CAMISIM when using wgsim. Could you run the same data set using ART and look if the distribution of reads is as you (and I :wink: ) would expect?

NienkeMekkes commented 3 years ago

sure!

NienkeMekkes commented 3 years ago

Hi again,

I ran the simulation again with:

readsim=tools/art_illumina-2.3.6/art_illumina error_profiles=tools/art_illumina-2.3.6/profiles samtools=tools/samtools-1.3/samtools profile=mbarc size=0.1 type=art fragments_size_mean=270 fragment_size_standard_deviation=27

This resulted in:

  grep -c % grep -c count expected
S bongori 0.094 62884 ~0.09
E coli 0.490 326554 ~0.49
S enterica 0.416 277222 ~0.42
    666660  

The expected row is what I (and you :) )expected my distribution to look like. It is a very good match when using art! For error free simulation, I think I can still use my simulated reads (my only goal is to use them together with kraken/bracken, and check if I can find the abundance simulated with bracken), since the simulated numbers are what I expected.

AlphaSquad commented 3 years ago

Okay, that is good to hear. It is still a bug/inconsistency for wgsim (and likely for Nanosim as well) which I will fix. If you can use your data regardless that is great.