CAMI-challenge / CAMISIM

CAMISIM: Simulating metagenomes and microbial communities
https://data.cami-challenge.org/participate
Apache License 2.0
150 stars 36 forks source link

MBARC Error Profile #72

Closed mruehlemann closed 4 years ago

mruehlemann commented 4 years ago

Hi,

can you point me to where the MBARC error profile for ART is coming from? I could not find any information on that.

In any way, I think it introduces way to high error rates that are more comparable to amplicon sequencing / metabarcoding (is this where the name M(eta)BARC(oding) is coming from?) than to actual shotgun metagenomes. Here the normal HiSeq2500 error profiles seem muche more appropriate!

Thank you! Malte

AlphaSquad commented 4 years ago

Hi Malte, thank you vor your interest in CAMISIM. The MBARC profile for ART was generated from the reads described in this paper: Next generation sequencing data of a defined microbial mock community, they state:

Sequencing of the flowcell was performed on the Illumina HiSeq2000 sequencer using a TruSeq SBS sequencing kit

Unfortunately there was a problem with the provided HiSeq2500 profiles of ART which showed that these reads also had an unrealistic error profile, showing a W-shape of errors over the length of the reads (many errors in the beginning, in the middle and in the end of the read), so we decided to use a published, "metagenomic" data set for training. If the HiSeq2500 profile seems more appropriate for your use-case, it should be possible to use this profile instead, or is that not working for you?

HenrikSpiegel commented 1 year ago

Hi @AlphaSquad, Can you expand a bit on how the mbarc error profile was trained and do you have some general statistics for the profile such as overall error-rate?

Thanks, Henrik

AlphaSquad commented 1 year ago

ART - described here and downloadable here - has a method to learn the error rate from a given data set. The MBARC error profile was trained from the data in the paper described in this issue previously. The error rate of the reads is ~1.3%.

HenrikSpiegel commented 1 year ago

Hi (@AlphaSquad) Adrian, Thanks for the answer, however, I have a follow up. The error rate you report, is that the empirical error rate given by the sequencing technology for the mbarc data or an estimate of the error rate of the simulated reads created by the CAMISIM software using the profile trained on the mbarc data.

The reason why I 'challenge' the 1.3% error rate is that I tried estimating the error profile from mapping back the generated reads and using samtools stats to determine the per base error rate. The experiment I set up was with 5 genomes and the config at the bottom. I ran wgsim compare my estimate with the set error rate, and then ART-mbarc.

As you can see the wgsim error rate is slightly underestimated but otherwise appears fine. The ART-mbarc error profile is however estimated at ~3% which is much higher than the 1.3%.

Error type Profile Sample size (GB) Mapped Error rate (samtools)
WGSIM 3% 0.2 2.86%
5% 0.3 4.65%
ART mbarc 0.1 3.06%
0.2 3.06%
0.3 3.06%
0.5 3.05%

And do you know how dependent the error rate is on the community (GC content, repeat regions etc.).

Thanks for your time.

config.ini

[Main]
seed=1991860661
phase=0
max_processors=29
gsa=True
pooled_gsa=True
anonymous=True
compress=1

[ReadSimulator]
readsim=/services/tools/camisim/1.3/tools/art_illumina-2.3.6/art_illumina
error_profiles=/services/tools/camisim/1.3/tools/art_illumina-2.3.6/profiles
samtools=/services/tools/camisim/1.3/tools/samtools-1.3/samtools
profile=mbarc
size=0.1
type=art
fragments_size_mean=270
fragment_size_standard_deviation=27

[community0]
genomes_total=5
num_real_genomes=5
max_strains_per_otu=1
ratio=1
mode=differential
log_mu=1
log_sigma=0.25
AlphaSquad commented 1 year ago

Hi Henrik, your estimate is probably more accurate then the value I reported and I should have explained where the 1.3% came from, namely by using a few bam-files produced by ART/CAMISIM and counting the mismatches in the CIGARs. Since your samtools stats error rate seems to be very consistent in different runs, I would assume that your calculation is correct, I am sorry for the confusion. I wouldn't expect the error rate to be dependent on the input genomes, at least I am not aware that ART differentiates there.