Closed: ronald-jaepel closed this issue 8 years ago
@Phelimb - any word on this? I'm surprised with 30x there are so many we do not detect at all. I took a look at the JSON output for one ( @ronald-jaepel - could you please add some example JSONS in a comment? Or add them to the original report above? ) - the one I looked at showed we only saw 8% of the kmers in the true allele, and this is with error-free reads.
two diagrams, same data, different scales, depicting the certainty of the calls:
Thanks @ronald-jaepel - very informative plots
All the mistakes are due to differences in the catalogs! Sorry about the confusion! Should have checked that earlier.
So, except for three genes, all faults were due to genes that are in Nicole's database but missing from the reference catalog used by atlas.
The three exceptions are:
4, which is identical in both datasets, appears to be the only true mistake made by atlas. (edit: this turned out to be a padding problem)
67 has a different sequence in our reference (with 16 SNPs) compared to Nicole's database. (In our reference catalogue, 67's sequence is the same as 69's sequence.)
94 is missing the last bases (AGTTAGTTTATAGCCCGGTTCACTTTTTACCA) in Nicole's database.
Sorry guys, I missed this when it was first posted.
Thanks for the update Ronald. We should have a consistent way of versioning the database - which one of Nicole's have you used for the above analysis? I should have told you that the catalog in atlas is an older one - sorry! It comes as an Excel spreadsheet, so it's a bit of a pain to convert, and I haven't updated it to the latest yet.
I have a script to convert here:
Though I think Nicole has sent around one recently that should be easier to parse so we might be better off starting from scratch writing a converter from spreadsheet -> fasta.
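A converter along these lines could be quite small. Here is a minimal sketch, assuming the spreadsheet has one allele per row with gene, version and sequence columns (the real layout of Nicole's spreadsheet may differ); the FASTA header style follows the gene?name=...&version=... naming atlas uses in its .covgs files:

```python
# Sketch of a spreadsheet -> fasta converter. Assumption: one allele per
# row with (gene, version, sequence) values; the actual column layout of
# the spreadsheet may differ.

def records_to_fasta(records):
    """Format (gene, version, sequence) tuples as FASTA, using the
    gene?name=gene&version=N header style seen in atlas .covgs files."""
    lines = []
    for gene, version, seq in records:
        lines.append(">%s?name=%s&version=%s" % (gene, gene, version))
        # wrap the sequence at 60 characters per line
        for i in range(0, len(seq), 60):
            lines.append(seq[i:i + 60])
    return "\n".join(lines) + "\n"

# Reading the .xlsx itself could be done with openpyxl, e.g.:
#   from openpyxl import load_workbook
#   ws = load_workbook("combined_resistance_database_09022016.xlsx").active
#   records = [(r[0].value, r[1].value, r[2].value)
#              for r in ws.iter_rows(min_row=2)]

print(records_to_fasta([("kpc", 5, "ATG" * 25)]))
```

The formatting function is kept separate from the spreadsheet-reading step so it can be reused once the column layout of the new spreadsheet is known.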
I used "combined_resistance_database_09022016.xlsx"
I think we need the spreadsheet-to-fasta conversion checked into GitHub, and then refer to that by changelist. Do we want a separate repo for tracking catalogs?
@iqbal-lab I'd rather separate it from the predictor and atlas codebases so I think a separate repo is a good idea.
agreed
So, apparently wgsim had a problem with the position of genes in the genome: I had previously only appended them at the end. When I repeat the experiment with OXA-4 at the very end of the genome, it gets diagnosed as OXA-33 over and over again. But if I copy it just a few lines up, so that more than 100 bases follow it, it gets called correctly.
Ah ok. So we might as well pad the genes before and after in future. Well spotted. So on error-free data 100% win for Atlas?
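Padding could be enforced when building the simulated genomes. A small sketch (function name and flank length are illustrative, not from any existing tool) that refuses insert positions leaving fewer than 100 trailing bases:

```python
# Sketch: insert a gene into a reference genome for simulation, ensuring
# at least `min_tail` genome bases follow it (the wgsim edge effect seen
# above). Names and the 100-base threshold are assumptions, not part of
# any existing tool.

def insert_with_flanks(genome, gene, pos, min_tail=100):
    """Insert `gene` into `genome` at `pos`, refusing positions that
    leave fewer than `min_tail` genome bases after the insert."""
    if len(genome) - pos < min_tail:
        raise ValueError("insert point leaves fewer than %d trailing bases" % min_tail)
    return genome[:pos] + gene + genome[pos:]

genome = "A" * 500            # toy stand-in for the reference genome
modified = insert_with_flanks(genome, "CGTACGT", pos=300)
print(len(modified))  # 507
```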
So, alignment of OXA-4 and OXA-33 shows that they are identical (i.e. have no SNPs or indels) except for OXA-33 missing 17 bases at the end and 45 bases at the start compared to OXA-4. Therefore, calling 4 as 33 is really just due to the missing bits at either end. http://www.ebi.ac.uk/Tools/services/rest/clustalo/result/clustalo-I20160415-155706-0969-80534339-oy/aln-clustal Mapping the reads generated by wgsim onto the reference genome also shows that the coverage drops from 3 to 1 reads per base for the last 24 bases. This screenshot shows the last 138 bp of OXA-4 appended to the reference genome and all reads that were mapped to it.
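The "identical except for the ends" relationship is easy to verify programmatically: if OXA-33 is OXA-4 minus 45 leading and 17 trailing bases, it must be an exact substring of OXA-4 at offset 45. A sketch with toy stand-in sequences (the real alleles are in the clustal link above):

```python
# Check whether one allele is an exact truncation of another, as described
# for OXA-33 vs OXA-4. Toy sequences below are placeholders, not the real
# allele sequences.

def is_truncation(longer, shorter, head, tail):
    """True if `shorter` equals `longer` with `head` bases cut from the
    start and `tail` bases cut from the end."""
    return (len(longer) == len(shorter) + head + tail
            and longer[head:len(longer) - tail] == shorter)

oxa4_toy = "A" * 45 + "CGTACGT" + "T" * 17   # hypothetical stand-in for OXA-4
oxa33_toy = "CGTACGT"                        # hypothetical stand-in for OXA-33

assert is_truncation(oxa4_toy, oxa33_toy, head=45, tail=17)
```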
Can you email Nicole and cc me/Phelim to see if this is a bug? Looks to me like an error in the allele database
I have more insight into the genes in the catalogue and two questions before I write to Nicole:
regarding OXA-4 vs OXA-33: Nicole's database just transferred the data from the accession number it cites. OXA-33 is cited from NCBI as an "OXA-1-like" gene, which Nicole calls OXA-33. (Source: http://www.ncbi.nlm.nih.gov/nuccore/AY008291 )
(Here's the alignment of OXA-4 vs OXA-33 taken from NCBI and Nicole's database http://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=clustalo-I20160418-135502-0628-18285893-oy (the TAA where OXA-33 ends isn't in frame and is therefore not a stop codon))
I'll ask her where she got the name OXA-33; is there anything else I should ask?
regarding OXA-94, which is shorter in Nicole's database: Her sequence stops after the stop codon, ours has a bit of 3' UTR included: http://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=clustalo-I20160418-133224-0060-23120442-pg
Should I write to her about that or does that just mean that our version is too long?
Thank you for working through this with me!
Sounds to me like OXA-33 doesn't really exist in any sense different from OXA-4, except in Nicole's database. (Agree @Phelimb?) Ask Nicole why we cannot delete it (cc me and Phelim).
Re: OXA-94 - I think if ours is identical up to stop codon, then we can just take our longer one.
I agree, and it looks like Nicole does too. I don't envy her trying to keep that database up to date - it's no easy task. Awesome job in debugging this @ronald-jaepel - it will make all this comparison stuff much easier if the genes in the panel are unique!
Ok, so now that OXA-33 can be deleted, we have resolved all errors that were made by Atlas genotype on the first batch, right?
I have new data on the other genes (ctxm, kpc, shv, tem, (oxa)): this time I used our reference catalogue to generate the simulated genomes, reads, etc. Almost ALL genes were called correctly, except for the following 5:
Errors that are due to genes having alternative names (even with their own entry in Nicole's database specifying the alternative names):
Errors that are due to the sequences being identical in our catalogue, although there exist different sequences in Nicole's database:
Errors that I can't figure out:
We definitely need some validation code to run on the Nicole db, and to check for new issues when the db is updated. What is "the other genes" - all genes? The big 5? Anyway, this is great news.
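The validation pass could start by catching exactly the two failure modes seen in this thread: identical sequences under different names (67 vs 69) and one entry being a truncation of another (OXA-33 vs OXA-4). A sketch, assuming the catalogue is available as (name, sequence) records:

```python
# Sketch of catalogue validation checks. Assumption: the catalogue has been
# loaded as a list of (name, sequence) tuples; names below are examples.

def find_duplicates(records):
    """Group names that share an identical sequence (e.g. 67 vs 69)."""
    by_seq = {}
    for name, seq in records:
        by_seq.setdefault(seq, []).append(name)
    return [names for names in by_seq.values() if len(names) > 1]

def find_truncations(records):
    """Find entries that are exact substrings of a longer entry
    (e.g. OXA-33 inside OXA-4). Returns (truncated, full-length) pairs."""
    hits = []
    for name_a, seq_a in records:
        for name_b, seq_b in records:
            if name_a != name_b and len(seq_b) < len(seq_a) and seq_b in seq_a:
                hits.append((name_b, name_a))
    return hits
```

The quadratic truncation scan is fine at catalogue scale (hundreds of alleles); a suffix-automaton approach would only be needed for much larger panels.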
Yes, sorry, forgot to mention that. "Other genes" are the big 5 for which alleles are present in our gn-amr-genes.fasta catalogue. (edited my previous comment to make things clearer)
With a 5% error rate during read generation, Atlas genotype made 6 errors (in addition to the 5 from above) out of the 704 genes in the reference catalogue:
(wgsim was set to a base error rate of 0.05, standard is 0.02)
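A 5% base error rate is quite punishing for a k-mer based genotyper: a k-mer only counts if all k of its bases are correct, which happens with probability (1 - e)^k. A quick sketch (k = 21 is an assumption here; substitute whatever k-mer size atlas was run with):

```python
# Expected fraction of error-free k-mers at per-base error rate e.
# A k-mer survives only if all k bases are read correctly, so the
# fraction is (1 - e)**k. k = 21 is an assumed k-mer size.

def error_free_kmer_fraction(e, k=21):
    return (1.0 - e) ** k

for e in (0.005, 0.02, 0.05):
    print("e=%.3f -> %.1f%% of k-mers error-free"
          % (e, 100 * error_free_kmer_fraction(e)))
```

At e = 0.05 only about a third of 21-mers are error-free, which is consistent with errors appearing at this setting but not at 0.5%.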
Interesting. It might be good to have a look at the coverages on the whole panel.
If you run with the --keep_tmp option then it doesn't remove the temporary files after completing. In the temp directory there will be files called :sample_id_:kmer_size_:panel_name.covgs. The columns are:
gene_name colour median_covg min_non_zero_covg percent_coverage
It would be interesting to see the rows for the "truth" call and the one predicted by atlas.
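Pulling those rows out of a .covgs file is a one-liner per line; a small parser sketch (column names as described above, the example line is taken from the data in this thread):

```python
# Parse one row of an atlas .covgs file into named fields.
# Columns: gene_name colour median_covg min_non_zero_covg percent_coverage

def parse_covgs_line(line):
    name, colour, median, min_nz, pct = line.split()
    return {"name": name,
            "colour": int(colour),
            "median_covg": int(median),
            "min_non_zero_covg": int(min_nz),
            "percent_coverage": float(pct)}

row = parse_covgs_line("kpc?name=kpc&version=5 0 5 1 0.986063")
print(row["median_covg"], row["percent_coverage"])
```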
First line is the "truth", second line is the prediction by atlas (tmp_call_info.txt):
kpc?name=kpc&version=5 0 5 1 0.986063
kpc?name=kpc&version=2 0 5 1 0.986063
oxa?name=oxa&version=248 0 5 1 0.980099
oxa?name=oxa&version=69 0 5 1 0.980099
shv?name=shv&version=144 0 6 1 0.994048
shv?name=shv&version=38 0 6 1 0.994048
shv?name=shv&version=141 0 5 1 0.990476
shv?name=shv&version=163 0 5 1 0.991637
tem?name=tem&version=68 0 6 1 0.985714
tem?name=tem&version=47 0 6 1 0.997619
tem?name=tem&version=122 0 5 1 0.978571
tem?name=tem&version=163 0 5 1 0.978571
tem?name=tem&version=177 0 4 1 0.704762
tem?name=tem&version=191 0 5 1 0.881793
So you have median coverage between 1 and 5 and PERCENT coverage <1% ?
I thought that was partial coverage, so percentage coverage of e.g. 98%, because I never saw a value above 1.
What I found today is that atlas walk does better than atlas genotype. It makes only 3 errors out of the 7 listed above (including tem117 -> tem191) when run with one of the two files of paired-end reads. It currently crashes when I try to run it with both files, due to memory limitations. Also, the mistakes change depending on which file of the paired-end reads was used.
Paired_end_1 as single read: Paired_end_2 as single read: For this run I also have the covgs info:
kpc?name=kpc&version=5 0 1 1 0.676851
shv?name=shv&version=141 0 1 1 0.749398
shv?name=shv&version=163 0 1 1 0.752116
tem?name=tem&version=68 0 1 1 0.681928
tem?name=tem&version=47 0 1 1 0.712048
tem?name=tem&version=122 0 1 1 0.680723
tem?name=tem&version=192 0 1 1 0.702628
Yes, that's 99% not 0.99%.
Interesting that one of the problems seems to be that we fall back on the first version if two versions are indistinguishable.
Happy and slightly surprised to see atlas walk doing so well - it's still a very rough algorithm!
What do you mean by "which part of the paired-end-reads"? are you not running with both read sets together? I'd recommend doing this.
Mistaking TEM122 for TEM192 is odd given they have only 87.5% sequence identity.
Ping @Phelimb - outstanding genotyping bug here with errors which are a long way from the truth
Mistaking TEM117 for TEM192 (with 87.5% identity) couldn't be reproduced when the coverage was raised to 50x with 0.5% errors (instead of 5%) and trailing N's were removed. It might have been a coverage issue.
Yeah, it might also just be a result of that particular simulation, i.e. low coverage on that particular gene by chance. With 20x and 0.5% errors this isn't implausible.
I think 50x is a reasonable coverage for these simulations. Most of the data we have are >50x and very few are < 20x. We should be able to do well for low coverage samples soon but I think we should debug issues at higher coverage first. @iqbal-lab do you agree?
I think there are two separate issues.
Item 2 I will raise a separate issue about - now is a good time to refresh and improve our approach to coverage and mixture, and we should have a probabilistic output we can look at with confidence. I'll raise an issue with a proposal.
This simulation is intended to
So I think this issue is OK to close.
@ronald-jaepel this bug is closed. Are there any examples of Atlas walk wrongly finding an allele using a sensible error rate and coverage? I was under the impression there were examples, but this bug was closed because it was not reproducible with error rate 0.5% and coverage 50x. @Phelimb
Screened all versions of OXA by inserting them into the reference genome ecoli_K12_MG1655_ref and generating error-free paired-end reads with wgsim. The simulated reads were then genotyped with atlas; the summary is in the table below. It is sorted to show first the versions that were not found at all, then those that were found but whose version was estimated incorrectly, and lastly those that were found and typed correctly. atlas_OXA_detection.xlsx
Not found were: 184, 185, 266, 269, 270, 271, 272, 273, 274, 275, 277, 279, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 302, 303, 304, 305, 306, 307, 308, 421
ecoli_oxa2.txt (identified correctly) ecoli_oxa4.txt (found but identified wrong) ecoli_oxa184.txt (not found)
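The three-way grouping used in the summary table (not found / wrong version / correct) could be produced automatically from the calls; a sketch with hypothetical example data, not the real results:

```python
# Bucket simulated-vs-called alleles into the three categories used in the
# summary table. The `example` dict below is illustrative data only.

def classify(calls):
    """`calls` maps each simulated (truth) allele to the allele atlas
    called, or None if nothing was found."""
    buckets = {"not_found": [], "wrong_version": [], "correct": []}
    for truth, called in calls.items():
        if called is None:
            buckets["not_found"].append(truth)
        elif called == truth:
            buckets["correct"].append(truth)
        else:
            buckets["wrong_version"].append(truth)
    return buckets

example = {"oxa-2": "oxa-2", "oxa-4": "oxa-33", "oxa-184": None}
print(classify(example))
```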