hjddev closed this 3 months ago
It might not be the only mistake. I found rs118192099, which is also wrong. I believe there are more wrong SNPs.
Files from the same vendor differ depending on the chip version and even the date of file generation. Within 23andMe v5, the currently shipped v5 file has three entries for that coordinate:
i5050775 MT 3221 A
i706536 MT 3221 A
rs193303018 MT 3242 --
rs199474657 MT 3243 --
i5050984 MT 3243 --
i705046 MT 3243 --
rs199474662 MT 3251 A
i704734 MT 3254 C
The previous v5 file, which WGS Extract follows, has the following entries:
i5050775 MT 3221
i706536 MT 3221
rs193303018 MT 3243 <--- Is this an error that was fixed? Or is the probe for that coordinate?
i5050984 MT 3243
i705046 MT 3243
rs199474657 MT 3244 <--- ditto
rs199474662 MT 3252 <--- ditto
i5050965 MT 3252 <--- missing in current
i3002055 MT 3254 <--- missing in current
i704734 MT 3254
The 23andMe API version for this area was (no longer accessible):
chrM 3221 i5050775
chrM 3221 i706536
chrM 3223 rs28735476
chrM 3242 rs193303018 <--- similar to current v5 delivery
chrM 3243 i5050984
chrM 3243 i705046
chrM 3243 rs199474657 <--- ditto
chrM 3251 rs199474662 <--- ditto
chrM 3252 i5050965
chrM 3254 i3002055
chrM 3254 i704734
This may seem to imply that the previous v5 file, which WGSE uses as a template, is in error. But the API was developed and last accessible early in the v5 release.
We tend to deliver what the lab delivered, in the hope that tools that read these files expect what the lab delivers as well and, if they are smart enough, correct for it if they can (or ignore it otherwise).
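To make the comparison between the two v5 listings above concrete, here is a minimal Python sketch that reports which IDs changed position and which disappeared. The helper names `parse_rows` and `position_diffs` are made up for illustration and are not part of WGS Extract; the rows are copied from the listings in this thread, not read from real files.

```python
def parse_rows(text, id_col, pos_col):
    """Map probe ID -> position from whitespace-separated rows."""
    positions = {}
    for line in text.strip().splitlines():
        fields = line.split()
        positions[fields[id_col]] = int(fields[pos_col])
    return positions


def position_diffs(a, b):
    """Return {id: (pos_in_a, pos_in_b)} for IDs in both maps whose positions differ."""
    return {k: (a[k], b[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]}


# Rows copied from the listings above (columns: ID, chromosome, position[, genotype]).
CURRENT_V5 = """
i5050775 MT 3221 A
i706536 MT 3221 A
rs193303018 MT 3242 --
rs199474657 MT 3243 --
i5050984 MT 3243 --
i705046 MT 3243 --
rs199474662 MT 3251 A
i704734 MT 3254 C
"""

PREVIOUS_V5 = """
i5050775 MT 3221
i706536 MT 3221
rs193303018 MT 3243
i5050984 MT 3243
i705046 MT 3243
rs199474657 MT 3244
rs199474662 MT 3252
i5050965 MT 3252
i3002055 MT 3254
i704734 MT 3254
"""

current = parse_rows(CURRENT_V5, id_col=0, pos_col=2)
previous = parse_rows(PREVIOUS_V5, id_col=0, pos_col=2)

shifted = position_diffs(previous, current)        # IDs whose position changed
missing = sorted(previous.keys() - current.keys())  # IDs dropped from the current file

for probe_id, (old, new) in shifted.items():
    print(f"{probe_id}: {old} -> {new}")
print("missing in current:", ", ".join(missing))
```

On these rows, the three rsIDs flagged above each come out one base lower in the current file, and the two i-numbered probes marked as missing are reported as dropped. The API listing uses a different column order (chromosome, position, ID), so it would be parsed with `id_col=2, pos_col=1`.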
Sometimes it may be a mistake; sometimes later corrected. More often it is an issue with how they are defining the deliverable in the file versus what they are actually measuring with the probe. Sometimes the rsID itself changed, sometimes the definition of what the chip microarray probe is delivering changed, and sometimes both.
From the samples we have, it is not clear whether no-calls (--) are always delivered in that field. If so, we should really do the same or simply drop them from our template. We do not have enough samples to make that call.
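One way to test the no-call question, given more samples, is sketched below in Python. The function `always_no_call` is a hypothetical helper, not existing WGS Extract code, and it assumes each sample has already been loaded as a dict of probe ID to genotype string. The only sample shown is the current v5 listing from this thread, which is far too little to conclude anything.

```python
def always_no_call(samples, probe_ids, no_call="--"):
    """For each probe ID, report whether every sample that contains it has a no-call.

    A probe that appears in no sample is reported as False, since there is no evidence.
    """
    result = {}
    for probe_id in probe_ids:
        calls = [sample[probe_id] for sample in samples if probe_id in sample]
        result[probe_id] = bool(calls) and all(call == no_call for call in calls)
    return result


# Genotypes copied from the current v5 listing in this thread (a single sample).
sample_from_thread = {
    "i5050775": "A",
    "i706536": "A",
    "rs193303018": "--",
    "rs199474657": "--",
    "i5050984": "--",
    "i705046": "--",
    "rs199474662": "A",
    "i704734": "C",
}

report = always_no_call([sample_from_thread], sample_from_thread.keys())
for probe_id, flag in report.items():
    print(probe_id, "always no-call" if flag else "has calls")
```

Run over many files from different lab dates, a probe that stays flagged would be a candidate for either emitting `--` in the template or dropping it.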
Often these files were defined back in 2013 or earlier, when the chips were developed. Some rsIDs have been updated since then as refinements were made. Sometimes the rsID definition is changed and is only applicable to a later patched version of the reference model, which would require every tool to interpret the patch. Virtually no tools utilize patch updates to the models (and vendors in general do not map to the patched models). Patched models have not been converted from a pure reference model into an "analysis" model of the kind that tools like aligners and variant callers utilize.
We are considering delivering two files: one that is accurate to what the vendor delivers (or delivered), and one with the corrections we think are needed, so that it is accurate to what the BAM represents. Without a large enough sample over a long enough period of lab output, it is difficult to do the latter (and even the former in some instances). There is no website that defines the vendor microarray content. Most vendors seem to throw out questionable content if they discover it is not reliable.
The position of rs199474657 is MT:3243 (GRCh38) / MT:3243 (GRCh37). But in the data generated by the chip microarray test file generator of WGSExtract, rs199474657 points to MT:3244. This error occurs in the generated 23andMe v5 data. The AncestryDNA v2 data does not have this error. https://www.ncbi.nlm.nih.gov/snp/?term=rs199474657