WGSExtract / WGSExtract-Dev

WGS Extract Developers Repository
GNU General Public License v3.0

Wrong data generated by chip microarray test file generator #15

Closed hjddev closed 3 months ago

hjddev commented 3 months ago

rs199474657 is located at MT:3243 in both GRCh38 and GRCh37. But in the data generated by WGSExtract's chip microarray test file generator, rs199474657 is placed at MT:3244. The error occurs in the generated 23andMe v5 data; the generated AncestryDNA v2 data does not have it. https://www.ncbi.nlm.nih.gov/snp/?term=rs199474657

hjddev commented 3 months ago

It might not be the only mistake. I found rs118192099, which is also wrong. I believe there are more wrong SNPs.

RandyHarr commented 3 months ago

There are differences between files from the same vendor, depending on the chip version and even the date the file was generated. Within 23andMe v5, the currently shipped file has three entries for that coordinate:

i5050775    MT  3221    A
i706536     MT  3221    A
rs193303018 MT  3242    --
rs199474657 MT  3243    --
i5050984    MT  3243    --
i705046     MT  3243    --
rs199474662 MT  3251    A
i704734     MT  3254    C

The previous v5 file, which WGS Extract follows, has the following entries:

i5050775    MT  3221
i706536     MT  3221
rs193303018 MT  3243             <--- Is this an error that was fixed? Or is the probe for that coordinate?
i5050984    MT  3243
i705046     MT  3243
rs199474657 MT  3244             <--- ditto
rs199474662 MT  3252             <--- ditto
i5050965    MT  3252             <--- missing in current
i3002055    MT  3254             <--- missing in current
i704734     MT  3254

The 23andMe API (no longer accessible) listed this region as:

chrM    3221    i5050775
chrM    3221    i706536
chrM    3223    rs28735476
chrM    3242    rs193303018    <--- similar to current v5 delivery
chrM    3243    i5050984
chrM    3243    i705046
chrM    3243    rs199474657     <--- ditto
chrM    3251    rs199474662     <--- ditto
chrM    3252    i5050965
chrM    3254    i3002055
chrM    3254    i704734

which may seem to imply that the previous v5 file WGSE uses as a template is in error. But the API was developed, and was last accessible, early in the v5 release.
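A comparison like the one above can be automated. The following is a minimal sketch, not WGS Extract code: it assumes each marker table is whitespace-separated text with the marker ID, chromosome, and position in the first three columns, and the names `parse_markers` and `compare_templates` are hypothetical. The rows are the excerpts quoted in this comment.

```python
# Sketch only: compare two marker listings and report which IDs changed
# position and which IDs exist in only one of the listings.

PREVIOUS_V5 = """\
i5050775    MT  3221
i706536     MT  3221
rs193303018 MT  3243
i5050984    MT  3243
i705046     MT  3243
rs199474657 MT  3244
rs199474662 MT  3252
i5050965    MT  3252
i3002055    MT  3254
i704734     MT  3254
"""

CURRENT_V5 = """\
i5050775    MT  3221    A
i706536     MT  3221    A
rs193303018 MT  3242    --
rs199474657 MT  3243    --
i5050984    MT  3243    --
i705046     MT  3243    --
rs199474662 MT  3251    A
i704734     MT  3254    C
"""


def parse_markers(text):
    """Return {marker_id: (chromosome, position)} for each data row."""
    markers = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, '#' header lines, and rows that are too short.
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        markers[fields[0]] = (fields[1], int(fields[2]))
    return markers


def compare_templates(old, new):
    """Return (moved, only_old, only_new) for two parsed marker maps."""
    moved = {m: (old[m], new[m]) for m in old if m in new and old[m] != new[m]}
    only_old = sorted(set(old) - set(new))
    only_new = sorted(set(new) - set(old))
    return moved, only_old, only_new


moved, only_old, only_new = compare_templates(
    parse_markers(PREVIOUS_V5), parse_markers(CURRENT_V5)
)
for marker, (before, after) in sorted(moved.items()):
    print(f"{marker}: {before[0]}:{before[1]} -> {after[0]}:{after[1]}")
print("missing in current:", only_old)
```

For these excerpts it reports the three rsIDs that shifted by one base and the two `i` markers that are absent from the current file. It says nothing about which of the two files is right.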

We tend to deliver what the lab delivered, in the hope that tools that depend on reading the files expect what the lab delivers as well. And if they are smart enough, they can correct for it where possible (or ignore it otherwise).

Sometimes it may be a mistake; sometimes later corrected. More often it is an issue with how they are defining the deliverable in the file versus what they are actually measuring with the probe. Sometimes the rsID itself changed, sometimes the definition of what the chip microarray probe is delivering changed, and sometimes both.

It is not clear from the samples we have whether no-calls are always delivered in that field. If so, we should do the same or simply drop those entries from our template. We do not have enough samples to make that call.
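With more samples, the no-call question could be checked mechanically. Below is a rough sketch under stated assumptions: each sample is text in the 23andMe raw-data layout (marker ID, chromosome, position, genotype, with `#` header lines) and a no-call is written as `--`. The function names are hypothetical. `SAMPLE_A` is the current v5 excerpt quoted earlier; `SAMPLE_B` is invented purely for illustration and is not real data.

```python
from collections import defaultdict

NO_CALL = "--"

SAMPLE_A = """\
i5050775    MT  3221    A
i706536     MT  3221    A
rs193303018 MT  3242    --
rs199474657 MT  3243    --
i5050984    MT  3243    --
i705046     MT  3243    --
rs199474662 MT  3251    A
i704734     MT  3254    C
"""

# Invented sample (assumption): identical except that i5050984 has a call.
SAMPLE_B = SAMPLE_A.replace("i5050984    MT  3243    --",
                            "i5050984    MT  3243    A")


def no_call_summary(samples):
    """Return {marker_id: (no_call_count, times_seen)} across all samples."""
    seen = defaultdict(int)
    no_calls = defaultdict(int)
    for text in samples:
        for line in text.splitlines():
            fields = line.split()
            # Skip blank lines, '#' header lines, and rows without a genotype.
            if len(fields) < 4 or fields[0].startswith("#"):
                continue
            marker, genotype = fields[0], fields[3]
            seen[marker] += 1
            if genotype == NO_CALL:
                no_calls[marker] += 1
    return {m: (no_calls[m], seen[m]) for m in seen}


def always_no_call(summary, min_samples=2):
    """Markers seen in at least min_samples samples and never called."""
    return sorted(m for m, (nc, n) in summary.items()
                  if n >= min_samples and nc == n)


summary = no_call_summary([SAMPLE_A, SAMPLE_B])
candidates = always_no_call(summary)
print(candidates)
```

Two samples are far too few to decide anything. The `min_samples` threshold exists so that a marker seen only once is never flagged.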

Often these files were defined back in 2013 or earlier, when the chips were developed. Some rsIDs have been updated since then as refinements were made. Sometimes the rsID definition is changed and applies only to a later patched version of the reference model, which would require all tools to interpret the patch. Virtually no tools utilize patch updates to the models (and vendors in general do not map to the patched models). Patched models have not been converted from a pure reference model into an "analysis" model of the kind that tools like aligners and variant callers utilize.

We are considering delivering two files: one that is accurate to what the vendor delivers (or delivered), and one that includes the corrections we think are needed and so is accurate to what the BAM represents. Without a large enough sample over a long enough period of lab output, it is difficult to do the latter (and even the former in some instances). There is no website that defines the vendor microarray content. Most vendors seem to throw out questionable content if they discover it is not reliable.
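As a rough illustration of the two-file idea (not an implemented feature), the sketch below renders the same template twice: once untouched and once with a position override table applied. Everything here is hypothetical, including the `render` function and the `CORRECTIONS` table. The overrides assume the currently shipped v5 positions are the right ones, which is exactly the open question above.

```python
# Template rows taken from the previous v5 excerpt: (marker, chromosome, position).
TEMPLATE = [
    ("i5050775", "MT", 3221),
    ("i706536", "MT", 3221),
    ("rs193303018", "MT", 3243),
    ("i5050984", "MT", 3243),
    ("i705046", "MT", 3243),
    ("rs199474657", "MT", 3244),
    ("rs199474662", "MT", 3252),
    ("i5050965", "MT", 3252),
    ("i3002055", "MT", 3254),
    ("i704734", "MT", 3254),
]

# Hypothetical override table (assumption): marker -> corrected position.
CORRECTIONS = {
    "rs193303018": 3242,
    "rs199474657": 3243,
    "rs199474662": 3251,
}


def render(rows, corrections=None):
    """Return tab-separated lines, applying position overrides if given."""
    corrections = corrections or {}
    lines = []
    for marker, chrom, pos in rows:
        pos = corrections.get(marker, pos)  # unchanged when no override exists
        lines.append("\t".join((marker, chrom, str(pos))))
    return lines


vendor_file = render(TEMPLATE)                  # faithful to the lab's file
corrected_file = render(TEMPLATE, CORRECTIONS)  # with overrides applied

for before, after in zip(vendor_file, corrected_file):
    if before != after:
        print(before, "=>", after)
```

Keeping the overrides in a separate table leaves the vendor-faithful template untouched, so the corrected file can be regenerated whenever the evidence changes. A real implementation would also need to re-sort rows whose corrected position changes their order.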