hallucination? - GitHub Issues

erleyva commented 5 months ago

I've been looking into this for about 2 hours now and I'm reaching a knowledge plateau. However, what I see seems to point towards something not working right and producing a false positive.

While checking out Promethease, I was getting a positive 35x for Crohn's disease: rs2066847(C;C).

I wasn't able to find that SNP on gene.iobio.io, which is the platform that Nebula uses to explore mutations. So I checked the Mirror file produced by WGS Extract, and I see the SNP twice. Not only that, but it also has the old SNP rs5743293 twice, which was merged into rs2066847 according to https://www.snpedia.com/index.php/rs5743293.

I found it odd that the SNP shows up twice with 2 different codes, CC (Crohn's Disease) and GG (No problemo). So I'm still wondering, am I in trouble or not lol.

Then I went to explore the genome itself in IGV using the CRAM file, to be as accurate as I could. I went to the position where this mutation is supposed to be found, Chr16:50729867 (https://www.snpedia.com/index.php/Rs2066847). I see an interesting situation there: it's mostly GG, but there is a C and a T. I double-checked on the Nebula genome browser and, sure enough, same thing.

I'm thinking the expected result in this case would be to show only the GG variant. I'm also fairly new at this, so there could be something I didn't properly understand.

RandyHarr commented 5 months ago

(recreating) Thanks for your inquiry.

First, realize that most microarray files use build 37 reference genome coordinates. Your apparent capture of a portion of a file is in build 38. Not sure where you would have gotten that from. Not sure what you mean by "Mirror file produced by WGS Extract".

Microarray file content has been defined as far back as 2005. Depending on Illumina's release to a lab vendor and that lab vendor's updates, what is captured there may vary even if the actual content does not. Illumina does not even define exactly what probes return. So what labels (rsID), coordinates and actual alleles appear in a file can vary with time and vendor. Even in cases of recognizable errors, we tend to adhere to what the lab returns.

Variations can occur over time in how an rsID or coordinate is defined without there actually being a change to the probe or the read value that is reported by a lab. This can occur when the rsID definition is changed from reverse to forward read, when a definition of "alignment" (left vs. right) is made (especially for InDel probes), or simply when a found error is corrected. So the issue is trying to understand (when no clear definition exists) what the microarray chip probe is reading and how the lab is reporting it. Some entries even include a sequence (chromosome) ID of zero and/or coordinate of zero with a measured allele.

rs5743293 appears in 23andMe files (only). It appears with coordinate 50,763,781 in v3 and 50,763,782 in v5 and API. rs2066847 appears in 23andMe API, v3, v4; Ancestry v2; and FTDNA v3 files. It has coordinate 50,763,778 in all of these except 23andMe v4, where it is 50,763,779.

The CombinedKit we define simply tries to merge the results from all the vendors. And so duplicate coordinates and/or duplicate rsID entries may result from these differences among the various vendor files. Most likely, we should try to reduce this to a common or likely definition. But even a 23andMe v3 result, which comes from a lab using an actual microarray chip, has 27 duplicate coordinates that each have a unique label (rsID or similar).
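A minimal sketch of how one might check a microarray-format file for the kind of duplicates described above. It assumes the common tab-separated layout of rsID, chromosome, position and genotype with `#` comment lines. The sample rows are illustrative only: the coordinates come from this thread, but the genotype attached to each row and the label `example_probe_1` are made up for the example. The function names are hypothetical and not part of WGS Extract.

```python
from collections import defaultdict

# Illustrative rows only (tab-separated: rsid, chromosome, position, genotype).
# Coordinates are the build 37 values quoted in this thread; the genotypes per
# row and the label "example_probe_1" are assumptions for demonstration.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs2066847\t16\t50763778\tGG\n"
    "rs2066847\t16\t50763779\tCC\n"
    "rs5743293\t16\t50763781\tGG\n"
    "rs5743293\t16\t50763782\tGG\n"
    "example_probe_1\t16\t50763781\tGG\n"
)


def parse_rows(text):
    """Parse microarray-style text into (rsid, chrom, pos, genotype) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        rsid, chrom, pos, genotype = line.split("\t")
        rows.append((rsid, chrom, int(pos), genotype))
    return rows


def find_duplicates(rows):
    """Return (duplicate rsIDs, duplicate coordinates) as two dicts."""
    by_rsid = defaultdict(list)
    by_coord = defaultdict(list)
    for rsid, chrom, pos, genotype in rows:
        by_rsid[rsid].append((chrom, pos, genotype))
        by_coord[(chrom, pos)].append((rsid, genotype))
    dup_rsids = {k: v for k, v in by_rsid.items() if len(v) > 1}
    dup_coords = {k: v for k, v in by_coord.items() if len(v) > 1}
    return dup_rsids, dup_coords


if __name__ == "__main__":
    dup_rsids, dup_coords = find_duplicates(parse_rows(SAMPLE))
    for rsid, entries in dup_rsids.items():
        print("duplicate rsID", rsid, entries)
    for coord, entries in dup_coords.items():
        print("duplicate coordinate", coord, entries)
```

Running a check like this before uploading a file would at least show which labels or positions a downstream tool has to choose between.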

Some vendors that read microarray files are "intelligent": they adjust what they read for the apparent lab and version. Others are "dumb" in that they blindly read what they receive, sometimes taking only the first or last occurrence of a label/rsID, coordinate or similar. When the duplicates occur in the original vendor files, let alone the WGSE CombinedKit merger, it becomes difficult to decide what to include and how.
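To illustrate why the first-or-last choice matters, here is a small sketch of a naive reader that collapses duplicate rsIDs. The `keep` parameter and both function names are hypothetical, and the two input rows are an assumed example (the order in which CC and GG appear in a real file is not known from this thread). It is not how Promethease or any specific tool is known to behave.

```python
def load_genotypes(rows, keep="first"):
    """Collapse (rsid, genotype) pairs into a dict, keeping the first or last hit."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    result = {}
    for rsid, genotype in rows:
        if keep == "first" and rsid in result:
            continue  # a first-wins reader ignores later duplicates
        result[rsid] = genotype  # a last-wins reader overwrites earlier ones
    return result


def conflicting(rows):
    """Return rsIDs that appear with more than one distinct genotype."""
    seen = {}
    for rsid, genotype in rows:
        seen.setdefault(rsid, set()).add(genotype)
    return {rsid: sorted(g) for rsid, g in seen.items() if len(g) > 1}


if __name__ == "__main__":
    # Assumed order for illustration only.
    rows = [("rs2066847", "GG"), ("rs2066847", "CC")]
    print(load_genotypes(rows, keep="first"))
    print(load_genotypes(rows, keep="last"))
    print(conflicting(rows))
```

The same file yields GG under one policy and CC under the other, which is the kind of disagreement that can surface as an apparent false positive in a report.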

erleyva commented 5 months ago

Thank you so much for looking into it. I appreciate the explanation on why duplicates happen and the different versions.

Yes, I meant Microarray when I wrote mirror.

I'm a little confused by the build 37 and build 38 piece. My CRAM references Build 38. There was a note in the manual saying this shouldn't be a problem.

Please correct me if I'm wrong here, but what I'm gathering is that the combined kit should still be the best microarray file to upload into Promethease.

RandyHarr commented 5 months ago

WGSE will call using the build 38 reference for a build 38 BAM/CRAM, and then do a liftover of the values back to build 37. So the coordinate positions given should be for build 37. Your image indicated build 38 coordinates in a microarray-format file; I was not sure how you got hold of that. CombinedKit is better for any site that can accept more than a basic microarray file result. It helps provide greater coverage of the genome and of the alleles that a tool may look for. But, as you discovered, it could introduce confusion if the different lab vendor versions have conflicting information. We should likely do a better job of weeding those out, as CombinedKit is an artificial microarray file not coming from any particular lab.
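One rough way to tell which build a microarray-format file is in is to look up a marker whose coordinate differs between builds. The sketch below uses only numbers mentioned in this thread: the build 37 coordinates reported for rs2066847 in vendor files, and the chr16 position 50729867 that was inspected in the build 38 CRAM. Treating that last value as the build 38 coordinate is an assumption, and `MARKER_POSITIONS` and `guess_build` are hypothetical names. This is a sanity check, not a substitute for a proper liftover.

```python
# Coordinates taken from this thread; the build 38 value is assumed to be the
# position that was viewed in IGV against the build 38 CRAM.
MARKER_POSITIONS = {
    "rs2066847": {
        "build37": {50763778, 50763779},
        "build38": {50729867},
    }
}


def guess_build(rows, markers=MARKER_POSITIONS):
    """Guess the reference build from (rsid, position) pairs.

    Returns "build37", "build38", "mixed" if markers disagree,
    or "unknown" if no marker position matches.
    """
    votes = set()
    for rsid, pos in rows:
        for build, positions in markers.get(rsid, {}).items():
            if pos in positions:
                votes.add(build)
    if len(votes) == 1:
        return next(iter(votes))
    return "mixed" if votes else "unknown"


if __name__ == "__main__":
    print(guess_build([("rs2066847", 50763778)]))
    print(guess_build([("rs2066847", 50729867)]))
```

A result of "build38" or "mixed" for a file that is supposed to be in build 37 would suggest the liftover step did not complete.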

erleyva commented 5 months ago

I see what you mean now. I think I see where those came from. The first time I ran the extract, something went wrong and either I had to force close WGSE, or it wasn't responding anymore, can't remember which one. However, the combined kit file did get created. That's the one I posted the screenshot of (inadvertently, as I meant to screenshot the second file, which I generated again and which didn't have any issues). I'd speculate that the conversion to build 37 wasn't done yet when I force closed the process. Now that I've opened the "good file", I see different positions, which are probably the build 37 ones.

WGSExtract / WGSExtract.github.io

hallucination? #24