andrewparkermorgan / argyle

An R package for import, QC and analysis of Illumina Infinium genotyping arrays
31 stars 10 forks source link

marker map file for gigamuga array #5

Open danfulop opened 6 years ago

danfulop commented 6 years ago

In your vignette you state that: "Pre-computed [marker maps] for the MUGA series of arrays are available from [URL]."

What is the URL for downloading the marker map file?

Thanks!

Ehiole commented 6 years ago

You have to make it yourself from the SNP_Map.txt I think you will soon find that the genotype data is not properly read into the software, at least thats my experience.

andrewparkermorgan commented 6 years ago

Marker information for the MUGA platforms is available here:

http://csbio.unc.edu/MUGA/

I cannot say much about why data import may have failed in some cases. The manner of failure is informative though. That is, if an error occurs and the process does not complete, that signals one sort of problem. If the process completes but the genotypes (as read into R) are meaningless, that's a different problem and may reflect an issue with the allele specification in the marker map.

The code in argyle does make certain assumptions about the format of the Illumina BeadStudio output that our vendor returns in a file called *FinalReport.zip (or similar). But there are many was to export data from BeadStudio, and the vendor may have changed what they provide in some instances. It's not an ideal situation from a usability and stability perspective.

The required six columns are detailed in the documentation (see ?read.beadstudio), and as far as I know the column order should not matter.

Ehiole commented 6 years ago

Thanks for the quick response. It appears I made the snps file wrong and that was the culprit. Using the file from the site fixed the issues

amizeranschi commented 5 years ago

@Ehiole How can one get the A1 and A2 columns for the marker map, using the columns SNP, ILMN Strand and Customer Strand from the SNP_map.txt file?

I've just opened this related issue: https://github.com/andrewparkermorgan/argyle/issues/8.

Ehiole commented 5 years ago

I got the data out by first downloading the reference file from csbio.unc, extracting the 6 required columns and running the QC. See my code below, I can't guarantee this is what you need.

Download from http://csbio.unc.edu/MUGA/

download.file("http://csbio.unc.edu/MUGA/snps.gigamuga.Rdata", "snps.gigamuga.Rdata", method= "curl")

Load Marker map data into R

snps <- as.data.table( get(load("snps.gigamuga.Rdata")) )

Remove chr prefix and rename columns (can't remember if this is required...)

snps$chr <- gsub("chr", "", snps$chr) snps <- as.data.frame(snps) rownames(snps) <- snps$marker

Install Argyle package for QC and load

install.packages('devtools') library(devtools)

allow R to look for pacakges in both CRAN and Bioconductor

setRepositories(ind = 1:2)

install argyle from Github source

install_github("andrewparkermorgan/argyle")

library(argyle)

Read in your data and the Marker map data first 6 columns

geno <- read.beadstudio(prefix= 'Washington_Universit_Akhirome_MURGIGV0120180426', snps[,1:6] )

Do QC

geno <- run.sample.qc(geno) qcplot(geno)

amizeranschi commented 5 years ago

Thanks a lot for sharing this. However, I'm using a different array, for a different organism (Bos Taurus, i.e. cows) and I couldn't find any place to download a ready-made marker map for the array that I'm using. I have the SNP_Map.txt file, but I am not sure how to generate the A1 and A2 alleles for the marker map for Argyle.

In an earlier post from here (https://github.com/andrewparkermorgan/argyle/issues/5#issuecomment-386136537), you wrote that the marker map should be generated from the SNP_Map.txt file.

The SNP_map.txt file looks like this:

Index   Name    Chromosome  Position    GenTrain Score  SNP ILMN Strand Customer Strand NormID
1   ARS-BFGL-BAC-10919  14  31267746    0.7455  [A/G]   TOP TOP 0
2   ARS-BFGL-BAC-10975  10  21225382    0.7042  [A/G]   TOP TOP 0
3   ARS-BFGL-BAC-11000  10  79252023    0.8459  [T/G]   BOT BOT 0
4   ARS-BFGL-BAC-11003  10  80410977    0.8801  [T/C]   BOT BOT 0
5   ARS-BFGL-BAC-11025  10  84516867    0.856   [T/G]   BOT BOT 0
6   ARS-BFGL-BAC-11044  1   12805406    0.8861  [T/C]   BOT BOT 0
7   ARS-BFGL-BAC-11193  1   29303546    0.8123  [T/C]   BOT TOP 0
8   ARS-BFGL-BAC-11215  12  90704572    0.7441  [A/G]   TOP TOP 0
9   ARS-BFGL-BAC-11218  1   24549757    0.8716  [A/G]   TOP TOP 0
10  ARS-BFGL-BAC-11276  13  2153905     0.8155  [T/C]   BOT     TOP     0

How should the SNP column above be interpreted? I noticed that the alleles mentioned there are, sometimes, the reverse complement of the alleles that actually appear in the genotype values from the associated FinalReport.txt file.

andrewparkermorgan commented 5 years ago

Unfortunately the SNP column in the SNP_Map.txt can be hard to interpret. For reasons I do not completely understand, the vendor sometimes reports genotypes on the "plus" and sometimes on the "minus" strand. There are two strand columns in SNP_Map.txt, but again I'm not sure what they mean. I have more details for mouse arrays, but not for other species.

If you already have some genotypes in hand, it's probably easiest to just look at what alleles are segregating at each marker. You may have to contact the array vendor to find out more.

amizeranschi commented 5 years ago

Thank you for the reply and for reopening this issue. Any other potential clues you have from your experience with the mouse arrays could come in handy. I contacted the vendor and asked for some clarification, but haven't heard back from them so far.

If I may ask, do you know how the marker map was created for the file snps.gigamuga.Rdata that @Ehiole linked above? Specifically, where do the A1 and A2 come from?

I tried to correlate the information from the SNP_MAP.txt file with the stuff from the FinalReport.txt file (the first 10 rows are below, corresponding to the 10 SNPs from my previous post above). However, I'm finding some inconsistencies.

SNP Name    Sample ID   Allele1 - Forward   Allele2 - Forward   Allele1 - Top   Allele2 - Top   Allele1 - AB    Allele2 - AB    GC Score    X   Y
ARS-BFGL-BAC-10919  60688   A   G   A   G   A   B   0.7241  0.854   0.817
ARS-BFGL-BAC-10975  60688   A   G   A   G   A   B   0.6473  0.905   0.681
ARS-BFGL-BAC-11000  60688   T   T   A   A   A   A   0.8793  0.437   0.019
ARS-BFGL-BAC-11003  60688   T   C   A   G   A   B   0.9175  0.549   0.571
ARS-BFGL-BAC-11025  60688   T   T   A   A   A   A   0.8913  0.89    0.029
ARS-BFGL-BAC-11044  60688   T   C   A   G   A   B   0.9234  0.526   0.553
ARS-BFGL-BAC-11193  60688   G   G   G   G   B   B   0.834   0.016   0.414
ARS-BFGL-BAC-11215  60688   G   G   G   G   B   B   0.7216  0.038   0.675
ARS-BFGL-BAC-11218  60688   A   A   A   A   A   A   0.9087  1.249   0.049
ARS-BFGL-BAC-11276  60688   A   G   A   G   A   B   0.8386  0.911   0.692

Looking at the 7th SNP (ARS-BFGL-BAC-11193), for example, I find the value [T/C] on the SNP column in the SNP_Map.txt file from the post above, and then a couple of G alleles in the XYZ_FinalReport.txt file. What could be the reason for this? A similar thing happens with the 10th SNP (ARS-BFGL-BAC-11276).

The only thing I notice that sets these two SNPs apart from the rest (among the 10 that I posted above) is that they have the value BOT on the column ILMN Strand and TOP for the Customer Strand in the SNP_Map.txt file. All the other SNPs have either TOP or BOT on both columns.

What I'm trying to find out is, based on the information from the SNP, ILMN Strand and Customer Strand columns of the SNP_Map.txt file, how I can infer the reference (REF) and alternate (ALT) alleles, relative to the forward strand from the genome reference (which was Bos Taurus UMD3.1 for the SNP array that I used). I assume that these REF and ALT alleles should be used for the A1 and A2 values in the marker report for Argyle, is this correct?

I'm not sure if it is an option to simply use the alleles from the FinalReport.txt file for this, because it is possible that all of the genotyped animals could have the REF allele, for example, and then I wouldn't be able to infer the ALT allele. I'm guessing there must be a consistent way to compute the A1 and A2 alleles for Argyle's marker map using the data from the SNP_Map.txt file.

I'm only guessing here, but could it be that the A1 (REF) and A2 (ALT) alleles are the reverse complement of what is found in the SNP column from the SNP_Map.txt, when the values for the ILMN Strand and Customer Strand are different? Or could there be some other, similar simple rule to this?

I also came across this PDF file explaining how Illumina arrays designate the TOP and BOT alleles, but I didn't manage to correlate the info in there with the SNP, ILMN Strand and Customer Strand columns from the SNP_Map.txt file.

Update: I found a related post on a BioStars forum: https://www.biostars.org/p/92710.

Again, I'm noticing that all of the SNPs that the original poster mentioned are "flipped compared to the reference" have different TOP/BOT values for the ILMN Strand and Customer Strand columns of the SNP_Map.txt file.

andrewparkermorgan commented 5 years ago

For our mouse array(s), we know the SNPs we are targeting -- the position, alleles and expected genotypes for homozygous control individuals -- based on whole-genome sequencing. We designed the arrays so we also know the sequence of the invariant part of the probe oligo, and we can use the alignment of the probe sequence to the reference to figure out strand.

Part of the design process for us was to genotype a reasonably large collection of samples that cover all the alleles on the array. As a consequence we knew that any marker for which we didn't see both alleles did not work.

From a practical point of view, you just need a marker map in which A1 and A2 correspond to the alleles that the vendor reports. argyle does not care about strand, or which of A1 and A2 correspond to REF and ALT. What matters is that A1 and A2 are the only alleles present in the FinalReport.txt file for the corresponding marker. The purpose of enforcing this constraint is to be sure that genotypes and marker information don't get scrambled. Internally argyle just treats everything as a biallelic site with 0, 1 or 2 copies of an arbitrarily-defined allele.

To integrate your array genotypes to other data (like from sequencing) you of course must get back to the reference strand. I am sorry that I don't have any more insight to offer about the ILMN Strand and Customer Strand columns, or what exactly TOP and BOT means with regard to strand. If you can find enough different cases in your data -- or in data from other organisms on the same array platform (Illumina Infinium) -- I would think you could eventually decode the rules.

amizeranschi commented 5 years ago

Thanks for clarifying how to get things working in Argyle. I do intend to integrate the genotype data with other data, e.g. from sequencing, or to perform genotype imputation.

For the purpose of converting the data to Plink and then to VCF, do I first need to convert the A1 and A2 alleles from the marker map, as well as all the genotypes, to reflect the forward strand from the reference genome? If I figure out the conversion rules for each marker, would I be able to use Argyle to perform this?

andrewparkermorgan commented 5 years ago

When you convert the A1 and A2 reported for your array(s) to the reference forward strand is a matter of preference. I know that PLINK has a number of utilities for detecting and correcting possible strand swaps between datasets, but I find them rather opaque and cumbersome. It is important to remember that PLINK (at least prior to v2.0) would rearrange allele order (A1 vs A2) or drop alleles not observed in the current dataset, unless specific command-line flags were used. Those issues can be hard to catch when using a binary format like *.bed that is hard to inspect directly.

My own preference is to convert to disparate datasets to VCF, which has a nicer spec and can be inspected by eye, and then do the merging with bcftools. Alleles can be changed manually with your scripting language of choice or even plain awk or sed. If you'll be working with whole-genome sequencing data at all, you'll probably want things to end up in VCF anyway. For phasing tools that want PLINK binary input you can always convert back after everything is all tidied up.

The allele recoding can also be done within R. Once genotypes are loaded into the environment with argyle and converted to the 0/1/2 encoding (via recode(x, "01")), you can manually update the alleles in the marker map, and then export to PLINK format. For your array genotypes this may be the easiest way to go, but I would not recommend trying to load eg. a large haplotype reference panel with argyle.

I don't mean to disparage my own tool but it just isn't up to the task for anything "big data." Anything bigger than will comfortably fit in memory as a non-sparse matrix in R is likely to require something more heavy-duty. There are other R packages out there that do memory-backed matrices, chunked storage, etc. that can help you out there (eg. snpgds).

amizeranschi commented 5 years ago

Thanks, again, for the reply. I actually have more experience with sequencing data analysis and the VCF file format than with Illumina's microarrays, which is why I want to do this conversion in the first place. I'll check out argyle's recode method and report here if I run into any trouble. Thanks also for the heads-up on PLINK's behavior with A1 and A2. I'm trying to focus on PLINK versions 1.90 beta and 2.00 alpha as much as possible.

The microarray that we use has around 48k SNPs, so it should fit nicely in memory and I haven't had any trouble with argyle so far. I'll also check out other R packages, if needed, so thanks a bunch for mentioning snpgds as well.

sinamajidian commented 3 years ago

I'm working on almost the same issue. Any progress here? @amizeranschi May I contact you directly to ask questions?

amizeranschi commented 3 years ago

@smajidian I never solved the problem directly, so I can't really offer any help here, but I did manage to go around it by asking the genotyping lab to send us the results directly converted into the PLINK and VCF file formats.