Open danfulop opened 6 years ago
You have to make it yourself from the SNP_Map.txt I think you will soon find that the genotype data is not properly read into the software, at least thats my experience.
Marker information for the MUGA platforms is available here:
I cannot say much about why data import may have failed in some cases. The manner of failure is informative though. That is, if an error occurs and the process does not complete, that signals one sort of problem. If the process completes but the genotypes (as read into R
) are meaningless, that's a different problem and may reflect an issue with the allele specification in the marker map.
The code in argyle
does make certain assumptions about the format of the Illumina BeadStudio output that our vendor returns in a file called *FinalReport.zip (or similar). But there are many was to export data from BeadStudio, and the vendor may have changed what they provide in some instances. It's not an ideal situation from a usability and stability perspective.
The required six columns are detailed in the documentation (see ?read.beadstudio
), and as far as I know the column order should not matter.
Thanks for the quick response. It appears I made the snps file wrong and that was the culprit. Using the file from the site fixed the issues
@Ehiole How can one get the A1
and A2
columns for the marker map, using the columns SNP
, ILMN Strand
and Customer Strand
from the SNP_map.txt file?
I've just opened this related issue: https://github.com/andrewparkermorgan/argyle/issues/8.
I got the data out by first downloading the reference file from csbio.unc, extracting the 6 required columns and running the QC. See my code below, I can't guarantee this is what you need.
download.file("http://csbio.unc.edu/MUGA/snps.gigamuga.Rdata", "snps.gigamuga.Rdata", method= "curl")
snps <- as.data.table( get(load("snps.gigamuga.Rdata")) )
snps$chr <- gsub("chr", "", snps$chr) snps <- as.data.frame(snps) rownames(snps) <- snps$marker
install.packages('devtools') library(devtools)
setRepositories(ind = 1:2)
install_github("andrewparkermorgan/argyle")
library(argyle)
geno <- read.beadstudio(prefix= 'Washington_Universit_Akhirome_MURGIGV0120180426', snps[,1:6] )
geno <- run.sample.qc(geno) qcplot(geno)
Thanks a lot for sharing this. However, I'm using a different array, for a different organism (Bos Taurus, i.e. cows) and I couldn't find any place to download a ready-made marker map for the array that I'm using. I have the SNP_Map.txt file, but I am not sure how to generate the A1
and A2
alleles for the marker map for Argyle.
In an earlier post from here (https://github.com/andrewparkermorgan/argyle/issues/5#issuecomment-386136537), you wrote that the marker map should be generated from the SNP_Map.txt file.
The SNP_map.txt file looks like this:
Index Name Chromosome Position GenTrain Score SNP ILMN Strand Customer Strand NormID
1 ARS-BFGL-BAC-10919 14 31267746 0.7455 [A/G] TOP TOP 0
2 ARS-BFGL-BAC-10975 10 21225382 0.7042 [A/G] TOP TOP 0
3 ARS-BFGL-BAC-11000 10 79252023 0.8459 [T/G] BOT BOT 0
4 ARS-BFGL-BAC-11003 10 80410977 0.8801 [T/C] BOT BOT 0
5 ARS-BFGL-BAC-11025 10 84516867 0.856 [T/G] BOT BOT 0
6 ARS-BFGL-BAC-11044 1 12805406 0.8861 [T/C] BOT BOT 0
7 ARS-BFGL-BAC-11193 1 29303546 0.8123 [T/C] BOT TOP 0
8 ARS-BFGL-BAC-11215 12 90704572 0.7441 [A/G] TOP TOP 0
9 ARS-BFGL-BAC-11218 1 24549757 0.8716 [A/G] TOP TOP 0
10 ARS-BFGL-BAC-11276 13 2153905 0.8155 [T/C] BOT TOP 0
How should the SNP
column above be interpreted? I noticed that the alleles mentioned there are, sometimes, the reverse complement of the alleles that actually appear in the genotype values from the associated FinalReport.txt file.
Unfortunately the SNP
column in the SNP_Map.txt
can be hard to interpret. For reasons I do not completely understand, the vendor sometimes reports genotypes on the "plus" and sometimes on the "minus" strand. There are two strand columns in SNP_Map.txt
, but again I'm not sure what they mean. I have more details for mouse arrays, but not for other species.
If you already have some genotypes in hand, it's probably easiest to just look at what alleles are segregating at each marker. You may have to contact the array vendor to find out more.
Thank you for the reply and for reopening this issue. Any other potential clues you have from your experience with the mouse arrays could come in handy. I contacted the vendor and asked for some clarification, but haven't heard back from them so far.
If I may ask, do you know how the marker map was created for the file snps.gigamuga.Rdata
that @Ehiole linked above? Specifically, where do the A1
and A2
come from?
I tried to correlate the information from the SNP_MAP.txt
file with the stuff from the FinalReport.txt
file (the first 10 rows are below, corresponding to the 10 SNPs from my previous post above). However, I'm finding some inconsistencies.
SNP Name Sample ID Allele1 - Forward Allele2 - Forward Allele1 - Top Allele2 - Top Allele1 - AB Allele2 - AB GC Score X Y
ARS-BFGL-BAC-10919 60688 A G A G A B 0.7241 0.854 0.817
ARS-BFGL-BAC-10975 60688 A G A G A B 0.6473 0.905 0.681
ARS-BFGL-BAC-11000 60688 T T A A A A 0.8793 0.437 0.019
ARS-BFGL-BAC-11003 60688 T C A G A B 0.9175 0.549 0.571
ARS-BFGL-BAC-11025 60688 T T A A A A 0.8913 0.89 0.029
ARS-BFGL-BAC-11044 60688 T C A G A B 0.9234 0.526 0.553
ARS-BFGL-BAC-11193 60688 G G G G B B 0.834 0.016 0.414
ARS-BFGL-BAC-11215 60688 G G G G B B 0.7216 0.038 0.675
ARS-BFGL-BAC-11218 60688 A A A A A A 0.9087 1.249 0.049
ARS-BFGL-BAC-11276 60688 A G A G A B 0.8386 0.911 0.692
Looking at the 7th SNP (ARS-BFGL-BAC-11193), for example, I find the value [T/C]
on the SNP
column in the SNP_Map.txt
file from the post above, and then a couple of G
alleles in the XYZ_FinalReport.txt file. What could be the reason for this? A similar thing happens with the 10th SNP (ARS-BFGL-BAC-11276).
The only thing I notice that sets these two SNPs apart from the rest (among the 10 that I posted above) is that they have the value BOT
on the column ILMN Strand
and TOP
for the Customer Strand
in the SNP_Map.txt
file. All the other SNPs have either TOP
or BOT
on both columns.
What I'm trying to find out is, based on the information from the SNP
, ILMN Strand
and Customer Strand
columns of the SNP_Map.txt
file, how I can infer the reference (REF) and alternate (ALT) alleles, relative to the forward strand from the genome reference (which was Bos Taurus UMD3.1 for the SNP array that I used). I assume that these REF and ALT alleles should be used for the A1
and A2
values in the marker report for Argyle, is this correct?
I'm not sure if it is an option to simply use the alleles from the FinalReport.txt
file for this, because it is possible that all of the genotyped animals could have the REF allele, for example, and then I wouldn't be able to infer the ALT allele. I'm guessing there must be a consistent way to compute the A1
and A2
alleles for Argyle's marker map using the data from the SNP_Map.txt
file.
I'm only guessing here, but could it be that the A1
(REF) and A2
(ALT) alleles are the reverse complement of what is found in the SNP
column from the SNP_Map.txt
, when the values for the ILMN Strand
and Customer Strand
are different? Or could there be some other, similar simple rule to this?
I also came across this PDF file explaining how Illumina arrays designate the TOP
and BOT
alleles, but I didn't manage to correlate the info in there with the SNP
, ILMN Strand
and Customer Strand
columns from the SNP_Map.txt
file.
Update: I found a related post on a BioStars forum: https://www.biostars.org/p/92710.
Again, I'm noticing that all of the SNPs that the original poster mentioned are "flipped compared to the reference" have different TOP/BOT
values for the ILMN Strand
and Customer Strand
columns of the SNP_Map.txt
file.
For our mouse array(s), we know the SNPs we are targeting -- the position, alleles and expected genotypes for homozygous control individuals -- based on whole-genome sequencing. We designed the arrays so we also know the sequence of the invariant part of the probe oligo, and we can use the alignment of the probe sequence to the reference to figure out strand.
Part of the design process for us was to genotype a reasonably large collection of samples that cover all the alleles on the array. As a consequence we knew that any marker for which we didn't see both alleles did not work.
From a practical point of view, you just need a marker map in which A1
and A2
correspond to the alleles that the vendor reports. argyle
does not care about strand, or which of A1
and A2
correspond to REF and ALT. What matters is that A1
and A2
are the only alleles present in the FinalReport.txt
file for the corresponding marker. The purpose of enforcing this constraint is to be sure that genotypes and marker information don't get scrambled. Internally argyle
just treats everything as a biallelic site with 0, 1 or 2 copies of an arbitrarily-defined allele.
To integrate your array genotypes to other data (like from sequencing) you of course must get back to the reference strand. I am sorry that I don't have any more insight to offer about the ILMN Strand
and Customer Strand
columns, or what exactly TOP
and BOT
means with regard to strand. If you can find enough different cases in your data -- or in data from other organisms on the same array platform (Illumina Infinium) -- I would think you could eventually decode the rules.
Thanks for clarifying how to get things working in Argyle. I do intend to integrate the genotype data with other data, e.g. from sequencing, or to perform genotype imputation.
For the purpose of converting the data to Plink and then to VCF, do I first need to convert the A1
and A2
alleles from the marker map, as well as all the genotypes, to reflect the forward strand from the reference genome? If I figure out the conversion rules for each marker, would I be able to use Argyle to perform this?
When you convert the A1
and A2
reported for your array(s) to the reference forward strand is a matter of preference. I know that PLINK
has a number of utilities for detecting and correcting possible strand swaps between datasets, but I find them rather opaque and cumbersome. It is important to remember that PLINK
(at least prior to v2.0) would rearrange allele order (A1
vs A2
) or drop alleles not observed in the current dataset, unless specific command-line flags were used. Those issues can be hard to catch when using a binary format like *.bed
that is hard to inspect directly.
My own preference is to convert to disparate datasets to VCF, which has a nicer spec and can be inspected by eye, and then do the merging with bcftools
. Alleles can be changed manually with your scripting language of choice or even plain awk
or sed
. If you'll be working with whole-genome sequencing data at all, you'll probably want things to end up in VCF anyway. For phasing tools that want PLINK
binary input you can always convert back after everything is all tidied up.
The allele recoding can also be done within R. Once genotypes are loaded into the environment with argyle
and converted to the 0/1/2 encoding (via recode(x, "01")
), you can manually update the alleles in the marker map, and then export to PLINK
format. For your array genotypes this may be the easiest way to go, but I would not recommend trying to load eg. a large haplotype reference panel with argyle
.
I don't mean to disparage my own tool but it just isn't up to the task for anything "big data." Anything bigger than will comfortably fit in memory as a non-sparse matrix in R is likely to require something more heavy-duty. There are other R packages out there that do memory-backed matrices, chunked storage, etc. that can help you out there (eg. snpgds
).
Thanks, again, for the reply. I actually have more experience with sequencing data analysis and the VCF file format than with Illumina's microarrays, which is why I want to do this conversion in the first place. I'll check out argyle's recode
method and report here if I run into any trouble. Thanks also for the heads-up on PLINK
's behavior with A1
and A2
. I'm trying to focus on PLINK
versions 1.90 beta
and 2.00 alpha
as much as possible.
The microarray that we use has around 48k SNPs, so it should fit nicely in memory and I haven't had any trouble with argyle
so far. I'll also check out other R packages, if needed, so thanks a bunch for mentioning snpgds
as well.
I'm working on almost the same issue. Any progress here? @amizeranschi May I contact you directly to ask questions?
@smajidian I never solved the problem directly, so I can't really offer any help here, but I did manage to go around it by asking the genotyping lab to send us the results directly converted into the PLINK and VCF file formats.
In your vignette you state that: "Pre-computed [marker maps] for the MUGA series of arrays are available from [URL]."
What is the URL for downloading the marker map file?
Thanks!