R function to tell what bpm manifest was used to create an idat

ekarlins commented 7 years ago

@ngiangre, is it possible to write a function in R that will take an idat file as input and determine what array and version, or what bpm manifest was used to generate the idat? This could be really useful to see which samples can be combined for CNV calling. Our pipeline assumes that all idats will work with the same bpm manifest. It would be good to have a way to check this up front.

ngiangre commented 7 years ago

I am able to read in idats and retrieve:

    names(idats$`201274980046_R04C02_Grn.idat`)
    [1] "fileSize" "versionNumber" "nFields" "fields" "nSNPsRead" "Quants"
    [7] "MidBlock" "RedGreen" "Barcode" "ChipType" "RunInfo" "Unknowns"

I don't think it tells me the exact bpm manifest, but it has other useful info. Do you want me to make a metadata file?

ekarlins commented 7 years ago

Show what these fields contain please.

ngiangre commented 7 years ago

check files/output/idat_metadata.txt

ekarlins commented 7 years ago

@ngiangre, I don't see anything in files/output/idat_metadata.txt that seems to represent the array or manifest that we can use. Which field do you think is unique to this?

ngiangre commented 7 years ago

Unique to the array? I'm not sure. These are the fields that are available. There isn't any other function that extracts other idat info.

ekarlins commented 7 years ago

While making the "dictionary" for the R package "gsrc", I discovered that when you read idat files into R using "illuminaio", the SNP names match the "AddressA_ID" column in the Illumina manifest. I'm wondering if this is the best way to tell what manifest was used for each idat.
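A minimal sketch of that comparison in R. The function names are hypothetical. It assumes the IDs are the row names of the "Quants" field returned by illuminaio's readIDAT() (one of the fields listed earlier in this thread), and that the "AddressA_ID" column has already been loaded as a character vector, for example from the CSV version of the manifest.

```r
# Hypothetical helper: pull the SNP names out of one idat file.
# Assumes the illuminaio package is installed and that the IDs are the
# row names of the Quants matrix (an assumption, not verified here).
idat_snp_names <- function(path) {
  idat <- illuminaio::readIDAT(path)
  rownames(idat$Quants)
}

# Compare idat SNP names with a manifest's AddressA_ID values.
# Read AddressA_ID as character: as.character() on a numeric column can
# produce strings like "1e+05" that will never match.
compare_to_manifest <- function(idat_ids, address_a_ids) {
  idat_ids <- unique(as.character(idat_ids))
  address_a_ids <- unique(as.character(address_a_ids))
  list(
    n_idat        = length(idat_ids),
    n_manifest    = length(address_a_ids),
    n_shared      = length(intersect(idat_ids, address_a_ids)),
    idat_only     = setdiff(idat_ids, address_a_ids),
    manifest_only = setdiff(address_a_ids, idat_ids)
  )
}

# Toy IDs (made up) so the comparison can be run without any idat file
res <- compare_to_manifest(
  idat_ids      = c("10001", "10002", "10003"),
  address_a_ids = c("10001", "10002", "10004")
)
```

With real data, `idat_ids` would come from `idat_snp_names()`. An idat matches a manifest exactly when both `idat_only` and `manifest_only` are empty.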

It could be really useful in Scan2CNV to have an upfront test that an idat matches the manifest before starting the rest of the pipeline. It may also be useful for users to have a way of taking old data that they have with unknown manifests and matching to existing manifests.

I'm also wondering whether, if only some of the IDs in the idat match the "AddressA_ID" column in the Illumina manifest, moving forward with the SNPs that match is appropriate. Maybe we should investigate what overlap, if any, there is of "AddressA_ID" across manifests.
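One way to look at overlap across manifests would be a pairwise table of shared IDs. This is only a sketch: `manifest_overlap` and the manifest names are hypothetical, the IDs are toy values, and the Jaccard index (shared IDs divided by all IDs in either manifest) is my choice of measure, not something we have settled on.

```r
# Pairwise Jaccard overlap of AddressA_ID sets.
# `manifests` is a named list of character vectors, one per manifest.
manifest_overlap <- function(manifests) {
  sets <- lapply(manifests, function(x) unique(as.character(x)))
  n <- length(sets)
  out <- matrix(NA_real_, n, n, dimnames = list(names(sets), names(sets)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      shared <- length(intersect(sets[[i]], sets[[j]]))
      total  <- length(union(sets[[i]], sets[[j]]))
      # Leave NA when both sets are empty
      out[i, j] <- if (total == 0) NA_real_ else shared / total
    }
  }
  out
}

# Toy manifests (made-up names and IDs)
toy_manifests <- list(
  chipA_v1 = c("1", "2", "3", "4"),
  chipA_v2 = c("1", "2", "3", "5"),
  chipB    = c("7", "8")
)
overlap <- manifest_overlap(toy_manifests)
```

Values near 1 between two versions of the same chip and near 0 between different chips would suggest that "AddressA_ID" can separate chips but may not separate versions cleanly.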

For our purposes we may also want to check that all samples have the exact same SNPs in the idat files. Even if 90% of the SNPs are the same between samples, I don't think combining these samples is appropriate. There are probe-to-probe interactions that can change the intensity data for a given probe depending on what other probes are on the array with it.
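A sketch of such a check, assuming the SNP names for each sample have already been read into a named list of character vectors. `stop_if_probes_differ` is a hypothetical name; the first sample is used as the reference and the order of the IDs is ignored.

```r
# Stop with an error unless every sample has exactly the same set of
# SNP names as the first sample. `ids_by_sample` is a named list of
# character vectors, one per sample.
stop_if_probes_differ <- function(ids_by_sample) {
  if (length(ids_by_sample) < 2) return(invisible(TRUE))
  reference <- sort(unique(ids_by_sample[[1]]))
  same <- vapply(
    ids_by_sample,
    function(ids) identical(sort(unique(ids)), reference),
    logical(1),
    USE.NAMES = FALSE
  )
  if (!all(same)) {
    stop(
      "Samples with different SNP names than ", names(ids_by_sample)[1], ": ",
      paste(names(ids_by_sample)[!same], collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Toy samples (made up): same IDs in a different order pass the check
ok <- stop_if_probes_differ(list(
  s1 = c("1", "2", "3"),
  s2 = c("3", "2", "1")
))
```

Requiring identical sets, rather than a minimum overlap, is deliberate here given the probe-to-probe interaction concern.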

Points for follow up:

1) Every sample included in a run of Scan2CNV must have the exact same SNP names when reading the idat with R, else throw an ERROR and exit.

2) If there is not one-to-one matching of SNPs in idats and SNPs in manifest, throw a WARNING and proceed.

3) Investigate overlap of "AddressA_ID" column across manifests, both from different SNP chips and different versions of the same SNP chip. Work on a separate function to try to match idat files to the closest manifest.
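Points 2 and 3 could start from something like the sketch below. Both function names are hypothetical, the inputs are assumed to be character vectors of IDs that have already been loaded, and "closest" is taken to mean the manifest containing the largest fraction of the idat's SNP names, which is only one possible definition.

```r
# Point 2: warn, but continue, when idat SNP names and manifest
# AddressA_ID values are not the same set. Returns TRUE on an exact match.
warn_if_not_one_to_one <- function(idat_ids, address_a_ids) {
  matched <- setequal(idat_ids, address_a_ids)
  if (!matched) {
    warning(
      "idat SNP names and manifest AddressA_ID values do not match one to one",
      call. = FALSE
    )
  }
  invisible(matched)
}

# Point 3: rank candidate manifests by the fraction of idat SNP names
# found in each. `manifests` is a named list of character vectors.
rank_manifests <- function(idat_ids, manifests) {
  idat_ids <- unique(idat_ids)
  frac <- vapply(manifests, function(m) mean(idat_ids %in% m), numeric(1))
  sort(frac, decreasing = TRUE)
}

# Toy data (made-up manifest names and IDs)
toy_manifests <- list(
  chipA_v1 = c("1", "2", "3", "4"),
  chipA_v2 = c("1", "2", "3", "5"),
  chipB    = c("7", "8")
)
ranked <- rank_manifests(c("1", "2", "3", "5"), toy_manifests)
best <- names(ranked)[1]
```

In the toy data the best-ranked manifest contains every idat ID. With real manifests a top fraction below 1 would mean that none of the known manifests fully explains the idat.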

NCBI-Hackathons / Scan2CNV

R function to tell what bpm manifest was used to create an idat #22