Batch effects?

m-gall commented 5 years ago

Dear PennCNV developers, I am working on an Illumina array dataset. I have two batches of data which were run using the same SNP chip chemistry (OncoArray) and at the same sequencing centre, but processed a year apart.

I have identified strong batch effects in my data, which manifest as differences in the range of the LRR values between the two batches. For instance, for some true/known CNVs, I can see a shift in the LRR values suggesting the presence of a CNV, but the values are not high enough to cross the CN2 threshold required for the CNV to be called by PennCNV.

The LRR values were generated in GenomeStudio and then exported for analysis in PennCNV.

Is this something you have encountered and do you have any suggestions for how I might 'train' the HMM to account for these effects?

Thanks.

Victor0122 commented 5 years ago

I have a similar problem. I am using the Canine HD SNP chip from Illumina. I tried to run PennCNV on an HPC cluster and got this message in the output log file: "Data from chromosome 23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38 will not be used in analysis". I tried to make a PFB file from the GenomeStudio output and used the default hhall.hmm file. I think something is wrong in my PFB file, and I would also like to know how to make an HMM file for a custom SNP chip.

Thanks

kaichop commented 5 years ago

For Victor0122's problem, you can add -lastchr 38 to the arguments, since by default the last chromosome is 22 (human). You do need to make your own PFB file using the compile_pfb.pl program.
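To pick a value for -lastchr, it can help to check which chromosome labels the PFB file actually contains. The following is a minimal Python sketch, not part of PennCNV; it assumes the PFB file is tab-delimited with a header row of Name, Chr, Position, PFB, and the function name max_numeric_chromosome is hypothetical.

```python
import csv
import io


def max_numeric_chromosome(pfb_lines):
    """Return the largest integer chromosome label found in PFB rows.

    Assumption: tab-delimited input with a header row containing a
    'Chr' column (Name, Chr, Position, PFB). Non-numeric labels such
    as X, Y or MT are ignored.
    """
    reader = csv.DictReader(pfb_lines, delimiter="\t")
    largest = 0
    for row in reader:
        chrom = row["Chr"].strip()
        if chrom.isdigit():
            largest = max(largest, int(chrom))
    return largest


# Made-up rows for illustration only; a real file would be opened with open(path).
example = io.StringIO(
    "Name\tChr\tPosition\tPFB\n"
    "snp_a\t1\t1000\t0.45\n"
    "snp_b\t38\t2000\t0.10\n"
    "snp_c\tX\t3000\t0.80\n"
)
print(max_numeric_chromosome(example))
```

If the largest label reported is above 22, that matches the situation described in the log message, where autosomes beyond the human default are skipped.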

For m-gall's question, there is no really good way to address this, other than using the batch number as a covariate in downstream analysis. Normally you could try generating a new cluster file and re-calculating the LRR/BAF values with it; you have to do this within GenomeStudio (i.e. instead of using the default cluster file, you generate your own cluster file from your own samples). However, given that the two batches were processed one year apart, a lot of things may have changed, so re-clustering may not really help.
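As a rough illustration of the covariate idea, the simplest case is a per-sample metric adjusted for batch alone. Subtracting each batch's mean gives the same values as the residuals of a linear model with batch as its only categorical covariate. The sketch below is an assumption-laden example, not a PennCNV feature: the function name center_within_batch and the sample values are hypothetical, and a real analysis would more likely include batch as a term in the association model itself.

```python
from collections import defaultdict
from statistics import mean


def center_within_batch(values, batches):
    """Subtract the batch mean from each per-sample value.

    Equivalent to the residuals of a linear model with batch as the
    only categorical covariate. Both arguments are parallel lists.
    """
    grouped = defaultdict(list)
    for value, batch in zip(values, batches):
        grouped[batch].append(value)
    batch_means = {batch: mean(vals) for batch, vals in grouped.items()}
    return [value - batch_means[batch] for value, batch in zip(values, batches)]


# Made-up per-sample metric (e.g. a CNV burden score) for two batches.
metric = [1.0, 3.0, 10.0, 14.0]
batch = [1, 1, 2, 2]
print(center_within_batch(metric, batch))
```

This only removes a shift in the mean between batches; it does not change which CNVs were called in the first place.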

WGLab / PennCNV

Batch effects? #32