Module Request: PLINK/SEQ

Requested in #311 by @jcgrenier:

Another useful QC that could be integrated would be one based on genotyping array data. We sometimes run analyses such as eQTL mapping that combine genotyping data (which costs less than generating whole-exome or whole-genome sequencing data) with RNA-seq data. Some summary bar plots could be built from the information in this table:

https://atgu.mgh.harvard.edu/plinkseq/stats.shtml#ind

Some of the information there could be useful to make sure all the samples are fine and have not been mixed up with one another. Using the heterozygosity rate, we can quickly detect whether one or more samples show any abnormalities.
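As a rough illustration of the heterozygosity check, here is a minimal Python sketch. It assumes the per-sample heterozygosity rates have already been extracted into a dict. The function name `flag_het_outliers`, the sample names, the rates, and the 3-standard-deviation cutoff are all hypothetical choices for this example; they are not taken from PLINK/SEQ or MultiQC.

```python
from statistics import mean, pstdev


def flag_het_outliers(het_rates, n_sd=3.0):
    """Return sample names whose heterozygosity rate lies more than
    n_sd standard deviations away from the cohort mean.

    het_rates: dict mapping sample name -> heterozygosity rate.
    n_sd: cutoff in standard deviations (an assumed default, tune as needed).
    """
    values = list(het_rates.values())
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # All samples have the same rate, so nothing stands out.
        return []
    return sorted(
        name for name, rate in het_rates.items() if abs(rate - mu) / sd > n_sd
    )


# Made-up rates: ten ordinary samples and one with unusually low heterozygosity.
rates = {f"S{i}": 0.30 for i in range(1, 11)}
rates["S11"] = 0.10
print(flag_het_outliers(rates))
```

With small cohorts a mean/SD cutoff is quite insensitive, so a plot of the raw rates is probably more informative than a hard threshold.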

Concerning the relatedness test on RNA-seq data: as I'm only doing it for QC purposes, I'm pretty much following the recommended GATK pipeline for SNP calling, but I generate the VCFs in GVCF format and join them together afterwards. The critical point after that is the amount of missing data per SNP. RNA-seq data is quite variable across samples, and these relatedness/IBD-sharing tests are strongly affected by the amount of missing data in the dataset. I normally keep only positions with a missing rate below 50%. What you get afterwards is a tabulated file with multiple values:

INDV1 INDV2 N_AaAa N_AAaa N1_Aa N2_Aa RELATEDNESS_PHI

The three columns that interest us in the end are 1, 2 and 7. The other ones are for identical/different genotypes and the number of points in each sample in the comparison.
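The two mechanical steps described above (dropping positions with too much missing data, then reading columns 1, 2 and 7 from the result table) could be sketched in Python roughly as follows. The function names, the in-memory data layout, and all example values are assumptions made for illustration. In particular, the example rows are made up and are not real tool output, and the parser simply splits on whitespace.

```python
# Genotype strings treated as missing (the usual VCF conventions).
MISSING = {"./.", ".|.", "."}


def missing_rate(genotypes):
    """Fraction of missing genotype calls at one position."""
    if not genotypes:
        return 1.0
    return sum(gt in MISSING for gt in genotypes) / len(genotypes)


def keep_sites(sites, max_missing=0.5):
    """Keep positions whose missing rate is strictly below max_missing.

    sites: dict mapping a position label -> list of genotype strings,
    one per sample.
    """
    return {
        site: gts for site, gts in sites.items() if missing_rate(gts) < max_missing
    }


def parse_relatedness(lines):
    """Return (INDV1, INDV2, RELATEDNESS_PHI) tuples, i.e. columns 1, 2 and 7."""
    pairs = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "INDV1":
            continue  # skip blank lines and the header row
        pairs.append((fields[0], fields[1], float(fields[6])))
    return pairs


# Made-up genotypes for four samples at three positions.
sites = {
    "chr1:100": ["0/1", "0/0", "./.", "1/1"],  # 25% missing, kept
    "chr1:200": ["./.", "./.", "0/1", "./."],  # 75% missing, dropped
    "chr1:300": ["./.", "./.", "0/0", "0/1"],  # exactly 50% missing, dropped
}
print(list(keep_sites(sites)))

# Made-up rows in the column layout shown above.
example = """INDV1 INDV2 N_AaAa N_AAaa N1_Aa N2_Aa RELATEDNESS_PHI
S1 S1 100 0 100 100 0.5
S1 S2 40 10 100 95 0.12
""".splitlines()
print(parse_relatedness(example))
```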

Here's some more interesting information (taken from this post on Biostars: https://www.biostars.org/p/111573/):

First-degree relatives are ~0.25, and 2nd-degree ~0.125, and 3rd degree 0.0625. "Unrelated" parents can reach values as high as ~0.04 in my experience.

So, in an analysis where we want to recover "identical" samples that should correspond to the same individual, we should get something between 0.25 and 0.5 (0.5 being an identical sample).
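A small Python sketch of how that rule of thumb might be applied to the pairwise values. The function name `flag_same_individual`, the sample names, and the values are hypothetical. The cutoff of 0.25 comes from the range quoted above, and treating it as a strict lower bound is an assumption of this example.

```python
def flag_same_individual(pairs, min_phi=0.25):
    """Return pairs of distinct samples whose relatedness value exceeds min_phi.

    pairs: iterable of (sample_a, sample_b, phi) tuples.
    min_phi: lower bound for "probably the same individual" (assumed cutoff).
    """
    return [(a, b, phi) for a, b, phi in pairs if a != b and phi > min_phi]


# Made-up pairwise values between RNA-seq and genotyping samples.
pairs = [
    ("RNA_1", "RNA_1", 0.50),  # self-comparison, ignored
    ("RNA_1", "DNA_1", 0.48),  # most likely the same individual
    ("RNA_1", "DNA_2", 0.03),  # unrelated
    ("RNA_2", "DNA_2", 0.26),  # borderline: close to the first-degree value
]
print(flag_same_individual(pairs))
```

Values only just above 0.25 are hard to tell apart from first-degree relatives, so a report would probably want to highlight them separately rather than call them matches outright.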


Module Request: PLINK/SEQ #412