MultiQC / MultiQC

Aggregate results from bioinformatics analyses across many samples into a single report.
http://multiqc.info
GNU General Public License v3.0

Module Request: PLINK/SEQ #412

Open ewels opened 7 years ago

ewels commented 7 years ago

Requested in #311 by @jcgrenier:

Another useful QC to integrate would be one based on genotyping array data. We sometimes run analyses such as eQTL mapping that combine genotyping data (which costs less than generating whole-exome or whole-genome sequencing data) with RNAseq data. Some summary barplots could be built from the information in this table:

https://atgu.mgh.harvard.edu/plinkseq/stats.shtml#ind

Some of the information there could help confirm that all the samples are fine and not mixed up with one another. Using the heterozygosity rate, we can quickly detect whether one or more samples have any abnormalities.

Concerning the relatedness test on RNAseq data: as I'm only doing it for QC, I mostly follow the recommended GATK pipeline for SNP calling, but generate the calls in GVCF format and join them together afterwards. The critical point after that is the amount of missing data per SNP, because RNAseq data is quite variable across samples and these relatedness/IBD sharing tests are strongly affected by the amount of missing data in the dataset. I normally keep only positions with a missing rate below 50%. What you get afterwards is a tabulated file with multiple values.

INDV1 INDV2 N_AaAa N_AAaa N1_Aa N2_Aa RELATEDNESS_PHI

The three columns that interest us in the end are 1, 2 and 7. The other ones give the counts of identical/different genotypes and the number of points in each sample of the comparison.

Here's some more interesting information (taken from this Biostars post: https://www.biostars.org/p/111573/)

First-degree relatives are ~0.25, and 2nd-degree ~0.125, and 3rd degree 0.0625. "Unrelated" parents can reach values as high as ~0.04 in my experience.

So, in an analysis where we want to recover "identical" samples that should correspond to the same individual, we should get a value between 0.25 and 0.5 (0.5 being an identical sample).
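As a rough illustration of the check described above, here is a minimal Python sketch that reads columns 1, 2 and 7 of such a table and lists the pairs whose RELATEDNESS_PHI falls in the 0.25 to 0.5 range. It assumes the file is tab-delimited with the header shown earlier; the function names, the `min_phi` parameter and the toy rows are made up for the example and are not part of any existing tool or MultiQC module.

```python
import csv
import io


def read_relatedness(handle):
    """Yield (INDV1, INDV2, RELATEDNESS_PHI) from a tab-delimited table.

    Assumption: the first line is the header shown in this thread.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        yield row["INDV1"], row["INDV2"], float(row["RELATEDNESS_PHI"])


def candidate_same_individual(pairs, min_phi=0.25):
    """Return pairs of different sample names with phi >= min_phi.

    The 0.25 default comes from the range quoted in the comment above;
    it is a starting point for QC, not a validated cutoff.
    """
    return [(a, b, phi) for a, b, phi in pairs if a != b and phi >= min_phi]


# Toy data invented for this example (not real results).
example = (
    "INDV1\tINDV2\tN_AaAa\tN_AAaa\tN1_Aa\tN2_Aa\tRELATEDNESS_PHI\n"
    "S1\tS1\t100\t0\t100\t100\t0.5\n"
    "S1\tS2\t90\t2\t100\t100\t0.43\n"
    "S1\tS3\t30\t10\t100\t100\t0.05\n"
)

flagged = candidate_same_individual(read_relatedness(io.StringIO(example)))
for indv1, indv2, phi in flagged:
    print(f"{indv1}\t{indv2}\t{phi}")
```

Self-comparisons (a sample against itself) are skipped here, since they would always be flagged.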

jcgrenier commented 7 years ago

Hello @ewels,

Here's an example for that PLINK/SEQ run.

The goal here is to see whether a sample shows an imbalance, relative to what we expect, in the ratio of homozygous alternative genotypes to heterozygous genotypes. To get this, infer the homozygous alternative count from the NHET and NVAR columns, NHOMALT = (NVAR - NHET), and take its ratio to NHET: NHOMALT/NHET. If a sample is far from the average, it probably means there is a library issue or that something unusual is going on with this sample.
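To make the calculation concrete, here is a small Python sketch of the ratio and a simple outlier flag. NVAR and NHET are the columns named above; the `ID` column name, the tab delimiter, the function names and the toy numbers are assumptions for illustration only. The outlier rule (distance from the median, scaled by the median absolute deviation) is my own stand-in for "far from the average", not something PLINK/SEQ does.

```python
import csv
import io
import statistics


def homalt_het_ratios(handle):
    """Return {sample: NHOMALT/NHET}, with NHOMALT = NVAR - NHET.

    Assumptions: tab-delimited input with ID, NVAR and NHET columns.
    Samples with NHET == 0 are skipped to avoid dividing by zero.
    """
    ratios = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        nvar, nhet = int(row["NVAR"]), int(row["NHET"])
        if nhet == 0:
            continue
        ratios[row["ID"]] = (nvar - nhet) / nhet
    return ratios


def flag_outliers(ratios, max_dev=5.0):
    """Return samples whose ratio is more than max_dev MADs from the median."""
    values = list(ratios.values())
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # All typical samples share one ratio: flag anything that differs.
        return [s for s, v in ratios.items() if v != med]
    return [s for s, v in ratios.items() if abs(v - med) > max_dev * mad]


# Toy data invented for this example (not taken from the attached file).
example = (
    "ID\tNVAR\tNHET\n"
    "A\t1500\t1000\n"
    "B\t1520\t1000\n"
    "C\t1480\t1000\n"
    "D\t1510\t1000\n"
    "E\t2500\t1000\n"
)

ratios = homalt_het_ratios(io.StringIO(example))
outliers = flag_outliers(ratios)
print(ratios)
print(outliers)
```

A barplot of these per-sample ratios would then make any outlying sample easy to spot in a report.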

Runs_278-279.merge.miss50.istats.txt

JC