Open gaow opened 2 years ago
As outlined in #138, the second point under covariates does not work.
For the genotype: so we take the samples from the output of readcount QC, filter the VCF, and then filter the output of readcount QC after PCA is done?
Yes to 2, except we don't filter the VCF by subsets of samples; we only use them to generate phenotype-specific PCA covariates.
Per the current design, both the PCA and the PCA-based outlier detection are determined on the subset of samples.
However, I am quite concerned: deciding whether a sample is an outlier based on a non-random subset of samples seems off.
Imagine a case where the data points for samples A, B, C, D, E, F, G, H, I, J are {1,2,3,4,5,6,7,8,9,10}. No outlier is present. However, say that for some reason we select only A, B, C, and J as our samples. Then we end up with {1,2,3,10}, where J looks like an outlier and may be mistakenly removed.
Hope I am wrong... but perhaps detecting and removing outliers should be based on the full set of samples instead of just one subset?
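A minimal sketch of the toy example above. It assumes a 1.5 x IQR rule purely as a stand-in for whatever outlier criterion the pipeline applies to the PCs; the function name `iqr_outliers` is hypothetical and not part of the pipeline.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return sample names falling outside 1.5 x IQR of the quartiles.

    `values` maps sample name -> a single numeric value (a toy stand-in
    for a PC score). The 1.5 x IQR rule is an assumption for illustration.
    """
    q1, _, q3 = quantiles(values.values(), n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return sorted(name for name, v in values.items() if v < low or v > high)


# Full set of samples A..J with values 1..10
full = dict(zip("ABCDEFGHIJ", range(1, 11)))
# Non-random subset: only A, B, C and J
subset = {name: full[name] for name in "ABCJ"}

print(iqr_outliers(full))    # []
print(iqr_outliers(subset))  # ['J']
```

With this rule J is flagged only in the subset, which is the concern: the same sample is unremarkable when all samples are considered.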
@hsun3163 we will not remove outliers, but we will recommend a list of them to users and ask them to decide, e.g. whether there are too few samples to confidently work out the outliers. We don't want to analyze all samples together in case different tissues indeed come from different cohorts and there is population substructure between these cohorts.
As I understand it, removing samples from the phenotype data based on their availability in the genotype data only benefits the normalization step. (APEX and tensorQTL can already handle mismatched samples. PEER & BICV take the intersection between phenotype and covariates, and the covariates only contain the samples in the PCA.) Therefore, for external bed.gz input (i.e. input that doesn't need to be normalized), perhaps there is no need to intersect?
I agree with this assessment.
Complication
Genotype and covariates are possibly shared across all studies, while phenotypes are unique to each one. Sometimes the overlap is large and the few non-overlapping samples are negligible and can be removed at any point in the analysis. Sometimes a phenotype can have far fewer samples than are available in the genotype data (as is the case for the data @hsun3163 is currently analyzing).
Preparation
We should create a look-up file of 2 columns:
that takes only the OVERLAP between these data sets. This will also serve as a sample name matching file if sample names don't agree.
We ask users to provide this, in case they want to exclude samples for other reasons. Our analysis will focus on these samples when applicable.
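A rough sketch of how such a look-up could be built. The two columns are assumed here to be the genotype sample ID and the phenotype sample ID (the issue does not spell them out), and `build_lookup`, `rename`, and the sample names are all hypothetical.

```python
def build_lookup(genotype_ids, phenotype_ids, rename=None):
    """Return (genotype_id, phenotype_id) pairs for samples present in both.

    `rename` optionally maps a phenotype sample name to its genotype
    sample name, for the case where the two naming schemes disagree.
    """
    rename = rename or {}
    genotype_set = set(genotype_ids)
    pairs = []
    for pid in phenotype_ids:
        gid = rename.get(pid, pid)  # fall back to the same name
        if gid in genotype_set:     # keep only the overlap
            pairs.append((gid, pid))
    return pairs


genotype_ids = ["S1", "S2", "S3", "S4"]
phenotype_ids = ["S2", "s3_rna", "S9"]
lookup = build_lookup(genotype_ids, phenotype_ids, rename={"s3_rna": "S3"})

# Write the look-up as two tab-separated columns
for gid, pid in lookup:
    print(f"{gid}\t{pid}")
```

Here S9 is dropped because it has no genotype, and S1 and S4 are dropped because they have no phenotype. A user-provided file could simply replace the output of this step.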
Genotype
Variant level QC should be based on all samples -- we have been doing that with the VCF pipeline but not yet the PLINK format input (we do that at the very end).
PCA derived from genotype data is ideally performed on each phenotype separately
@hsun3163 :
We should take the overlapping samples right after VCF QC and before KING, to generate markers (MAF ≥ 5%, LD-pruned) and compute PCA per study. We will then remove outliers based on PCA results from multiple studies. We remove them from the full genotype data.
The look-up file may be adjusted for outliers.
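One way the adjustment could look, as a sketch only. It assumes each study's PCA step yields a list of flagged genotype sample IDs and that the user has confirmed them; `adjust_lookup` and the study and sample names are hypothetical.

```python
def adjust_lookup(lookup, outliers_by_study):
    """Drop look-up rows whose genotype ID was flagged in any study.

    `lookup` is a list of (genotype_id, phenotype_id) pairs and
    `outliers_by_study` maps study name -> iterable of flagged genotype IDs.
    Returns the remaining pairs and the sorted union of flagged IDs.
    """
    flagged = set().union(*outliers_by_study.values())
    kept = [(gid, pid) for gid, pid in lookup if gid not in flagged]
    return kept, sorted(flagged)


lookup = [("S1", "S1"), ("S2", "S2"), ("S3", "s3_rna")]
outliers_by_study = {"study_a": ["S2"], "study_b": ["S2", "S7"]}

kept, flagged = adjust_lookup(lookup, outliers_by_study)
print(kept)     # [('S1', 'S1'), ('S3', 's3_rna')]
print(flagged)  # ['S2', 'S7']
```

The union across studies is a design choice in this sketch; requiring a sample to be flagged in more than one study would be a stricter alternative.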
Phenotype
Covariates
@hsun3163 Let me know what you think I'm missing