When to intersect samples among genotype, phenotype and covariates

gaow commented 2 years ago

Complication

Genotype and covariates may be shared across all studies, while phenotypes are unique to each one. Sometimes the overlap is large, and the few non-overlapping samples are negligible and can be removed at any point in the analysis. Sometimes a phenotype has far fewer samples than are available in the genotype data (as is the case for the data @hsun3163 is currently analyzing).

Preparation

We should create a look-up file of 2 columns:

sample_name_in_pheno(and cov), sample_name_in_geno

that takes only the OVERLAP between these data sets. This will also serve as a sample name matching file if sample names don't agree.

We ask users to provide this, in case they want to exclude samples for other reasons. Our analysis will focus on these samples when applicable.
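
A minimal sketch of how such a look-up file could be assembled, assuming sample names are available as plain lists and that any known name mismatches are supplied as a mapping. The function names `build_lookup` and `lookup_to_tsv`, the tab delimiter, and the example sample IDs are all hypothetical; only the two column names come from the description above.

```python
import csv
import io


def build_lookup(pheno_samples, geno_samples, rename=None):
    """Return (pheno_name, geno_name) pairs for samples found in both data sets.

    `rename` maps a phenotype/covariate sample name to its genotype name
    when the two do not agree (hypothetical convention for this sketch).
    """
    rename = rename or {}
    geno = set(geno_samples)
    pairs = []
    for pheno_name in pheno_samples:
        geno_name = rename.get(pheno_name, pheno_name)
        if geno_name in geno:  # keep only the overlap
            pairs.append((pheno_name, geno_name))
    return pairs


def lookup_to_tsv(pairs):
    """Serialize the pairs as a two-column, tab-delimited table with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample_name_in_pheno", "sample_name_in_geno"])
    writer.writerows(pairs)
    return buf.getvalue()


# Toy example: S2 is called geno_2 in the genotype data; S3 has no genotype.
pairs = build_lookup(
    ["S1", "S2", "S3"],
    ["S1", "geno_2", "S9"],
    rename={"S2": "geno_2"},
)
print(lookup_to_tsv(pairs), end="")
```

Users who want to exclude samples for other reasons could simply delete rows from the resulting file before running the pipeline.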

Genotype

Variant level QC should be based on all samples -- we have been doing that with the VCF pipeline but not yet the PLINK format input (we do that at the very end).
PCA derived from genotype data is ideally performed on each phenotype separately

@hsun3163 :
We should take the overlapping samples right after VCF QC and before KING, to generate markers (MAF >= 5%, LD pruned) and compute PCA per study. We will then remove outliers based on PCA results from multiple studies. We remove them from the full genotype data.
The look-up file may be adjusted for outliers.

Phenotype

Using the look-up file, we remove samples after QC and before normalization and gene data annotation.
In the RNA-seq normalization pipeline, the required sample_lookup_file should be derived from our look-up file (if not used as is).
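
As a rough illustration of the phenotype filtering step, the sketch below drops sample columns that are not in the look-up file. It assumes a phenotype table whose first four columns are annotation (as in a bed-style layout) followed by one column per sample; that layout, the function name `filter_pheno`, and the toy values are assumptions for illustration, not the pipeline's actual code.

```python
def filter_pheno(header, rows, lookup_pheno_names, n_annot=4):
    """Keep the annotation columns plus sample columns listed in the look-up file.

    header: list of column names
    rows: list of rows, each a list aligned with `header`
    lookup_pheno_names: phenotype-side sample names from the look-up file
    n_annot: number of leading annotation columns (assumed to be 4 here)
    """
    keep = set(lookup_pheno_names)
    idx = [i for i, name in enumerate(header) if i < n_annot or name in keep]
    new_header = [header[i] for i in idx]
    new_rows = [[row[i] for i in idx] for row in rows]
    return new_header, new_rows


# Toy example: S2 is absent from the look-up file and gets dropped.
header = ["#chr", "start", "end", "gene_id", "S1", "S2", "S3"]
rows = [["chr1", "100", "101", "G1", "5", "7", "9"]]
new_header, new_rows = filter_pheno(header, rows, ["S1", "S3"])
print(new_header)
print(new_rows)
```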

Covariates

Covariate data filter should happen before factor analysis, using the look-up file
For APEX, we can use this look-up file to create a header-only VCF file on the fly: https://github.com/hsun3163/neuro-apex/issues/1#issuecomment-876715665

@hsun3163 Let me know what you think I'm missing

hsun3163 commented 2 years ago

As outlined in #138, the second point under Covariates does not work.
For the Genotype: so we take the samples from the output of readcount QC, filter the VCF, and then filter the output of readcount QC after PCA is done?

gaow commented 2 years ago

Yes to 2, except we don't filter the VCF by subsets of samples; we only use them to generate phenotype-specific PCA covariates.

hsun3163 commented 2 years ago

Per the current design, PCA and PCA-based outlier detection are performed on the subset of samples.

However, I am quite concerned: deciding whether a sample is an outlier based on a non-random subset of samples seems off.

Imagine a case where the data points for samples A, B, C, D, E, F, G, H, I, J are {1,2,3,4,5,6,7,8,9,10}. No outlier is present. However, say for some reason we only select A, B, C, and J as our samples. Then we end up with {1,2,3,10}, where J looks like an outlier and may be mistakenly removed.

Hope I am wrong... but perhaps detecting and removing outliers should be based on the full set of samples instead of just one subset?
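
The toy example above can be checked with a quick sketch. The outlier rule used here (a modified z-score based on the median absolute deviation, with the conventional 3.5 cutoff) is an assumption chosen only for illustration; it is not necessarily the rule the pipeline applies to PCA results.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return sample names whose modified z-score exceeds `threshold`.

    values: dict mapping sample name -> data point
    The modified z-score is 0.6745 * |x - median| / MAD.
    """
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        return []  # no spread, so nothing can be flagged by this rule
    return [
        name
        for name, v in values.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


full = dict(zip("ABCDEFGHIJ", range(1, 11)))
subset = {name: full[name] for name in "ABCJ"}

print(mad_outliers(full))    # nothing flagged on the full sample set
print(mad_outliers(subset))  # J is flagged on the non-random subset
```

Under this rule, J is unremarkable among all ten samples (modified z-score about 1.2) but is flagged in the subset {1,2,3,10} (about 5.1), which is the concern raised above.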

gaow commented 2 years ago

@hsun3163 we will not remove outliers, but we will recommend a list of them to users and ask them to decide, e.g. whether there are too few samples to confidently identify the outliers. We don't want to run the analysis on all samples, in case different tissues indeed come from different cohorts and there is population substructure between these cohorts.

hsun3163 commented 2 years ago

As I understand it, removing samples from the phenotype data based on their availability in the genotype data only benefits the normalization step. (APEX and tensorQTL can already handle mismatched samples. PEER & BICV take the intersection between phenotype and covariates, and the covariates only contain the samples that have PCs.) Therefore, for external bed.gz input (i.e. input that does not need to be normalized), perhaps there is no need to intersect?

gaow commented 2 years ago

I agree with this assessment.

cumc / xqtl-protocol