cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium
https://cumc.github.io/xqtl-protocol/
MIT License

When to intersect samples among genotype, phenotype and covariates #137

Open gaow opened 2 years ago

gaow commented 2 years ago

Complication

Genotype and covariates are possibly shared across all studies, while phenotypes are unique to each one. Sometimes the overlap is large, and the few non-overlapping samples are negligible and can be removed at any point in the analysis. Sometimes a phenotype has far fewer samples than are available in the genotype data (as is the case for the data @hsun3163 is currently analyzing).

Preparation

We should create a look-up file of 2 columns:

sample_name_in_pheno(and cov), sample_name_in_geno

that takes only the OVERLAP between these data sets. This will also serve as a sample-name matching file if sample names don't agree.

We ask users to provide this, in case they want to exclude samples for other reasons. Our analysis will focus on these samples when applicable.
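
A minimal sketch of how such a look-up file could be built, assuming the sample names are available as plain lists and that any renaming between the two data sets is supplied as a dictionary. The names `build_sample_lookup`, `write_sample_lookup` and `name_map` are hypothetical and not part of the protocol; the sample IDs are toy values.

```python
import csv

def build_sample_lookup(pheno_samples, geno_samples, name_map=None):
    """Return (sample_name_in_pheno, sample_name_in_geno) pairs for the overlap only.

    name_map (hypothetical) maps a phenotype/covariate sample name to its
    genotype name when the two naming schemes differ; names missing from the
    map are assumed to be spelled identically in both data sets.
    """
    name_map = name_map or {}
    geno_set = set(geno_samples)
    pairs = []
    seen = set()
    for pheno_name in pheno_samples:
        geno_name = name_map.get(pheno_name, pheno_name)
        # Keep a sample only if it also exists in the genotype data.
        if geno_name in geno_set and pheno_name not in seen:
            pairs.append((pheno_name, geno_name))
            seen.add(pheno_name)
    return pairs

def write_sample_lookup(pairs, path):
    """Write the two-column look-up file as tab-separated text."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["sample_name_in_pheno", "sample_name_in_geno"])
        writer.writerows(pairs)

# Toy example: S3 has no genotype and G9 has no phenotype, so both drop out.
pheno = ["S1", "S2", "S3"]
geno = ["G1", "S2", "G9"]
lookup = build_sample_lookup(pheno, geno, name_map={"S1": "G1"})
```

A user who wants to exclude samples for other reasons could simply delete the corresponding rows from the written file before running the analysis.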

Genotype

Phenotype

Covariates

@hsun3163 Let me know what you think I'm missing.

hsun3163 commented 2 years ago
  1. As outlined in #138, the second point under Covariates does not work.

  2. For the Genotype: so we take the samples from the output of readcount QC, filter the VCF, and then filter the output of readcount QC after PCA is done?

gaow commented 2 years ago

Yes to 2, except we don't filter the VCF by subsets of samples; we only use them to generate phenotype-specific PCA covariates.

hsun3163 commented 2 years ago

Per the current design, PCA and the PCA-based outlier detection are both run on the subset of samples.

However, I am quite concerned: deciding whether a sample is an outlier based on a non-random subset of samples seems off.

Imagine a case where the data points for samples A, B, C, D, E, F, G, H, I, J are {1,2,3,4,5,6,7,8,9,10}. No outlier is present. However, say that for some reason we select only A, B, C, and J as our samples. Then we end up with {1,2,3,10}, where J looks like an outlier and may be mistakenly removed.

Hope I am wrong... but perhaps detecting and removing outliers should be based on the full set of samples instead of just one subset?
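
The toy example above can be checked numerically. The sketch below uses a median-absolute-deviation (MAD) rule with a modified z-score cutoff of 3.5; both the rule and the cutoff are illustrative assumptions standing in for the protocol's PCA-based outlier detection, and `mad_outliers` is a hypothetical helper name.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Flag samples whose modified z-score exceeds the threshold.

    values: dict mapping sample name -> data point.
    The MAD rule and the 3.5 cutoff are illustrative assumptions,
    not the protocol's PCA-based method.
    """
    center = median(values.values())
    mad = median(abs(v - center) for v in values.values())
    if mad == 0:
        # All deviations collapse to zero; no meaningful scale to judge by.
        return []
    return [name for name, v in values.items()
            if 0.6745 * abs(v - center) / mad > threshold]

# Full cohort A..J with data points 1..10, and the non-random subset A, B, C, J.
full = dict(zip("ABCDEFGHIJ", range(1, 11)))
subset = {name: full[name] for name in "ABCJ"}

full_outliers = mad_outliers(full)      # nothing is flagged in the full cohort
subset_outliers = mad_outliers(subset)  # J is flagged in the subset
```

Under this rule, J is flagged only when the reference set is the subset, which is the concern raised here: the outlier call depends on which samples define the reference distribution.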

gaow commented 2 years ago

@hsun3163 we will not remove outliers, but we will recommend a list of them to users and ask them to decide, e.g. whether there are too few samples to confidently work out the outliers. We don't want to run the outlier analysis on all samples, in case different tissues indeed come from different cohorts and there is population substructure between these cohorts.

hsun3163 commented 2 years ago

As I understand it, removing samples from the phenotype data based on their availability in the genotype data only benefits the normalization step. (APEX and tensorQTL can already handle mismatched samples. PEER and BICV take the intersection between phenotype and covariates, and the covariates only have the samples in the PCAs.) Therefore, for external bed.gz input (i.e. input that doesn't need to be normalized), perhaps there is no need to intersect?
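
A minimal sketch of that rule, assuming a BED-style phenotype table whose first four columns are coordinate/ID columns followed by one column per sample. The function name `select_phenotype_columns`, the `needs_normalization` flag, the four-column assumption and the header values are all hypothetical and only illustrate the proposal.

```python
def select_phenotype_columns(header, keep_samples, needs_normalization, n_fixed=4):
    """Return the indices of the columns to keep from a phenotype table.

    header: column names; the first n_fixed are assumed to be the
            coordinate/ID columns of a BED-style file.
    keep_samples: sample names shared with the genotype data.
    needs_normalization: if False (e.g. an already-normalized external
            bed.gz), every column is kept and no intersection is done.
    """
    if not needs_normalization:
        return list(range(len(header)))
    keep = set(keep_samples)
    fixed = list(range(n_fixed))
    # Preserve the original column order of the retained samples.
    samples = [i for i in range(n_fixed, len(header)) if header[i] in keep]
    return fixed + samples

header = ["#chr", "start", "end", "ID", "S1", "S2", "S3"]
raw_cols = select_phenotype_columns(header, ["S1", "S3"], needs_normalization=True)
external_cols = select_phenotype_columns(header, ["S1", "S3"], needs_normalization=False)
```

Here `raw_cols` drops the column for S2 before normalization, while `external_cols` keeps every column and leaves any sample mismatch to the downstream tools.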

gaow commented 2 years ago

I agree with this assessment.