Recommended filtering - Githubissues

iranmdl commented 5 months ago

Hi!

Thank you for this extremely useful tool.

I have just used gtc2vcf to convert 300 GTC files from Infinium Global Screening Array into VCF.

bcftools +gtc2vcf \
  -Ov \
  --adjust-clusters \
  --bpm ${bpm_manifest_file} \
  --csv ${csv_manifest_file} \
  --egt ${egt_manifest_file} \
  --gtcs ${gtc_folder} \
  --fasta-ref ${ref} \
  --extra ${prefix}.tsv \
  --output ${prefix}.vcf

Now I am trying to understand the different quality metrics and how to use them for downstream filtering. This is my first time working with arrays (all my experience is with WES/WGS), and I was advised to follow the quality filters suggested in the paper "Strategies for processing and quality control of Illumina genotyping arrays". That paper uses GenomeStudio, and I can see that for SNPs with low GenTrain scores the authors manually realign the cluster positions, which increases the GenTrain score. Is this something that should be done after running gtc2vcf, or does the --adjust-clusters argument take care of it?

Are there any recommended thresholds for filtering (GenTrain_Score, Cluster_Sep, ...)? Also, I have noticed many SNPs with a GenTrain_Score of 0 while Orig_Score is good. Is this expected? E.g.:

ID                          GenTrain_Score  Orig_Score
10:135332149_CNV_CYP2E1     0               0.85
10:89725294_CNV_PTEN_e9_9   0               0.88
10:43625509_CNV_RET_e20_20  0               0.87

Thank you in advance!

freeseek commented 5 months ago

The cluster file information included in the VCF is for informational purposes only, and I do not have much experience with it. When you convert with BCFtools/gtc2vcf, the cluster file is used to compute the normalized intensities, not to recall the genotypes. If you use --adjust-clusters you might get better normalized intensities, but the genotypes will stay the same, so I do not recommend using it. You can use GenomeStudio to update your cluster centers and then use iaap-cli to generate new genotypes with the updated cluster file, but you have to do that separately, as BCFtools/gtc2vcf does not have a framework for generating cluster files. I personally never bothered to use cluster files other than those provided by Illumina. The variant QC I perform is based only on genotype missingness and HWE (Hardy-Weinberg equilibrium).
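A minimal sketch of a variant filter based on missingness and HWE, as described in the reply above. It assumes per-variant genotype counts are already available. The HWE p-value here comes from a chi-square goodness-of-fit approximation with one degree of freedom, which is a simplification and not necessarily the test used by any BCFtools plugin. The function names and the thresholds `max_missing=0.05` and `min_hwe_p=1e-6` are illustrative assumptions, not values recommended in this thread.

```python
import math


def missingness(n_aa, n_ab, n_bb, n_missing):
    """Fraction of samples without a genotype call at one variant."""
    total = n_aa + n_ab + n_bb + n_missing
    if total == 0:
        raise ValueError("no samples")
    return n_missing / total


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Approximate HWE p-value from called genotype counts (chi-square, 1 df)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic site: no deviation can be tested
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_variant_qc(n_aa, n_ab, n_bb, n_missing,
                      max_missing=0.05, min_hwe_p=1e-6):
    """Keep a variant only if missingness and HWE are both acceptable.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    if missingness(n_aa, n_ab, n_bb, n_missing) > max_missing:
        return False
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p
```

For example, `passes_variant_qc(25, 50, 25, 0)` keeps a variant whose counts match HWE expectations exactly, while `passes_variant_qc(50, 0, 50, 0)` rejects a variant with no heterozygotes at all.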

iranmdl commented 5 months ago

Thank you for the quick answer :). Ok, so you don't recommend --adjust-clusters? After reading the README I understood it was recommended: "If you convert hundreds of GTC files at once, you can use the --adjust-clusters option which will recenter the genotype clusters rather than using those provided in the EGT cluster file and will compute less noisy LRR values."

So the manual review of the clusters has to be done in GenomeStudio (GS). I guess one could then export a reviewed cluster.egt file from GS, run iaap-cli, and then run BCFtools/gtc2vcf? Several sources state that "Genome Studio’s automatic clustering algorithms are reported to be accurate for ~ 99 % of SNPs. The other ~ 1 % need to be manually reviewed". I was hoping to skip this manual review by using gtc2vcf, but I guess that is asking too much! :)

freeseek commented 5 months ago

I have never used GenomeStudio, so I have no experience with it. You can use --adjust-clusters if you want, but if you use the BAF and LRR values with BCFtools/mocha, it should not make a meaningful difference, as BCFtools/mocha has its own approach to re-centering the clusters on the fly.

iranmdl commented 5 months ago

Aha! I see, thank you! And you only use genotype missingness and HWE for variant filtering, not GenTrain_Score, call frequency, genotype quality, ...?

freeseek commented 5 months ago

Genotypes become missing when the genotype quality is too low (see the iaap-cli option --gencall-cutoff), so genotype quality is taken into account that way.
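To illustrate the idea in the reply above, here is a minimal sketch of how a score cutoff turns low-quality calls into missing genotypes. It is not the iaap-cli implementation. The function name, the default cutoff of 0.15, and the choice to keep a call whose score equals the cutoff are all assumptions made for this example.

```python
def apply_gencall_cutoff(calls, scores, cutoff=0.15):
    """Replace genotype calls scoring below the cutoff with None (no-call).

    calls  : list of genotype labels, e.g. "AA", "AB", "BB"
    scores : list of per-genotype quality scores, same length as calls
    cutoff : hypothetical default; check the value used in your own run
    """
    if len(calls) != len(scores):
        raise ValueError("calls and scores must have the same length")
    return [call if score >= cutoff else None
            for call, score in zip(calls, scores)]
```

With this sketch, `apply_gencall_cutoff(["AA", "AB", "BB"], [0.9, 0.10, 0.15])` returns `["AA", None, "BB"]`, so a later missingness filter indirectly reflects genotype quality.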

iranmdl commented 5 months ago

Right, so the genotype quality check is done when the IDAT files are converted into GTC. Thanks!

I intend to use this VCF for phasing + imputation + GWAS, not the full mocha pipeline for the moment, and I am trying to figure out some standardized filtering criteria to apply to the VCF file. For example, in WES/WGS you can find common filtering cutoffs such as DP>=10, QUAL>=30, etc. I'm all ears if you've got any tips or suggestions :)

Also, is there a metric in the INFO field of the VCF with Call_Freq information (the proportion of samples at each locus successfully genotyped)?

freeseek commented 5 months ago

As IDAT to GTC is a sample-by-sample conversion, you don't get statistics across samples. But you can easily compute those from the final VCF with different BCFtools plugins. I always perform phasing and imputation using mocha.wdl and impute.wdl from the MoChA WDL pipeline.
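As a rough illustration of computing a cross-sample statistic from the final VCF, the sketch below derives a per-variant call frequency (the proportion of samples with a non-missing genotype) from one uncompressed VCF data line. It is a plain-Python stand-in for what a BCFtools plugin would do, and the function name `call_frequency` is an assumption introduced here.

```python
def call_frequency(vcf_line):
    """Proportion of samples with a fully called GT on one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("line has no sample columns")
    gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
    samples = fields[9:]
    called = 0
    for sample in samples:
        gt = sample.split(":")[gt_index]
        alleles = gt.replace("|", "/").split("/")
        # A genotype counts as called only if no allele is ".".
        if all(allele != "." for allele in alleles):
            called += 1
    return called / len(samples)


line = "1\t100\trs1\tA\tG\t.\t.\t.\tGT:GQ\t0/0:50\t0/1:40\t./.:0\t1/1:30"
print(call_frequency(line))  # 0.75
```

Applying this to every data line (those not starting with `#`) gives the same kind of per-locus call rate that GenomeStudio reports as Call_Freq, under the assumption that missing genotypes are encoded with `.` alleles.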

iranmdl commented 5 months ago

Can the MoChA WDL pipeline run on a cluster (specifically, SLURM), or only in the cloud? I would love to try the pipeline's phasing and imputation submodules.

freeseek commented 5 months ago

It can work wherever you can get Cromwell to run. Most of my collaborators run it with SLURM. Detailed instructions for Cromwell setup are here

iranmdl commented 5 months ago

Thank you @freeseek! I will give it a try. One last question: could you provide a GTC file that I can use to test the pipeline? I tried to find a publicly available one, but with no success so far.

freeseek commented 5 months ago

For Illumina examples check here and here

freeseek / gtc2vcf

Recommended filtering #61