eblerjana / pangenie

Pangenome-based genome inference
MIT License

Filtering variants #14

Open JosephLalli opened 2 years ago

JosephLalli commented 2 years ago

Hi there,

Thanks for this great tool. I noticed that, in both the HPRC reference release paper and the original paper describing this tool, you employ complex pre- and post-PanGenie variant filtering:

"We used vcfbub (version 0.1.0, (Garrison, 2021)) with parameters -l 0 and -r 100000 in order to filter the VCF..." "We additionally remove all records for which more than 80% of all 88 haplotypes carry a missing allele..." "we have implemented a decomposition approach which aims at detecting all variant alleles nested inside of multi-allelic top-level records..." "[We] defined a high quality subset of SV alleles that we can reliably genotype. For this purpose, we applied a machine learning approach similar to what we have presented previously..."

How important are these methods to the quality/reliability of your final genotypes, especially for large SVs (>50 bp)? Does this package implement these filters, and if not, is there code that does? Do you have pre-trained models to use for the machine learning/regression-based filter?

Just trying to get high-quality SV genotypes on my diverse short-read dataset, so any help is greatly appreciated.

Thanks again! -Joe

eblerjana commented 2 years ago

Hi Joe,

The pre-filtering is done in order to remove positions for which the input haplotypes carry too many missing alleles. Since this haplotype information is important for PanGenie to make accurate predictions, we typically remove sites for which more than 20% of the haplotypes carry a missing allele (note that there is a typo in the HPRC preprint: the 80% should be 20%, as we used the same strategy as for the HGSVC project). The filtering based on vcfbub, as well as the decomposition approach, are specific to the HPRC graphs, more specifically to the VCFs generated based on snarl traversals of these graphs. The reason why the decomposition is needed for downstream analyses is described in detail in the preprint, and it is done after genotyping bubbles with PanGenie.

The approach used to filter the final genotypes helps to remove poor-quality genotypes computed for a cohort of samples. It uses metrics like the mendelian consistency, genotype quality or the re-genotyping performance of samples present in the graph. Therefore, this approach only works if genotypes of a large cohort (like the 1000 Genomes samples) are available that include trios. The pipelines used for HPRC genotyping and filtering are available at: https://bitbucket.org/jana_ebler/hprc-experiments/src/master/genotyping-experiments/, including the regression-based filtering pipeline for genotypes of the 1000 Genomes samples (https://bitbucket.org/jana_ebler/hprc-experiments/src/master/genotyping-experiments/workflow/rules/population-evaluation.smk). The latter computes the statistics needed for the regression and runs the final filtering script that applies this model (https://bitbucket.org/jana_ebler/hprc-experiments/src/master/genotyping-experiments/workflow/scripts/analysis.py). Parameters used in this script might have to be adapted if applied to cohorts of different sizes.
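
For illustration, a minimal pysam sketch of the missing-allele pre-filter described above could look like this (file names are placeholders; the actual pre-filtering is implemented in the pipelines linked above):

```python
# Minimal sketch: drop records where more than 20% of the panel
# haplotypes carry a missing allele. File names are placeholders.
from pysam import VariantFile

vcf_in = VariantFile("panel.vcf.gz")
vcf_out = VariantFile("panel.filtered.vcf.gz", "w", header=vcf_in.header)

MAX_MISSING = 0.2  # maximum fraction of haplotypes with a missing allele

for rec in vcf_in:
    alleles = [a for s in rec.samples.values() for a in s["GT"]]
    missing = sum(a is None for a in alleles)
    if alleles and missing / len(alleles) <= MAX_MISSING:
        vcf_out.write(rec)
```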

If you have genotyping data of a larger cohort (> 100 samples), I'd suggest using a similar regression-based filtering approach as we used for HPRC. If you want to filter genotypes computed for single samples, I'd suggest using the genotype quality value for filtering. We typically use a threshold of 200 to extract relatively strict sets of high-quality genotypes, but smaller values can be used to generate more lenient sets.
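
For a single-sample VCF, a minimal sketch of such a quality filter could be (file names are placeholders; this assumes the genotype quality is stored in the FORMAT/GQ field):

```python
# Minimal sketch: keep only records where the sample's genotype
# quality is at least 200; lower the cutoff for a more lenient set.
# File names are placeholders.
from pysam import VariantFile

vcf_in = VariantFile("sample.genotyped.vcf.gz")
vcf_out = VariantFile("sample.gq200.vcf.gz", "w", header=vcf_in.header)

for rec in vcf_in:
    sample = next(iter(rec.samples.values()))
    gq = sample.get("GQ")
    if gq is not None and gq >= 200:
        vcf_out.write(rec)
```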

I hope this helps.

Best, Jana

JosephLalli commented 2 years ago

This is very helpful, thank you! I am relatively new to graph genomics, but I am trying to learn.

My dataset is over 100 genomes, but with no trios. It sounds like a regression-based filtering approach is the better way to go.

I guess my last question is: were these quality filters applied to all of your analyses in the pangenome preprint? If so, how important are they? Would I expect a significant increase in my false positive rate if I use PanGenie without using a regression model to filter its calls?

eblerjana commented 2 years ago

The pre-filtering was done to prepare the input VCF used for all analyses. For post-filtering, the details are described in the Methods section ("SV genotyping with PanGenie", page 65): the results shown for the "leave-one-out" experiment are based on the unfiltered sets and include all variants (Supplementary Figure 28). The comparisons to HGSVC genotypes and the Illumina-based SV calls (Figure 6C,D,E) are based on the "filtered set", i.e. after applying the regression-based filtering approach. The comparison to the medically relevant SVs is based on the unfiltered set.

We additionally show some statistics for the unfiltered set (a version of Fig. 6C for the unfiltered set) in Supplementary Figures 29-31 (pages 24-26 in the supplement). These results show that even for the unfiltered set the results look quite good, but the regression approach helps to remove some variants poorly typed across the cohort (especially for "complex" variants). So the conclusion is that the filtering definitely helps to remove false positive genotypes, especially in complex regions of the graph. Whether it is necessary in your case is hard to say, because it depends on the structure of your input graph and how accurate it is. If it contains many large bubbles with many different alternative alleles (like the HPRC graphs), filtering is definitely useful. As I mentioned before, simply filtering based on the quality is always an option as well (see the PanGenie paper).
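
Schematically, the regression-based idea can be sketched like this (this is not the actual pipeline from analysis.py; all column names, the model choice and the cutoff are hypothetical placeholders):

```python
# Schematic sketch of regression-based filtering: fit a regressor on
# per-variant statistics for a trusted subset, score all variants, and
# keep those above a cutoff. Column names and the cutoff are
# hypothetical; the real features and model live in analysis.py.
import pandas as pd
from sklearn.svm import SVR

stats = pd.read_csv("variant_stats.tsv", sep="\t")
features = ["mendelian_consistency", "mean_gq", "panel_concordance"]

# Train on variants with known labels (e.g. clearly well/poorly typed).
trusted = stats[stats["is_trusted"] == 1]
model = SVR().fit(trusted[features], trusted["label"])

stats["score"] = model.predict(stats[features])
stats.loc[stats["score"] >= 0.5, "variant_id"].to_csv(
    "passed_ids.txt", index=False, header=False)
```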

JosephLalli commented 2 years ago

OK, I think I've got it now. I suggest incorporating the regression filter into PanGenie somehow, since it seems important to PanGenie's impressively low false positive/false negative rates.

I'll let you know if I encounter any trouble implementing these tools. If I have time, I will try to add the filter to PanGenie's workflow and submit it as a pull request.

eblerjana commented 2 years ago

The problem is that the current version of the regression-based filtering pipeline is very specific to cases where one has cohort genotypes including trios, as well as genotypes for the panel samples contained in the graph given as input to PanGenie. The trios contribute features based on mendelian consistency. The genotypes for the graph samples are used to measure the accuracy of the genotyping: since these samples are in the graph, we know their true genotypes and can compare the predicted genotypes to the true ones. Although this accuracy measurement is too optimistic (because PanGenie uses the haplotypes of these samples when computing their genotypes), it helps to identify variants for which the genotyping fails.
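
As an illustration of the mendelian-consistency feature, a minimal sketch might look like this (trio sample names and the file name are placeholders):

```python
# Minimal sketch: fraction of trios with a mendelian-consistent
# genotype per variant, one of the features described above.
# Trio sample names and the file name are placeholders.
from pysam import VariantFile

TRIOS = [("child", "mother", "father")]

def mendel_consistent(child, mother, father):
    # Diploid check: the child's alleles must be explainable by one
    # allele inherited from each parent.
    return any(sorted(child) == sorted((m, f))
               for m in mother for f in father)

vcf = VariantFile("cohort.vcf.gz")
for rec in vcf:
    checks = []
    for c, m, f in TRIOS:
        gts = [rec.samples[s]["GT"] for s in (c, m, f)]
        if all(None not in g for g in gts):
            checks.append(mendel_consistent(*gts))
    if checks:
        print(rec.chrom, rec.pos, sum(checks) / len(checks))
```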

So I don't think adding the filtering directly into PanGenie is helpful in general. What I can do is provide a separate filtering pipeline for such specific use cases in the pipelines/ folder. I have created a new branch where I have started adding a standalone version of the HPRC filtering pipeline (branch: filtering-pipeline). I have not had time yet to fully test and debug it, so there might still be issues, but maybe it is useful. Note that this pipeline requires bi-allelic VCFs as input; you can generate them by simply running `bcftools norm -m-any` on both the VCF with the genotypes and the graph-VCF. I will finish it once I'm back from vacation.

However, since you don't have trio data, I'm not sure if this pipeline is useful in its current form. If you have genotyped the samples present in the graph in addition to the cohort samples, you might still be able to use a modified version of it by taking out the features based on mendelian consistency. Otherwise if you don't have these additional genotypes, there aren't any features left except for the genotype quality, so in that case, filtering based on the quality might be the best way to go.

I hope all of this helps you to set up a filtering pipeline for your data. I'll be offline for the next 2.5 weeks, so apologies in advance for not answering during that time.

JosephLalli commented 2 years ago

I'm using the draft human reference pangenome freeze yr1 from https://www.biorxiv.org/content/10.1101/2022.07.09.499321v1 as my reference graph, aligning individual samples to the draft pangenome using vg giraffe, and calling variants with DeepVariant from the surjected alignments.

I want to then use vg call to call larger indels and SVs.

Finally, I plan on using bcftools to concat these vcfs, and use pangenie to re-call larger indels/SVs.

Since I have the genotypes of the reference graph, I think I can use those as the ground truth dataset. I don't think there are any trios in the draft reference, but I should probably double check that. I definitely don't have any trios in my (very unrelated, very genetically diverse) dataset.

Thank you so much for your help and efforts. Enjoy your vacation, and I'll do my best to let you know if things work on my end.

JD12138 commented 1 year ago

Hi, I am using the Minigraph-Cactus graph VCF file which is ready for PanGenie. I noticed that there are no variants on chromosome Y in this VCF (https://zenodo.org/record/6797328/files/cactus_filtered_ids.vcf.gz). I am confused. Could you tell me the reason?

eblerjana commented 1 year ago

chrY variants were filtered out in the preprocessing step. PanGenie's genotyping model is designed for diploid genotypes. chrY is haploid and in addition, it does not have any recombination at all. Therefore, the current version of PanGenie is not suited for genotyping chrY.

JosephLalli commented 1 year ago

Hi there, just wanted to apologize for this question. Reading it back now, I realize I was confused about how your work generated and utilized the quality filters presented in the paper. While it initially seemed complex to me, in hindsight all you are doing is taking all the variants used to produce the reference graph genome and scoring PanGenie's ability to accurately genotype those variants. It's almost like a mappability score.

I do think it would be helpful to end users if you posted a link to the original paper's scores (I think I found them here? http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/release/v1.0/PanGenie_filters/). Have you calculated these scores for the 44-sample HPRC reference? If so, could you post a link to them?

eblerjana commented 1 year ago

Yes, the link you sent contains the scores for the HGSVC freeze 3 genotypes. An updated version (HGSVC freeze 4) is available here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/release/v2.0/PanGenie_filters/. For HPRC, this data can be found here: https://doi.org/10.5281/zenodo.6797328

JosephLalli commented 1 year ago

Thank you!

Any chance you have these in CHM13v2/T2T coordinates? Or would you recommend lifting them over to T2T?

eblerjana commented 1 year ago

No, we don't have CHM13 coordinates unfortunately. In order to get those, we'd have to re-run the 1000 Genomes genotyping based on the CHM13-based HPRC VCFs, but this has not been done yet. I also would not recommend doing a liftover, especially not for the SVs.

JosephLalli commented 1 year ago

Hi @eblerjana, I just wanted to loop you in on a project of mine.

> re-run the 1000 Genomes genotyping based on the CHM13-based HPRC VCFs

I'm currently doing exactly this. It should be finished in 2-3 weeks. I'll be sure to put the resulting variant calls and variant scores on Zenodo for you to incorporate into the Pangenie project.

eblerjana commented 1 year ago

Did you use the same filtering steps we used in the paper to prepare the CHM13-based VCFs? We recently ran the CHM13-based genotyping on a subset of 300 samples, and I already have the preprocessed Minigraph-Cactus VCF on CHM13 that can be used as input to PanGenie (don't use the unfiltered ones!). In case you need it, let me know. Here is the pipeline I wrote for 1000G genotyping based on CHM13: https://bitbucket.org/jana_ebler/hprc-experiments/src/chm13-based-pipeline/genotyping-experiments/. With `snakemake preprocessing --use-conda -j <number of cores>` you can produce the input VCF needed for PanGenie; with `snakemake population_typing --use-conda -j <number of cores>` you can run the 1000G genotyping. We are also currently working on re-running PanGenie on the full 1000G cohort using the CHM13-based data, but right now I don't have an estimate yet of when exactly we'll have the final results (I'm doing some preliminary testing currently).

eblerjana commented 1 year ago

Note: I just added links to existing input VCFs for PanGenie to the README page (section: "Existing reference panels to use with PanGenie"). One of them is the CHM13-based HPRC VCF (which was produced from the MC-VCF).

JosephLalli commented 1 year ago

Hi @eblerjana!

I'm actually using a CHM13-based reference VCF that I made last August. I tried my best to follow your GRCh38 reference vcf workflow.

An initial comparison suggests that there are a number of differences between the files, which is obviously troubling to me. The node names are completely different. This is likely due to the fact that I used the June 2022 version of the reference draft, which isn't an official release (I found it in the scratch folder of the pangenome s3 bucket).

I've posted a stub repository containing the scripts I used to generate this reference file. I'm curious if you have any major concerns about using this reference vcf, or if it is similar enough to be used as a resource by others.

JosephLalli commented 1 year ago

And I've just uploaded a Zenodo repository of the reference pangenome vcf that I'm using, along with the tools I'm using to run Pangenie on our local high-throughput computing cluster.

eblerjana commented 1 year ago

To me it looks like the underlying VCFs are quite different; they might have been constructed from different versions of the graph (given that the set of variants and the node IDs seem to be different). I was using the official release (https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-chm13.vcf.gz). The construction of the graphs and VCFs was done by co-authors of HPRC, so unfortunately I don't know any details about the files in these scratch directories, but they might contain a lot of preliminary/experimental data.

eblerjana commented 1 year ago

I also saw that you are running jellyfish separately: which file are you providing for `--if`? Is it just the (CHM13-based) reference? If yes, make sure to run it using the "path segments" FASTA file for `--if` (see https://github.com/eblerjana/pangenie/issues/11). PanGenie internally runs jellyfish with the `--if` argument to save time and space for k-mer counting, since only k-mers also contained in the provided file are counted in the reads (instead of all read k-mers). The file given to `--if` is a FASTA file containing the reference and all alternative alleles contained in the graph. This file is produced by PanGenie prior to k-mer counting. So using only the reference would make it miss most of the relevant k-mers specific to variation, which might have a huge effect on the quality of the results. If you want to run jellyfish separately, you can try to follow the instructions (https://github.com/eblerjana/pangenie/issues/11), but the recommended way is to not run it separately.
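
To illustrate what `--if` changes conceptually, here is a small self-contained sketch of restricted k-mer counting (pure Python, just for intuition; file names and k are placeholders, real reads would typically be FASTQ, and jellyfish does this far more efficiently):

```python
# Conceptual sketch of --if: count only those read k-mers that also
# occur in the provided FASTA (reference + alternative alleles).
# File names and K are placeholders.
from collections import Counter

K = 31

def fasta_seqs(path):
    # Yield sequences from a FASTA file, one per record.
    seq = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                    seq = []
            else:
                seq.append(line.strip())
    if seq:
        yield "".join(seq)

def kmers(seq):
    return (seq[i:i + K] for i in range(len(seq) - K + 1))

allowed = {km for s in fasta_seqs("path_segments.fasta") for km in kmers(s)}
counts = Counter(km for s in fasta_seqs("reads.fasta")
                 for km in kmers(s) if km in allowed)
```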

Wangchangsh commented 1 year ago

Hi, is there downstream filtering software or code that supports GQ values from 0 to 10000? VCFtools only supports 0 to 99.

I found that bcftools meets my needs.

Best, Changsheng

eblerjana commented 1 year ago

@JosephLalli CHM13-based genotypes for all 1000G samples are now available for the latest version of PanGenie. See links provided in the README.