Overcraft90 closed this issue 2 years ago.
So you ran PanGenie and the output VCF is empty? Is there any warning/error reported in PanGenie's log output?
I guess this is rather a problem with the input VCF. PanGenie needs phased VCFs as input that contain non-overlapping records representing top-level bubbles. For the HPRC data, we filtered out all non-top-level bubbles using vcfbub (see the Methods section, p.65 of our preprint https://www.biorxiv.org/content/10.1101/2022.07.09.499321v1 for details and commands) and then ran PanGenie on the resulting VCF. We also removed records in the VCF for which more than 20% of the haplotypes contain missing (".") alleles. I never used vcfwave, but I don't think running it prior to PanGenie works. I assume it produces overlapping variant alleles, since it tries to decompose bubbles, and PanGenie does not work on variants represented in this way. The decomposition we did in the HPRC paper was only done for evaluation purposes. It happened after genotyping and is not related to PanGenie, but only to the structure of the Minigraph-Cactus graph/VCF, which enables translating genotypes computed for bubbles into genotypes of all nested variant alleles. Note, however, that this approach was specifically designed for MC and does not work for PGGB.
Hi @eblerjana and thanks a lot for the speedy reply,
To answer your first question: no major errors that I can pinpoint to something going terribly wrong; in fact, as I said, the Snakemake pipeline runs to completion. Attached is the .log output, just in case you might be able to find something I'm missing.
program: PanGenie - genotyping and phasing based on kmer-counting and known haplotype sequences.
author: Jana Ebler
Files and parameters used: -c 0 -d 0 -e 3000000000 -g 1 -i /g100_work/IscrC_PanSV/2.mapping_benchmark/HG005.fastq -j 24 -k 31 -o /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/pangenie/sample1_graph -p 0 -r /g100_work/IscrC_PanSV/pangenomes/GRCh38_p14v2.0-vcf_headers2.fna -s sample -t 24 -u 0 -v /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/pangenome/pangenome.vcf
Determine allele sequences ...
Found 708 chromosome(s) from the reference file.
Identified 0 variants in total from VCF-file.
Memory usage until now: 1.66526 GB
Write path segments to file: /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/pangenie/sample1_graph_path_segments.fasta ...
Found 0 chromosome(s) in the VCF.
Memory usage until now: 1.97328 GB
Count kmers in reads ...
Histogram peaks: 19 (116849478), 279 (15679)
Computed kmer abundance peak: 19
Count kmers in genome ...
Memory usage until now: 50.1948 GB
Determine unique kmers ...
Warning: using 0 for determining unique kmers.
Memory usage until now: 50.1948 GB
Memory usage until now: 50.1948 GB
0 1 2 3 4 5 6 7 8 9 10 11 12
Sampled 1 subset(s) of paths each of size 13 for genotyping.
Memory usage until now: 50.1948 GB
Construct HMM and run core algorithm ...
Warning: using 0 for genotyping.
Write results to VCF ...
Summary
time spent reading input files: 46.8324 sec
time spent counting kmers: 1579.05 sec
time spent selecting paths: 0.000298384 sec
time spent determining unique kmers: 0.00325644 sec
total running time: 1625.89 sec
total wallclock time: 1625.89 sec
Total maximum memory usage: 50.1948 GB
Also, I should have mentioned that I always run PanGenie using the Snakefile provided with the tool, precisely to avoid the problem of overlapping records and non-top-level bubbles. From my understanding, the pipeline should take care of that and then convert the VCF back to its original format.
However, the real problem is not vcfbub, which I did run but it didn't change the outcome, but rather something you mentioned that I had noticed without knowing how to fix: I have plenty of "." in my VCF... At first I didn't really think this could be an issue, since the raw VCF from minigraph-CACTUS could run through the PanGenie pipeline with no pre-processing and without any "." being taken care of (and there were many in that case too).
My guess at this point is that the pggb VCF for some reason has more "." than the minigraph-CACTUS one, ultimately triggering the threshold of 20% missing alleles and returning an empty imputation for the sample. So, my question now (as this is not easy to assess visually, which is why I couldn't really do much even after realising it) is: is there any way/command to filter out those missing alleles?
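Not the pipeline's own script, but as a sketch of the kind of filter being asked about here: an awk pass that drops records where more than 20% of the haplotype alleles are ".". The demo VCF is made up, and genotype columns starting at column 10 is an assumption.

```shell
# Sketch only: drop VCF records where more than 20% of haplotype alleles
# are missing ("."). Demo input below is hypothetical.
printf '##fileformat=VCFv4.2\n' > demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' >> demo.vcf
printf 'chr1\t100\t.\tA\tT\t.\t.\t.\tGT\t0|1\t1|1\n' >> demo.vcf
printf 'chr1\t200\t.\tC\tG\t.\t.\t.\tGT\t.|.\t0|.\n' >> demo.vcf

awk 'BEGIN{FS=OFS="\t"}
/^#/ {print; next}
{
  miss = 0; total = 0
  for (i = 10; i <= NF; i++) {            # genotype columns assumed to start at 10
    n = split($i, h, "[|/]")
    for (j = 1; j <= n; j++) { total++; if (h[j] == ".") miss++ }
  }
  if (miss <= 0.2 * total) print          # keep records with <= 20% missing alleles
}' demo.vcf > filtered.vcf

grep -vc '^#' filtered.vcf   # 1: only the first record survives
```

In the pipeline itself this filtering is done by scripts/prepare-vcf.py --missing 0.2, so this one-liner is only meant as a quick stand-alone check.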
Thanks in advance!
Ok, so the Snakemake pipeline provided automatically removes all positions from the VCF that are covered by less than 80% of the haplotypes, and this apparently causes PanGenie to be run on an empty VCF in your case (since the PanGenie log says it read 0 variants from the input). But this suggests that every single record in your input VCF has more than 20% of haplotypes missing, which does not seem right. What does this VCF look like? How many haplotypes are in it, and why are so many of them "."? Can you share your input VCF (or a small part of it)?
@eblerjana This is exactly what is weird to me as well.
My pangenome has five individuals (10 haplotypes). I assembled it both with minigraph-CACTUS and pggb, but the latter has been very challenging to work with when it comes to genome inference with PanGenie. It seems to me the VCF files generated by the two assemblers are substantially different from each other in terms of how information is stored.
The minigraph-CACTUS VCF, however, as I mentioned before, runs without problems even as raw output from the assembler. Anyway, below are portions of the VCF; maybe you can find some issues I'm not able to identify.
head -n 1500 pangenome_ref_guided_GRCh38.vcf
tail -n 1500 pangenome_ref_guided_GRCh38.vcf
P.S. I don't think the headers are involved in the process in any way, so I omitted them. Also, this VCF is the raw output from pggb before any pre-processing, e.g. column collapsing. The compressed version of the VCF is only 1.6 GB; if you like, I can share it on Dropbox.
But you first collapsed the columns and fixed the contig names before running the PanGenie Snakemake pipeline, right? Can you please show me what the VCF looks like that you use for running the PanGenie pipeline? The VCF you shared above will not work, since there needs to be a single, phased diploid genotype for each sample and also the names of the contigs must match the names of the contigs in the provided reference genome. See the provided demo VCF for an example: https://github.com/eblerjana/pangenie/blob/master/demo/test-variants.vcf
Hi @eblerjana, I apologise, I actually thought the raw VCF would be more appropriate.
Below are the same sections for the modified version. Noted, I haven't run vcfbub on this file yet.
head -n 1500 pangenome_ref_guided_GRCh38.vcf
tail -n 1500 pangenome_ref_guided_GRCh38.vcf
Thanks, both of these screenshots show the same variants; I assume the second one should be the end of the file? The variants shown here are all filtered out because of the high number of missing alleles (> 20%).
Also, can you send screenshots of the file after running vcfbub, so exactly the files you give as input to the Snakemake pipeline? Otherwise it's difficult to find out what the issue is here.
Sure. I would like to mention, however, that running the VCF through vcfbub or not made no difference either way; in fact, I tried feeding both files to the Snakemake pipeline.
This is the output after vcfbub; it seems that running solely vcfbub (and not vcfwave) hasn't produced any major change... still, maybe you might be able to spot something I'm missing.
This is the command I ran:
vcfbub -l 0 -r 10000 -i pangenome_ref_guided_GRCh38-no_spec_char-phased.vcf.gz > clean.vcf.gz
And these are the same portions of the "clean" VCF:
head -n 1500 clean.vcf.gz
tail -n 1500 clean.vcf.gz
Thanks. You should definitely run vcfbub before running PanGenie on the pggb VCF to remove the non-top-level bubbles (labelled as LV > 0). From the screenshots I can't really tell what the problem is, because what's shown in the first screenshot is definitely all filtered out due to the high number of missing alleles in the haplotypes. The variants in the second screenshot do not seem to be located on any GRCh38-related contig. Would it be possible to share the VCF?
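As a quick way to see how much vcfbub has to remove, one can tally the LV values (the bubble nesting level that pggb's vg deconstruct step writes to INFO, as visible in the screenshots); a sketch on made-up records:

```shell
# Sketch: tally bubble nesting levels via the LV INFO tag.
# Top-level records have LV=0; vcfbub -l 0 should remove everything else.
# The three records below are invented for demonstration.
printf 'chr1\t100\t.\tA\tT\t.\t.\tAC=1;LV=0\n' > demo_body.vcf
printf 'chr1\t150\t.\tG\tC\t.\t.\tAC=2;LV=1\n' >> demo_body.vcf
printf 'chr1\t200\t.\tC\tG\t.\t.\tAC=1;LV=0\n' >> demo_body.vcf

grep -o 'LV=[0-9]*' demo_body.vcf | sort | uniq -c
# counts per level: 2 records at LV=0, 1 record at LV=1
```

If this shows only LV=0 after running vcfbub, the nesting filter did its job and the empty output must have another cause.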
Yeah, sure.
Following is a link to a Dropbox folder where I added both the compressed and uncompressed versions of the VCF file I fed to PanGenie.
https://www.dropbox.com/scl/fo/ygsr14qay35dxm0kh4hpx/h?dl=0&rlkey=muvodwg9v437g0iekmo31s5vc
However, regarding the problem of variants not located on any particular GRCh38 contig, my understanding is that this is the job done by vcfwave. I've seen some changes in the INFO column of the VCF file after running it through the tool, though I must say I'm not completely sure, as I'm still trying to fully comprehend what modifications it makes to the VCF file structure.
Still, for the current issue I'm only sharing the files pre-processed with vcfbub, so we can really narrow down what's going on to a single variable and eventually tackle new problems one at a time.
Thanks for the help!
Thanks, I checked the files and noticed that your VCF is malformed. The header line of the VCF specifies 6 samples (CHM13, HG002, HG00514, HG00733, HG03492, NA19240), while there are only 5 genotype columns in the variant records:
This causes bcftools norm (used in the pipeline) to output an empty VCF, as lines that are not correctly formatted are skipped.
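A mismatch like this can be caught before running the pipeline by comparing the column count of the #CHROM header line against every record; a minimal sketch on a made-up VCF:

```shell
# Sketch: flag records whose column count differs from the #CHROM header.
# These are the kinds of lines bcftools norm silently skips, per the above.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' > demo.vcf
printf 'chr1\t100\t.\tA\tT\t.\t.\t.\tGT\t0|1\t1|1\n' >> demo.vcf
printf 'chr1\t200\t.\tC\tG\t.\t.\t.\tGT\t0|0\n' >> demo.vcf   # one sample column missing

awk 'BEGIN{FS="\t"}
/^#CHROM/ {hdr = NF; next}
!/^#/ && NF != hdr {printf "line %d: %d columns, header has %d\n", NR, NF, hdr}' demo.vcf
# reports: line 3: 10 columns, header has 11
```

An empty output means every record has as many columns as the header declares samples.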
@eblerjana Thanks a lot for your help, and I'm sorry it ended up being such a stupid mistake. However, I must say the VCF generated by minigraph-CACTUS has exactly the same structure, other than having an additional column for CHM13.
This led me to think of two possible scenarios:
1. The VCF generation in PGGB (flag -V) is somehow arbitrary and can result in such an erroneous structural format for the file.
2. Glenn pointed me to a page where this exact procedure for the PGGB assembly was shown. I'm sorry I didn't mention that, but the procedure I followed is not the HPRC pipeline/workflow in all its parts; it is more a custom-made pipeline I designed according to what was feasible for this project within the three years of my PhD program.
With that said, this means that even specifying a reference for PGGB when generating a VCF doesn't seem to be sufficient; maybe this is related to the fact that the tool is reference-agnostic in the first place?
Sorry for all these questions, but I'm also trying to figure out in my mind what's going on...
Thanks again, and please let me know what I should do; could removing the CHM13 column be the solution? It doesn't seem right though...
I re-ran PanGenie on a new VCF I edited so that there are six columns, one of which is for CHM13 (without any "|"; just 0, 1 or "."). However, I'm getting the old error related to un-phased haplotypes...
Error in rule prepare_vcf:
jobid: 4
output: /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/input-vcf/input-missing-removed.vcf
log: /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/input-vcf/prepare-vcf.log (check log file(s) for error message)
conda-env: /g100/home/userexternal/mungaro0/tools/pangenie/pipelines/run-from-callset/.snakemake/conda/bbe495a30db1f3ec8af5060c16c66aaa_
shell:
cat /g100_work/IscrC_PanSV/VGS/pggb_ref_GRCh38p14-vcf/clean.vcf | python3 scripts/prepare-vcf.py --missing 0.2 2> /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/input-vcf/prepare-vcf.log 1> /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/input-vcf/input-missing-removed.vcf
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job prepare_vcf since they might be corrupted:
/g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/input-vcf/input-missing-removed.vcf
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-09-27T230812.739495.snakemake.log
How is this possible? I get that I made a mistake merging from column 10 (which is CHM13 and should have been left alone); hence I'm now merging from column 11 onwards, i.e. only my diploid assemblies. But this still doesn't explain the error... I wonder whether there is a specific way to collapse columns at this point.
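For what it's worth, here's a sketch of one way such a collapse could look: merging haploid column pairs from column 11 onwards into phased diploid genotypes, leaving columns 1-10 (including the CHM13 column) untouched. The column layout and the demo record are assumptions, not the thread's actual data.

```shell
# Sketch: merge haploid column pairs (11,12), (13,14), ... into "hapA|hapB"
# diploid genotypes, keeping columns 1-10 (including the CHM13 column) as-is.
# Assumes an even number of haploid columns after column 10; the sample
# names in the #CHROM header still need collapsing separately.
printf 'chr1\t100\t.\tA\tT\t.\t.\t.\tGT\t0\t0\t1\t1\t.\n' > demo_body.vcf

awk 'BEGIN{FS=OFS="\t"}
/^#/ {print; next}
{
  out = $1
  for (i = 2; i <= 10; i++) out = out OFS $i
  for (i = 11; i < NF; i += 2) out = out OFS $i "|" $(i+1)
  print out
}' demo_body.vcf
```

On the demo record the four haploid alleles 0, 1, 1, "." become the two diploid genotypes 0|1 and 1|., while the CHM13 column stays haploid.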
If it helps, I can share this new VCF; let me know.
Hi @eblerjana,
I just realised a quick fix would be to duplicate the column for CHM13 and move it next to the actual sample. This can be done in two sequential steps with awk. Once done, I ran PanGenie again on this new VCF and it worked!
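For completeness, the duplication step can be a single awk pass; a sketch, assuming CHM13 sits in column 10 as discussed, with a made-up demo record:

```shell
# Sketch: turn the haploid CHM13 column (assumed to be column 10) into a
# pseudo-diploid genotype by pairing each allele with itself ("1" -> "1|1").
printf 'chr1\t100\t.\tA\tT\t.\t.\t.\tGT\t1\t0|1\n' > demo_body.vcf

awk 'BEGIN{FS=OFS="\t"}
!/^#/ { $10 = $10 "|" $10 }
{ print }' demo_body.vcf
```

A missing allele "." would become ".|." under the same rule, which the missing-allele filter then treats like any other missing diploid genotype.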
However, this raises another problem: is it correct, from a procedural/theoretical point of view, to add information, albeit not different in nature (in fact this simply translates into having homozygous variants, i.e. in double copy, for CHM13, which was originally haploid), to the file when this information not only is not contained in the graph structure but could also affect downstream analyses?
Let me know; maybe variants/markers called from CHM13 then have to be halved, and half of those present in CHM13 has to be subtracted from the total number of variants. If this holds, it could possibly be a fix.
Thanks for all your help and time!
PanGenie is based on the assumption of having haplotype-resolved assemblies from diploid samples, therefore genotypes in the input VCF must be diploid. CHM13 is not a phased assembly and, strictly speaking, does not necessarily represent a real human haplotype (since again it is not phased). However, the model makes use of linkage information of haplotypes, so it should be used with "real" phased haplotype sequences. If you want to add CHM13, you can do it by simply copying it (as you suggested). The model then treats both copies as if they were true haplotypes (which again is not really true). You don't need to do any post-processing like dividing the number of variants or subtracting variants present in CHM13.
Thanks for the clarifications, I think we can close for now.
Hi again,
I'm having serious trouble processing the PGGB .vcf files... At first the tool pointed out that the haplotypes were un-phased; this is a problem I fixed by collapsing the columns for each individual haplotype into a single one per sample, using the following awk command:
In fact, compared to minigraph-CACTUS, PGGB outputs a .vcf file where the haplotypes for each individual in the graph are split into two columns, which leads PanGenie to think they are un-phased. Anyway, I then proceeded to fix the column headers for the respective samples, but despite running to completion the process still returns an empty .vcf.
I'm not sure what the cause was, and after talking with Glenn and looking it up online I realised I was missing some pre-processing steps, namely removing complex bubbles and decomposing nested alleles with vcfbub and vcfwave, respectively. I ran the following command:
Unfortunately, I got the same result... So, going back to one of my hypotheses, this could be linked to having # in the filenames as per the PanSN naming standard used by PGGB. I tested removing everything but the string chr(number) from the #CHROM column; however, even after this attempt the result is always an empty .vcf file. I thought it could be helpful to re-run the vcfbub and vcflib commands, but this time I got the following error:
This is strange, as I didn't get that previously, maybe because I ran the pre-processing before editing with awk, I don't know for sure...
Please, if you have any clue/idea of what is happening, I would really appreciate some help. I tried several approaches but all failed, and I wonder whether there is something I'm missing. With that said, I'm happy to attach a screenshot of the original .vcf file if needed, and I'm sorry for the long text.
Thanks in advance!
P.S. I always made sure the contig names in the reference and the #CHROM in the .vcf were consistent in all trials.