eblerjana / pangenie

Pangenome-based genome inference
MIT License

pggb vcf are empty #20

Closed Overcraft90 closed 2 years ago

Overcraft90 commented 2 years ago

Hi again,

I'm having serious trouble processing the PGGB .vcf files... At first the tool pointed out that the haplotypes were un-phased; this is a problem I fixed by collapsing the two haplotype columns for each individual into a single column per sample, using the following awk command:

awk -F '\t' '/^#/ {print;next;} {OFS="\t";for(i=j=10; i < NF; i+=2) {$j = $i"|"$(i+1); j++} NF=j-1}1' pangenome_ref_guided_GRCh38.vcf > pangenome_ref_guided_GRCh38-phased.vcf 

In fact, compared to minigraph-CACTUS, PGGB outputs a .vcf file where the haplotypes of each individual in the graph are split into two columns, which leads PanGenie to think they are un-phased. Anyway, I then proceeded to fix the column headers for the respective samples, but despite running to completion the process still returns an empty .vcf.
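For context, here is what that awk command does on a toy record (a made-up, hypothetical line; two haploid columns per sample get merged into one phased diploid column):

```shell
# Toy demonstration of the column collapse: two samples, each with two
# haploid GT columns, merged into one phased diploid column per sample.
printf 'chr1\t100\t.\tA\tT\t60\t.\t.\tGT\t0\t1\t1\t1\n' \
| awk -F '\t' '/^#/ {print;next;} {OFS="\t";for(i=j=10; i < NF; i+=2) {$j = $i"|"$(i+1); j++} NF=j-1}1'
# -> chr1  100  .  A  T  60  .  .  GT  0|1  1|1
```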

I'm not sure what the cause was, and after talking with Glenn and looking things up online I realised I was missing some pre-processing steps, namely removing complex bubbles and decomposing nested alleles with vcfbub and vcfwave, respectively. I ran the following command:

vcfbub -l 0 -r 10000 -i pangenome_ref_guided_GRCh38-phased.vcf.gz | vcfwave -L 10000 -t 16 -n | bgzip -c -@ 16 > decomposed_and_cleaned.vcf.gz

Unfortunately, I got the same result... So, going back to one of my hypotheses, this could have been linked to having "#" in the sequence names, as per the PanSN naming standard used by PGGB. I tested removing everything but the chr(number) string from the #CHROM column; however, even after this attempt the result is always an empty .vcf file. I thought it might help to re-run the vcfbub and vcflib commands, but this time I got the following error:

error: more sample names in header than sample fields
samples: CHM13 HG002 HG00514 HG00733 HG03492 NA19240
line: chr1 418412 >11061496>11061499 T C 60.0 . AC=1;AF=1;AN=1;AT=>11061496>11061497>11061499,>11061496>11061498>11061499;NS=1;LV=0 GT .|. .|. .|. .|. .|1
thread 'main' panicked at 'called Result::unwrap() on an Err value: IoError(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })', src/main.rs:199:46
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

This is strange, as I didn't get that error previously; maybe it's because I ran the pre-processing before editing with awk, I don't know for sure...

Please, if you have any clue about what is happening, I would really appreciate some help. I tried several approaches but all failed, and I wonder whether there is something I'm missing. With that said, I'm happy to attach a screenshot of the original .vcf file if needed, and I'm sorry for the long text.

Thanks in advance!

P.S. In all trials I always made sure the contig names in the reference and the #CHROM column in the .vcf were consistent.

eblerjana commented 2 years ago

So you ran PanGenie and the output VCF is empty? Is there any warning/error reported in PanGenie's log output?

I guess this is rather a problem of the input VCF. PanGenie needs phased VCFs as input that contain non-overlapping records representing top-level bubbles. For the HPRC data, we filtered out all non-top-level bubbles using vcfbub (see the Methods section, p.65 of our preprint https://www.biorxiv.org/content/10.1101/2022.07.09.499321v1 for details and commands) and then ran PanGenie on the resulting VCF. We also removed records in the VCF for which more than 20% of the haplotypes contain missing (".") alleles. I never used vcfwave, but I don't think running it prior to PanGenie works. I assume it produces overlapping variant alleles since it tries to decompose bubbles, and PanGenie does not work on variants represented in this way. The decomposition we did in the HPRC paper was only done for evaluation purposes. It happened after genotyping and is not related to PanGenie, but only to the structure of the Minigraph-Cactus graph/VCF, which enables translating genotypes computed for bubbles into genotypes of all nested variant alleles. Note, however, that this approach was specifically designed for MC and does not work for PGGB.
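The missing-allele filter described above can be illustrated with a small awk sketch (this is a toy equivalent for illustration only, not the pipeline's actual script; the toy.vcf records and sample names are made up):

```shell
# Hedged sketch: drop VCF records where more than 20% of the haplotype
# alleles are missing ("."). Toy two-record input: the second record has
# 50% missing alleles and should be dropped.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' >  toy.vcf
printf 'chr1\t1\t.\tA\tT\t.\t.\t.\tGT\t0|1\t1|1\n'                       >> toy.vcf
printf 'chr1\t2\t.\tA\tT\t.\t.\t.\tGT\t.|.\t0|1\n'                       >> toy.vcf

awk -F '\t' '/^#/ {print; next}
{
  miss = 0; total = 0
  for (i = 10; i <= NF; i++) {              # per-sample GT columns
    n = split($i, a, "[|/]")                # "0|1" -> alleles 0 and 1
    for (k = 1; k <= n; k++) { total++; if (a[k] == ".") miss++ }
  }
  if (miss <= 0.2 * total) print            # keep records with <= 20% missing
}' toy.vcf > toy-filtered.vcf
# toy-filtered.vcf keeps the header and only the first record
```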

Overcraft90 commented 2 years ago

Hi @eblerjana and thanks a lot for the speedy reply,

To answer your first question: no major errors that I can pinpoint to something going terribly wrong; in fact, as I said, the Snakemake pipeline runs to completion. I've attached the log output just in case you might be able to find something I'm missing:

program: PanGenie - genotyping and phasing based on kmer-counting and known haplotype sequences.
author: Jana Ebler
Files and parameters used:
-c 0 -d 0 -e 3000000000 -g 1 -i /g100_work/IscrC_PanSV/2.mapping_benchmark/HG005.fastq -j 24 -k 31 -o /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/pangenie/sample1_graph -p 0 -r /g100_work/IscrC_PanSV/pangenomes/GRCh38_p14v2.0-vcf_headers2.fna -s sample -t 24 -u 0 -v /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/pangenome/pangenome.vcf
Determine allele sequences ...
Found 708 chromosome(s) from the reference file.
Identified 0 variants in total from VCF-file.

Memory usage until now: 1.66526 GB

Write path segments to file: /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/pangenie/sample1_graph_path_segments.fasta ...
Found 0 chromosome(s) in the VCF.

Memory usage until now: 1.97328 GB

Count kmers in reads ...
Histogram peaks: 19 (116849478), 279 (15679)
Computed kmer abundance peak: 19
Count kmers in genome ...

Memory usage until now: 50.1948 GB

Determine unique kmers ...
Warning: using 0 for determining unique kmers.

Memory usage until now: 50.1948 GB

Memory usage until now: 50.1948 GB

0 1 2 3 4 5 6 7 8 9 10 11 12

Sampled 1 subset(s) of paths each of size 13 for genotyping.

Memory usage until now: 50.1948 GB

Construct HMM and run core algorithm ...
Warning: using 0 for genotyping.
Write results to VCF ...

Summary

time spent reading input files: 46.8324 sec
time spent counting kmers: 1579.05 sec
time spent selecting paths: 0.000298384 sec
time spent determining unique kmers: 0.00325644 sec
total running time: 1625.89 sec
total wallclock time: 1625.89 sec
Total maximum memory usage: 50.1948 GB

Also, I should have mentioned that I always run PanGenie using the Snakefile provided with the tool, precisely in order to avoid the problem of non-overlapping records representing top-level bubbles. From my understanding the pipeline should take care of that, and then convert the VCF back to its original format.

However, the real problem is not vcfbub, which I could run but which didn't change the outcome, but rather something you mentioned that I had noticed without knowing how to fix: I have plenty of "." in my VCF. At the beginning I didn't really think this could be an issue, as the raw VCF from minigraph-CACTUS ran through the PanGenie pipeline with no pre-processing and without any "." having been taken care of (and there were many in that case too).

My guess at this point is that the pggb VCF for some reason has more "." than the minigraph-CACTUS one, ultimately triggering the 20% missing-allele threshold and returning an empty imputation for the sample. So my question now (as this is not easy to assess visually, which is why even though I noticed it I couldn't really do much) is: is there any way/command to filter out those missing alleles?

Thanks in advance!

eblerjana commented 2 years ago

Ok, so the Snakemake pipeline provided automatically removes all positions from the VCF that are covered by less than 80% of the haplotypes, and this apparently causes PanGenie to be run on an empty VCF in your case (since the PanGenie log says it read 0 variants from the input). But this suggests that every single record in your input VCF has more than 20% of haplotypes missing, which does not seem right. What does this VCF look like? How many haplotypes are in it, and why are so many of them "."? Can you share your input VCF (or a small part of it)?

Overcraft90 commented 2 years ago

@eblerjana This is exactly what is weird to me as well. My pangenome has five individuals (10 haplotypes). I assembled it with both minigraph-CACTUS and pggb, but the latter has been very challenging to work with when it comes to genome inference with PanGenie. It seems to me the VCF files generated by the two assemblers are substantially different from one another in terms of how information is stored.

The minigraph-CACTUS VCF, however, as I mentioned before, runs without problems even in the form of raw output from the assembler. Anyway, below are portions of the VCF; maybe you can spot some issues I'm not able to identify.

head -n 1500 pangenome_ref_guided_GRCh38.vcf

Screenshot from 2022-09-21 10-04-17

tail -n 1500 pangenome_ref_guided_GRCh38.vcf

Screenshot from 2022-09-21 10-07-00

P.S. I don't think the headers are in any way involved in the process, so I omitted them. Also, this VCF is the raw output from pggb before any pre-processing, e.g. column collapsing. The compressed version of the VCF is only 1.6GB; if you are happy with that, I might be able to share it on Dropbox.

eblerjana commented 2 years ago

But you first collapsed the columns and fixed the contig names before running the PanGenie Snakemake pipeline, right? Can you please show me what the VCF you use for running the PanGenie pipeline looks like? The VCF you shared above will not work, since there needs to be a single, phased, diploid genotype for each sample, and the names of the contigs must also match the names of the contigs in the provided reference genome. See the provided demo VCF for an example: https://github.com/eblerjana/pangenie/blob/master/demo/test-variants.vcf
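For a concrete picture, a record in the shape PanGenie accepts looks roughly like this (a hypothetical example sketched here with made-up sample names and a made-up position; the linked demo file is the authoritative reference):

```shell
# Hypothetical two-sample record in the shape PanGenie expects:
# one phased diploid GT per sample, contig names matching the reference FASTA.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG002\tHG00514\n'
printf 'chr1\t1000\t.\tA\tT\t.\tPASS\t.\tGT\t0|1\t1|1\n'
```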

Overcraft90 commented 2 years ago

Hi @eblerjana, I apologise; I actually thought the raw VCF would be more appropriate.

Below are the same sections for the modified version. Note that I haven't run vcfbub on this file yet.

head -n 1500 pangenome_ref_guided_GRCh38.vcf

Screenshot from 2022-09-22 12-21-19

tail -n 1500 pangenome_ref_guided_GRCh38.vcf

Screenshot from 2022-09-22 12-21-19

eblerjana commented 2 years ago

Thanks, but both of these screenshots show the same variants; I assume the second one should be the end of the file? The variants shown here are all filtered out because of the high number of missing alleles (> 20%).

eblerjana commented 2 years ago

Also, can you send screenshots of the file after running vcfbub, i.e. exactly the files you give as input to the Snakemake pipeline? Otherwise it's difficult to find out what the issue is here.

Overcraft90 commented 2 years ago

Sure. I would like to mention, however, that running the VCF through vcfbub or not didn't make any difference either way; in fact, I tried feeding both files to the Snakemake pipeline.

This is the output after vcfbub; it seems like running vcfbub alone (without vcfwave) hasn't produced any major change... still, maybe you might be able to spot something I'm missing. This is the command I ran:

vcfbub -l 0 -r 10000 -i pangenome_ref_guided_GRCh38-no_spec_char-phased.vcf.gz > clean.vcf.gz

And these are the same portions of the "clean" VCF:

head -n 1500 clean.vcf.gz

Screenshot from 2022-09-23 10-11-05

tail -n 1500 clean.vcf.gz

Screenshot from 2022-09-23 10-12-23

eblerjana commented 2 years ago

Thanks. You should definitely run vcfbub before running PanGenie on the pggb VCF to remove the non-top-level bubbles (labelled as LV > 0). From the screenshots I can't really tell what the problem is, because what's shown in the first screenshot is definitely all filtered out due to the high number of missing alleles in the haplotypes. The variants in the second screenshot do not seem to be located on any GRCh38-related contig. Would it be possible to share the VCF?

Overcraft90 commented 2 years ago

Yeah, sure. Here is a link to a Dropbox folder where I added both the compressed and uncompressed versions of the VCF file I fed to PanGenie.

https://www.dropbox.com/scl/fo/ygsr14qay35dxm0kh4hpx/h?dl=0&rlkey=muvodwg9v437g0iekmo31s5vc

However, regarding the problem of variants not being located on any particular GRCh38 contig, my understanding is that this is the job done by vcfwave. I've seen some changes in the INFO column of the VCF file after running it through the tool, though I must say I'm not completely sure, as I'm still trying to fully comprehend what modifications it makes to the VCF file structure.

Still, for the current issue I'm just sharing the files pre-processed using only vcfbub, so we can really narrow down what's going on to only one possible variable, and eventually tackle new problems one at a time.

Thanks for the help!

eblerjana commented 2 years ago

Thanks, I checked the files and noticed that your VCF is malformed. The header line of the VCF specifies 6 samples (CHM13, HG002, HG00514, HG00733, HG03492, NA19240), while there are only 5 genotype columns in the variant records:

issue

This causes bcftools norm (used in the pipeline) to output an empty VCF, as lines that are not correctly formatted are skipped.
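A quick way to catch this kind of malformation up front can be sketched in awk (a hedged sketch; check.vcf is a made-up example with three sample names in the header but only two genotype columns in the record):

```shell
# Compare the number of sample names in the #CHROM header with the number
# of genotype columns in each record; mismatches are reported.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' >  check.vcf
printf 'chr1\t1\t.\tA\tT\t.\t.\t.\tGT\t0|1\t1|1\n'                           >> check.vcf  # only 2 GT columns

awk -F '\t' '
/^#CHROM/ { expected = NF - 9; next }
!/^#/ && NF - 9 != expected {
  print "malformed record at " $1 ":" $2 " (" NF - 9 " genotype columns, " expected " samples)"
}' check.vcf
```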

Overcraft90 commented 2 years ago

@eblerjana Thanks a lot for your help, and I'm sorry it ended up being such a silly mistake. However, I must say the VCF generated by minigraph-CACTUS has exactly the same structure, other than having a single additional column for CHM13.

This led me to think of two possible scenarios:

  1. The process of generating a VCF out of PGGB (flag -V) is somehow arbitrary and can result in such an erroneous structural format for the file
  2. I should have merged CHM13 and GRCh38 (the latter being my reference) into a single FASTA file before assembling the pangenome, and later removed the unplaced contigs from the reference

Glenn pointed me to a page showing this exact procedure for the PGGB assembly (scenario 2). I'm sorry I didn't mention that, but the procedure I followed is not the HPRC pipeline/workflow in all its parts; it is more of a custom-made pipeline I designed according to what was feasible for this project within the three years of my PhD program.

With that said, this means that even specifying a reference for PGGB when generating a VCF doesn't seem to be sufficient; maybe this is related to the fact that the tool is reference-agnostic in the first place? Sorry for all these questions, but I'm also trying to figure out in my mind what's going on...

Thanks again, and please let me know what I should do. Could removing the CHM13 column be the solution? It doesn't seem right, though...

Overcraft90 commented 2 years ago

UPDATE

I re-ran PanGenie on a new VCF I edited so that there are six columns, one of them for CHM13 (without any "|"; just 0, 1 or "."). However, I'm getting the old error related to un-phased haplotypes...

Error in rule prepare_vcf:
    jobid: 4
    output: /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/input-vcf/input-missing-removed.vcf
    log: /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/input-vcf/prepare-vcf.log (check log file(s) for error message)
    conda-env: /g100/home/userexternal/mungaro0/tools/pangenie/pipelines/run-from-callset/.snakemake/conda/bbe495a30db1f3ec8af5060c16c66aaa_
    shell:
        cat /g100_work/IscrC_PanSV/VGS/pggb_ref_GRCh38p14-vcf/clean.vcf | python3 scripts/prepare-vcf.py --missing 0.2 2> /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/input-vcf/prepare-vcf.log 1> /g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/input-vcf/input-missing-removed.vcf
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job prepare_vcf since they might be corrupted:
/g100_work/IscrC_PanSV/2.mapping_benchmark/1.pggb/input-vcf/input-missing-removed.vcf
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-09-27T230812.739495.snakemake.log

How is this possible? I get that I made a mistake merging from column 10 (which is CHM13 and should have been left alone); that's why I'm now merging from column 11 onwards, i.e. only my diploid assemblies. But this still doesn't explain the error... I wonder whether there is a specific way to collapse columns at this point.

If it helps, I can share this new VCF; let me know.

Overcraft90 commented 2 years ago

SOLVED

Hi @eblerjana,

I just realised a quick fix would have been to duplicate the column for CHM13 and move the copy next to the original one; this can be done in two sequential steps with awk. Once completed, I ran PanGenie again on this new VCF and it worked!
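For what it's worth, the same effect can be sketched in a single awk step (assuming, hypothetically, that CHM13 sits in column 10), turning each haploid genotype directly into a homozygous diploid one instead of duplicating and collapsing:

```shell
# Hedged sketch: assume CHM13 is the first sample column (column 10) and
# rewrite its haploid genotypes ("0", "1", ".") as homozygous diploid ones.
printf 'chr1\t100\t.\tA\tT\t60\t.\t.\tGT\t1\t0|1\n' \
| awk -F '\t' 'BEGIN{OFS="\t"} /^#/ {print; next} {$10 = $10 "|" $10; print}'
# the CHM13 column becomes "1|1"; all other samples are left untouched
```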

However, this raises another problem: is it correct, from a procedural/theoretical point of view, to add information to the file when that information is not only absent from the graph structure but could also affect downstream analyses? (The addition is admittedly not different in nature; in fact, it simply translates into having homozygous variants, i.e. in double copy, for CHM13, which was originally haploid.)

Let me know; maybe the variants/markers called from CHM13 then have to be halved, and half of those present in CHM13 subtracted from the total number of variants. If this holds, it could possibly be a fix.

Thanks for all your help and time!

eblerjana commented 2 years ago

PanGenie is based on the assumption of having haplotype-resolved assemblies from diploid samples; therefore, genotypes in the input VCF must be diploid. CHM13 is not a phased assembly and, strictly speaking, does not necessarily represent a real human haplotype (since, again, it is not phased). However, the model makes use of linkage information across haplotypes, so it should be used with "real" phased haplotype sequences. If you want to add CHM13, you can do it by simply copying it (as you suggested). The model then treats both copies as if they were true haplotypes (which, again, is not really true). You don't need to do any post-processing like dividing the number of variants or subtracting variants present in CHM13.

Overcraft90 commented 2 years ago

Thanks for the clarifications; I think we can close this for now.