PharmGKB / PharmCAT

The Pharmacogenomic Clinical Annotation Tool
Mozilla Public License 2.0
120 stars 39 forks source link

Question about VCF error and Help with preprocessor error #116

Closed Pharmacogenetecist closed 2 years ago

Pharmacogenetecist commented 2 years ago

We utilize the Illumina Dragen pipeline for processing our WGS. I have noticed the following error when running PHARMCAT 2.0 and prior versions.

java.lang.IllegalArgumentException: CHROM column "HLA-DQA1*05:03" contains whitespace or colons at org.pharmgkb.parser.vcf.model.VcfPosition.(VcfPosition.java:66) at org.pharmgkb.parser.vcf.VcfParser.parseNextLine(VcfParser.java:238) at org.pharmgkb.parser.vcf.VcfParser.parse(VcfParser.java:127) at org.pharmgkb.pharmcat.haplotype.VcfReader.read(VcfReader.java:138) at org.pharmgkb.pharmcat.haplotype.VcfReader.(VcfReader.java:65) at org.pharmgkb.pharmcat.haplotype.NamedAlleleMatcher.buildVcfReader(NamedAlleleMatcher.java:161) at org.pharmgkb.pharmcat.haplotype.NamedAlleleMatcher.call(NamedAlleleMatcher.java:212) at org.pharmgkb.pharmcat.PharmCAT.execute(PharmCAT.java:322) at org.pharmgkb.pharmcat.PharmCAT.main(PharmCAT.java:194)

I was hoping you could explain this error and if there was an easy fix my end.

I tried to preprocess the VCF, followed the instruction to get all the pre-requistes, but get the following error;

python3 PharmCAT_VCF_Preprocess.py -vcf /pharmcatDATAin/NA12878.gvcf.gz -o /pharmcatDATAin/PHARMCAT_Preprocessed/ Traceback (most recent call last): File PharmCAT_VCF_Preprocess.py, line 227, in run(args) File PharmCAT_VCF_Preprocess.py, line 44, in run tool_version = re.search(r'Version: (\d.\d+)*', version_message).group(1) AttributeError: 'NoneType' object has no attribute 'group'

BinglanLi commented 2 years ago
  1. There seems to be something wrong with a line in your VCF where the CHROM column is HLA-DQA1*05:03. Could you please share your file or an example?

  2. What do you see when you type bgzip -h in your computing environment?

Pharmacogenetecist commented 2 years ago

HLA-DQA1*05:03

The file is extremely large to be able to share, I will try subset where the error is coming from.

For bgzip -h bgzip -h

Version: 1.9 Usage: bgzip [OPTIONS] [FILE] ... Options: -b, --offset INT decompress at virtual file pointer (0-based uncompressed offset) -c, --stdout write on standard output, keep original files unchanged -d, --decompress decompress -f, --force overwrite files without asking -h, --help give this help -i, --index compress and create BGZF index -I, --index-name FILE name of BGZF index file [file.gz.gzi] -l, --compress-level INT Compression level to use when compressing; 0 to 9, or -1 for default [-1] -r, --reindex (re)index compressed file -g, --rebgzip use an index file to bgzip a file -s, --size INT decompress INT bytes (uncompressed size) -@, --threads INT number of compression threads to use [1] -t, --test test integrity of compressed file

markwoon commented 2 years ago

@Pharmacogenetecist Just a single line will do, but assuming the error is accurate, it looks like you're providing HLA-DQA1*05:03 as a value for the chromosome value in the VCF, which is invalid...

Pharmacogenetecist commented 2 years ago

HLA-DQA1*05:03

Searching for "HLA-DQA" in the vcf I get the following - attached.

HLA.txt

I am guessing this may be a result of the DRAGEN pipeline we use which assesses HLA. I was hoping using the pre-processing tool would "fix" this, but seems like I have some issues there as well.

Appreciate your help.

markwoon commented 2 years ago

We definitely want to track down the problem you're having with the preprocessor, and we'll be getting back to you on that ASAP.

However, even the preprocessor won't be able to work with this.

I'm not familiar enough with DRAGEN to suggest anything concrete, but I would hazard a guess that there is some configuration somewhere then will get it to provide proper chromosomal locations.

One of the requirements for PharmCAT is that the VCF is based on GRCh38. Without a common understanding of what your data represents, there's no way for PharmCAT to make any calls.

Pharmacogenetecist commented 2 years ago

We definitely want to track down the problem you're having with the preprocessor, and we'll be getting back to you on that ASAP.

However, even the preprocessor won't be able to work with this.

I'm not familiar enough with DRAGEN to suggest anything concrete, but I would hazard a guess that there is some configuration somewhere then will get it to provide proper chromosomal locations.

One of the requirements for PharmCAT is that the VCF is based on GRCh38. Without a common understanding of what your data represents, there's no way for PharmCAT to make any calls.

Thank you, I appreciate the help. I re-tried a different sample as the above was a gvcf file. With a standard vcf file I get the following error. java -jar "/user/PharmCAT-2.0.0/pharmcat-2.0.0-all.jar" -vcf "/user/DATA/test38.vcf" -o "/home/user/DATA/" java.lang.IllegalArgumentException: CHROM column "HLA-DRB1*03:01:01:01" contains whitespace or colons at org.pharmgkb.parser.vcf.model.VcfPosition.(VcfPosition.java:66) at org.pharmgkb.parser.vcf.VcfParser.parseNextLine(VcfParser.java:238) at org.pharmgkb.parser.vcf.VcfParser.parse(VcfParser.java:127) at org.pharmgkb.pharmcat.haplotype.VcfReader.read(VcfReader.java:138) at org.pharmgkb.pharmcat.haplotype.VcfReader.(VcfReader.java:65) at org.pharmgkb.pharmcat.haplotype.NamedAlleleMatcher.buildVcfReader(NamedAlleleMatcher.java:161) at org.pharmgkb.pharmcat.haplotype.NamedAlleleMatcher.call(NamedAlleleMatcher.java:212) at org.pharmgkb.pharmcat.PharmCAT.execute(PharmCAT.java:322) at org.pharmgkb.pharmcat.PharmCAT.main(PharmCAT.java:194)

I suppose if I tried to subset my VCF files for chr 1 - 22, x, y and MT, I could see if that helps.

Pharmacogenetecist commented 2 years ago

I tried the new version of the pre-processor this morning and get the following: Please use bcftools 1.16 or higher bcftools --version bcftools 1.9 Using htslib 1.9

BinglanLi commented 2 years ago

@Pharmacogenetecist bcftools 1.9 is a much lower version (seven versions older) than 1.16. You can download the latest tools here http://www.htslib.org/download/ or load an up-to-date bcftools in your computing environment if it's already installed.

Pharmacogenetecist commented 2 years ago

@Pharmacogenetecist bcftools 1.9 is a much lower version (seven versions older) than 1.16. You can download the latest tools here http://www.htslib.org/download/ or load an up-to-date bcftools in your computing environment if it's already installed.

Thank you, I guess I need to go back to grade school to learn my numbers ;)

I am getting an issue on updating bcftools to 1.16 bcftools: error while loading shared libraries: libgsl.so.25: cannot open shared object file: No such file or directory

I tried making a new conda environment and installing - bcftools=1.16 and - htslib=1.16 but when I run bcftools --version, I get the above error

RESOLVED error by installing - https://anaconda.org/conda-forge/gsl

Will test pre-processor and report back.

Update, I tried 3 different gzipped vcf files.

errors:

  1. vcf file 1 Traceback (most recent call last): File "/user/PharmCAT-2.0.1/preprocessor/PharmCAT_VCF_Preprocess.py", line 245, in run(args) File "/user/PharmCAT-2.0.1/preprocessor/PharmCAT_VCF_Preprocess.py", line 145, in run for line in file: File "/user/anaconda3/envs/PCAT/lib/python3.9/codecs.py", line 322, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I tool the original vcf file 1 and bgzipped it instead of gzip - this time I get an error similar to vcf files 2 and 3 below: Normalizing VCF [E::faidx_adjust_position] The sequence "chrX" was not found faidx_fetch_seq failed at chrX:154531869

  1. vcf file 2 Preparing the reference PGx VCF [E::faidx_adjust_position] The sequence "chrX" was not found faidx_fetch_seq failed at chrX:154532046

  2. vcf file 3 Normalizing VCF [E::faidx_adjust_position] The sequence "chrX" was not found faidx_fetch_seq failed at chrX:154532293

markwoon commented 2 years ago

I think we're going to need more information here. Can you provide the VCF headers and the lines on chrX:154532293?

markwoon commented 2 years ago

I just released a v2.0.2. Could you please re-try with this?

If you've already donwloaded the reference fasta files, you'll need to delete them (look for reference.fna.*). We needed to fix something in them and you'll need to download the updated versions (the VCF preprocessor does this automatically).

Pharmacogenetecist commented 2 years ago

@markwoon @BinglanLi I really appreciate all your help. Headers etc, for some reason I can't ssh into our server today, I will test this out and get back to you as soon as I can.

Pharmacogenetecist commented 2 years ago

I just released a v2.0.2. Could you please re-try with this?

If you've already donwloaded the reference fasta files, you'll need to delete them (look for reference.fna.*). We needed to fix something in them and you'll need to download the updated versions (the VCF preprocessor does this automatically).

Heads:

VCF file 1

fileformat=VCFv4.2

ALT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

For VCF file 2:

fileformat=VCFv4.2

fileDate=20160824

CL=vcffilter -i - -o - --javascript "function record() {HG001.PS=\".\";}"

contig=

contig=

contig=

contig=

contig=

contig=

contig=

I removed 2.0.1 and got the 2.0.2. For both files I still get the same chrX error using 2.0.2

Pharmacogenetecist commented 2 years ago

FYI - the test VCF file I am using (VCF2) is from here: https://ftp-trace.ncbi.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/

BinglanLi commented 2 years ago

Just to confirm, did you remove the previously downloaded reference.fna.* before rerunning the VCF preprocessor v2.0.2? I replicated the error that you had. The warning went away after I removed the previously downloaded reference genome sequence and used the new VCF preprocessor.

Pharmacogenetecist commented 2 years ago

Just to confirm, did you remove the previously downloaded reference.fna.* before rerunning the VCF preprocessor v2.0.2? I replicated the error that you had. The warning went away after I removed the previously downloaded reference genome sequence and used the new VCF preprocessor.

I just retried it after deleting the reference file and get the following error.

Traceback (most recent call last): File "/Pharmcatpreprocessor2.0.2/PharmCAT_VCF_Preprocess.py", line 245, in run(args) File "/Pharmcatpreprocessor2.0.2/PharmCAT_VCF_Preprocess.py", line 179, in run vcf_normalized_pgx_only = util.filter_pgx_variants(bcftools_path, bgzip_path, vcf_normalized, reference_genome, File "/Pharmcatpreprocessor2.0.2/vcf_preprocess_utilities.py", line 515, in filter_pgx_variants updated_info = ';'.join(ref_info.append(fields[7])) if fields[7] != '.' else ';'.join(ref_info) TypeError: can only join an iterable

markwoon commented 2 years ago

This has been fixed in 2.0.3.

Feel free to re-open if you see another problem.