PharmGKB / PharmCAT

The Pharmacogenomic Clinical Annotation Tool
Mozilla Public License 2.0
120 stars 39 forks source link

Missing calls #57

Closed krukanna closed 3 years ago

krukanna commented 3 years ago
whaleyr commented 3 years ago

Where did you get the pharmcat_positions_0.8.0.vcf.gz file you are using? I suspect that's an outdated version of the file.

The v0.8 release page has a positions file (pharmcat_positions_0.8.0.vcf) that includes all three positions that you specified.

@BinglanLi This looks like it's part of the preprocessor. Can you take a look?

whaleyr commented 3 years ago

Also, I should note, the first 2 positions are for CYP2D6 which will have its allele definition information removed from the repo before the v1.0 release (it's already removed on the development branch).

The third position is a DPYD position that was added back in December of 2020 which was before the March 2021 release of v0.8.

BinglanLi commented 3 years ago

This could be an issue with the reference genome sequence fasta. Could you please share where you obtained the reference fasta?

Could you please try with this reference fasta from the NCBI The GCA 000001405.15 GRCh38 and its corresponding fasta index?

krukanna commented 3 years ago

@whaleyr I downloaded file from the 0.8.0 release page, but I still cannot find that position (there is few positions like that for example chr10:94942213, chr10:94949283, chr13:48037796, chr13:48040982, chr19:38499646). Position before and after in my vcf: chr1 97450066 rs72549303 GG G . PASS PX=DPYD:Reference[83]isG,DPYD:c.1898delC(*3)[1]isdelG GT 0/0 chr1 97450068 rs17376848 A G . PASS PX=DPYD:Reference[83]isA,DPYD:c.1896T>C[1]isG GT 0/0

In section CYP2D6 I've got about 24 missing variants from chr22. Most of them occur in pharmcat_positions_0.8.0 and in my vcf, but they are still missing in report. For example I've got position, where all samples are homozygous ref and after preprocessing in pharmcat_ready_vcf genotype in that position in single sample vcf is changed to ./.

whaleyr commented 3 years ago

Apologies, I misunderstood what you were trying to say. I thought you meant that rs72549303 wasn't in the pharmcat_positions_0.8.0.vcf file but what you're actually saying (if I'm getting this right) is that you're not seeing rs72549303 in your particular sample file after running through the pre-processor. Is that correct?

BinglanLi commented 3 years ago

@krukanna Thanks for clarification. I see the issues that are going on with the preprocessor. Could you please try with the following input instead and see if they fix the issue?

  1. For --ref_pgx_vcf, download and use this recommended reference PGx VCF. Some positions have been updated since the last release.

  2. If (1) doesn't fix the issue, try also for --ref_seq, use the reference human genome from the NCBI, i.e. the GCA 000001405.15 GRCh38 and its corresponding fasta index.

  3. Please use the PharmCAT_VCF_Preprocess.py and vcf_preprocess_utilities.py under the development branch, which has many fixes and performance updates (sorry to mention it if you're already doing so).

If you have more questions, please feel free to check out our wiki for the VCF preprocessor. If your question is not covered there, we would love to include it in the future.

krukanna commented 3 years ago

@whaleyr You understood it correct. Those positions (for example chr1:97450067, chr10:94942213, chr10:94949283, chr13:48037796, chr13:48040982, chr19:38499646) are missing in pharmcat_positions_0.8.0.vcf as well as in pharmcat_positions_0.8.0_updated_06222021.vcf. As a consequence, I see missing calls in a single sample report (no matter if I got this positions in my vcf or not)

@BinglanLi Ad 1. I run the following command - the output didn't change, still got the same missing calls python PharmCAT_VCF_Preprocess.py --input_vcf cohort.genotyped.all.sort.removed.filtered.vcf.gz --ref_seq Homo_sapiens_assembly38.fasta --ref_pgx_vcf pharmcat_positions_0.8.0_updated_06222021.vcf.gz --output_folder Pharmcat/ Ad 2. Nothing changed, because fasta file is used only to normalization, my problem is when filter vcf to positions from pharmcat_positions_0.8.0.vcf Ad 3. I tested the preprocessing scripts from the development branch, there are indeed some nice fixes, but I suspect that my problem is related to a different issue. I see missing calls in report in position, which are absent in pharmcat_positions_0.8.0.vcf. Please answer if maybe a new version of the jar file will solve this problem.

BinglanLi commented 3 years ago

Hi @krukanna, thank you for the detailed listing of positions and your patience. The latest release of PharmCAT v1.0.0 should take care of the issue related to this case of missing calls that you see.

I have tested using a mock file that contains all the PGx positions, including positions for rs72549303, rs1304490498, rs9332131, rs746071566, rs1457579126, and rs121918596 that correspond to your listing of missing calls. All the positions in the mock file was set to alternative alleles; and went through preprocessing and PharmCAT. All the positions were present in the final report html.

I'm closing the issue but feel free to re-open by leaving a comment if you still run into the same issue.