aschluter / ClinPrior

MIT License
7 stars 1 forks source link

Error in readVCF with test sample AND with my own sample #2

Closed manubonetto closed 7 months ago

manubonetto commented 8 months ago

Hi! Yesterday I installed ClinPrior following your README instructions, and the installation apparently went well. However, when following your test instructions I am faced with 2 different issues.

When using the test samples you provided, with the code vcfFile = paste(system.file("extdata/example", package = "ClinPrior"),"HG001_GRCh37_1_22_v4.2.1_benchmark.vep01.KCNQ2Met546Thr.vcf.gz",sep="/") variants <- read.vcfR(vcfFile) variantsFiltered <- readVCF(sampleName = "HG001", variants=variants, assembly= "assembly37")

read.vcfR runs correctly and successfully processes the variants, but readVCF returns Error in file(file, "rt") : invalid 'description' argument when trying to create the variantsFiltered object.

Then, I attempted to run the same commands, but using my own sample, with the code vcfFile = paste("../../test_data/CMG336_TM-001-399000034101.hard-filtered.exoma.hg38.noSQ.SORTED.vcf.gz") variants <- read.vcfR(vcfFile) variantsFiltered <- readVCF(sampleName = "CMG336", variants=variants, assembly= "assembly38")

Again, read.vcfR runs correctly and successfully processes the variants, but readVCF returns a different error: Error in queryMETA(variants, "\\bCSQ\\b")[[1]] : subscript out of bounds when trying to create the variantsFiltered object.

Is there any way you could help me solve this? Thanks in advance!

aschluter commented 8 months ago

Hi manubonetto,

Thank you very much to report these issues.

  1. first issue

Can you check that you have the assembly37 folder in the ClinPrior folder? If you run this command in R-shell:

dir(system.file("extdata", package = "ClinPrior"))

You should have the following folders "assembly37", "assembly38", "example", "maxentpy". If not, you should download them:

dir.create(paste(system.file("extdata", package = "ClinPrior"),"/assembly37",sep=""))
library(piggyback)

pb_download(repo = "aschluter/ClinPrior",
            tag = "v2.0-assemblyGRCh37",
            dest = system.file("extdata/assembly37", package = "ClinPrior"))
dir.create(paste(system.file("extdata", package = "ClinPrior"),"/assembly38",sep=""))
library(piggyback)
pb_download(repo = "aschluter/ClinPrior",
            tag = "v2.0-assemblyGRCh38",
            dest = system.file("extdata/assembly38", package = "ClinPrior"))
  1. second issue Has your vcf file CMG336_TM-001-399000034101.hard-filtered.exoma.hg38.noSQ.SORTED.vcf.gz been annotated by the Ensembl Variant Effect Predictor (VEP) (https://www.ensembl.org/info/docs/tools/vep/index.html)?

CSQ indicates the starting point of the variant consequences annotated by VEP.

Agatha

manubonetto commented 8 months ago

Hi, Agatha! Thank you for the response! The first issue was solved with downloading the folders that were missing. And the second one was solved by annotating the VCF with VEP. However, I have one last question: is it possible to analyze a patient's VCF associated only to one HPO term? I have tried to run proteinScore giving only one HPO term but it won't run unless I provide at least 2.

Again, thank you!

aschluter commented 8 months ago

Hi manubonetto,

Thanks for your feedback! To take advantage of the phenotypic prioritisation process, we recommend running ClinPrior with at least 7-10 specific HPOs. Otherwise the phenotypic prioritisation may be random and not reliable. However, if you don't have any clinical information and you want to run ClinPrior to get the variant prioritisation process, you can add another HPO term. For example, you can add the direct HPO ancestor of the unique HPO you have. The hierarchical HPO structure can be found here: https://hpo.jax.org/app/.

Agatha

manubonetto commented 7 months ago

Hi! I've been having a different issue now. I am testing clinPrior with 4 different samples, using these commands: HPOpatient = c("HP:0011675","HP:0030956") Y<-proteinScore(HPOpatient) ClinPriorGeneScore<-MatrixPropagation(Y,alpha=0.2) vcfFile = "/home/manuela/projects/Prj_CNPq_VariantPrioritization/test_data/annotated_vcf/CMG832_TM-001-399000084103.hard-filtered.exoma.hg38.noSQ.sorted.topmed.gAD_genomes.gAD_exomes.abraom.dbnsfp.vep.vcf.gz" variants <- read.vcfR(vcfFile) resultCMG832 = priorBestVariantVcfR(variants = variants, sampleName = "TM-001-399000084103", GlobalPhenotypicScore = ClinPriorGeneScore, assembly= "assembly38", processors = 3)

For 3 of the samples I'm using priorBestVariantVcfR works just fine. However, for the 4th one (the one referred to in the code above) I am getting the following error: Error in phaseVariants[posPhaseVariants] <- regmatches(gt[selectedVariants, : replacement has length zero

What could be causing this? I don't see why this is happening only with one of my VCFs.

Thank you in advance!

aschluter commented 7 months ago

Hi manubonetto,

Can you check that the sample name "TM-001-399000084103" is the same as in the gt field of the vcf file? Please do the following:

gt = variants@gt
colnames(gt)

Can you see "TM-001-399000084103" ?

Agatha

manubonetto commented 7 months ago

Hi, thanks for the rapid response! Yes, it returns the correct code: > colnames(gt) [1] "FORMAT" "TM-001-399000084103"

Manuela

aschluter commented 7 months ago

OK, it seems that you have phased variants (two or more variants in the same chromosome) in the vcf. ClinPrior recognises them in the gt field with a genotype 0|1 instead of 0/1. Can you show me the full gt of a variant with 0|1? Usually the gt is accompanied by the other allele that is in phase, e.g: 0|1:34,12:46:99:0|1:119469901_G_A:391,0,1392:119469901 is the full gt of a variant that is in phase with 119469901_G_A.

Please do the following:

posPhaseVariants<-grep("\\|",gt[1:dim(variants)[1],2],perl =TRUE)
gt[1:dim(variants)[1],2][posPhaseVariants][1:10]

These are the top 10 phased variants. What do you get?

Agatha

manubonetto commented 7 months ago

Sure! This is what I get: [1] "0|1:35,28:0.444:63:19,12:16,16:48:84,0,50:48.985,7.6623e-05,53:0,34.77,37.77:18,17,12,16:20,15,18,10:859685" [2] "0|1:34,28:0.452:62:18,12:16,16:48:84,0,50:49.248,7.3258e-05,53:0,34.77,37.77:18,16,12,16:20,14,18,10:859685" [3] "0|1:34,30:0.469:64:18,14:16,16:48:84,0,50:49.664,6.8598e-05,53:0,34.77,37.77:13,21,12,18:21,13,18,12:859685" [4] "0|1:14,13:0.481:27:8,7:6,6:48:79,0,50:49.878,6.6527e-05,53:0,29,32:2,12,2,11:6,8,6,7:899919"
[5] "1|0:14,11:0.44:25:7,6:7,5:42:77,0,50:42.043,0.00029304,53:0,34.77,37.77:3,11,0,11:6,8,5,6:899919"
[6] "0|1:11,13:0.542:24:6,6:5,7:46:79,0,45:50,0.00010665,48.363:0,29,32:0,11,2,11:5,6,7,6:899919"
[7] "1|0:15,10:0.4:25:8,5:7,5:41:76,0,50:41.018,0.00036527,53:0,34.77,37.77:3,12,0,10:6,9,5,5:899919"
[8] "1|0:14,11:0.44:25:7,6:7,5:35:70,0,50:34.908,0.0014247,53.001:0,34.77,37.77:3,11,0,11:6,8,5,6:899919"
[9] "1|0:14,8:0.364:22:7,4:7,4:46:82,0,50:46.885,0.00011079,53:0,34.77,37.77:2,12,1,7:7,7,5,3:899919"
[10] "0|1:17,23:0.575:40:8,9:9,14:48:85,0,48:50,7.6364e-05,51.187:0,34.77,37.77:14,3,19,4:7,10,7,16:1007203"

aschluter commented 7 months ago

Hi, The phased variants in your vcf only specify the chromosomal position, not the nucleotides. I have changed the code so that if the nucleotides are not specified in the phased variants, ClinPrior cannot report which variants are in phase. You can now reinstall ClinPrior. You can keep the assembly37 and assembly38 folders from before the installation if you don't want to download them again. Now it would work for your fourth sample. Let me know!

devtools::install_github("aschluter/ClinPrior")
library(ClinPrior)

Best,

Agatha

manubonetto commented 7 months ago

Hi, Agatha!

Thank you so much, everything is working perfectly now.

Manuela