Open mirunabarbu1 opened 7 months ago
Hello, Did you use the dockerfile from the GitHub repository? It looks like for some reason numpy is not installed in your image for the single input. For the vcf, it is not reading in your vcf file for some reason- it does only work with hg38.
Lindsay
On Thu, Mar 14, 2024 at 1:20 PM mirunabarbu1 @.***> wrote:
Hello
I hope you are well. I am having an issue utilising RegVar as part of the Docker image here. I downloaded the dockerfile on 13 March 2024 and built the container with Docker, then built the image with Singularity. I also downloaded all files in /extdata. I tried running the CharacterizeVariants function (supplying a VCF file). I've also tried pointing to the specific input and output paths.
CharacterizeVariants
CharacterizeVariants("grch37_variants.vcf", "/", "/") gzip: clinvar_for_script.bed already exists; do you wish to overwrite (y or n)? n not overwritten gzip: motifs2 already exists; do you wish to overwrite (y or n)? n not overwritten Error in CharacterizeVariants("grch37_variants.vcf", "~/", : object 'vcf' not found
I also tried running the single input function:
CharacterizeVariants_single_input
CharacterizeVariants_single_input("./") Enter 3'UTR single nucleotide variant chromosome number: X Enter variant position in 1-based coordinates: 154021596 Enter hg38 reference base (capitalized) at variant position: G Enter variant base (capitalized): A Enter a name for this variant (no spaces): testvar1 gzip: clinvar_for_script.bed already exists; do you wish to overwrite (y or n)? n not overwritten gzip: motifs2 already exists; do you wish to overwrite (y or n)? n not overwritten [1] "incorporating eCLIP" [1] "incorporating eQTLS" [1] "incorporating GWAS" [1] "incorporating microRNAs" [1] "incorporating RBP motifs" Warning: the index file is older than the FASTA file. Error in py_run_file_impl(file, local, convert) : ModuleNotFoundError: No module named 'numpy' Run reticulate::py_last_error() for details. In addition: Warning message: In rm(miR_predictions) : object 'miR_predictions' not found
I've checked my extdata directory and I've got all files in there.
Please let me know if you need further information.
Could you also let me know if this works on variants on genome build GRCh37, or is only applicable to GRCh38? Thank you!
Best wishes Miruna
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDG2JY5IQLO7TQ6TFTTYYHL3RAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43ASLTON2WKOZSGE4DMOBYGU4DIOI . You are receiving this because you are subscribed to this thread.Message ID: @.***>
Hello! Thank you for the quick reply!
Yes, I downloaded the dockerfile form the GitHub repo and built it based on this (this was yesterday - mentioning this as I've seen you've last made changed to this file last month). I'm not sure why numpy wouldn't be installed if it's part of the dockerfile.
For the VCF file, this is what the header looks like (I am trialling this on a single chromosome at the moment). the chr, pos, ref, and alt columns are filled with variant information, however everything else is a ".". Would this be an issue?
Thank you for clarifying on the GRCh38 question.
Best wishes Miruna
That shouldn't be an issue for the VCF; I'm not sure why it's not loading correctly with datatable. Can you confirm that the dockerfile loaded without issues? You could try directly installing numpy in the docker and rerunning the single variant function.
Lindsay
On Thu, Mar 14, 2024 at 1:35 PM mirunabarbu1 @.***> wrote:
Hello! Thank you for the quick reply!
Yes, I downloaded the dockerfile form the GitHub repo and built it based on this (this was yesterday - mentioning this as I've seen you've last made changed to this file last month). I'm not sure why numpy wouldn't be installed if it's part of the dockerfile.
For the VCF file, this is what the header looks like (I am trialling this on a single chromosome at the moment). the chr, pos, ref, and alt columns are filled with variant information, however everything else is a ".". Would this be an issue?
fileformat=VCFv4.2
contig=
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
Thank you for clarifying on the GRCh38 question.
Best wishes Miruna
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-1997988953, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDH4VAOSO5T54UOOROTYYHNXVAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOJXHE4DQOJVGM . You are receiving this because you commented.Message ID: @.***>
Hello,
I have installed both numpy directly in the docker and re-ran the function. I now get the following error (I am trying a variant that you have confirmed to be in your list in a previous issue on GitHub). Would you happen to know what the issue is?
Enter 3'UTR single nucleotide variant chromosome number: 1
Enter variant position in 1-based coordinates: 90916206
Enter hg38 reference base (capitalized) at variant position: C
Enter variant base (capitalized): T
Enter a name for this variant (no spaces): test
gzip: clinvar_for_script.bed already exists; do you wish to overwrite (y or n)? n
not overwritten
gzip: motifs2 already exists; do you wish to overwrite (y or n)? n
not overwritten
[1] "incorporating eCLIP"
[1] "incorporating eQTLS"
[1] "incorporating GWAS"
[1] "incorporating microRNAs"
[1] "incorporating RBP motifs"
processing SRSF9.txt
Error in py_run_file_impl(file, local, convert) :
ModuleNotFoundError: No module named 'RBPamp.params'
Run reticulate::py_last_error()
for details.
reticulate::py_last_error()
── Python Exception Message ────────────────────────────────────────────────────────────────────────────────────────────────
Traceback (most recent call last):
File "RBPamp_aff_local.py", line 44, in
── R Traceback ───────────────────────────────────────────────────────────────────────────────────────────────────────────── ▆
Just to note, I am running this within the activated RBPamp conda environment.
To add, as I've seen this in previous issues: I've updated cython within the RBPamp conda environment:
Collecting cython Downloading Cython-3.0.9-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB) Downloading Cython-3.0.9-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 3.5 MB/s eta 0:00:00 Installing collected packages: cython Attempting uninstall: cython Found existing installation: Cython 0.29.32 Uninstalling Cython-0.29.32: Successfully uninstalled Cython-0.29.32 Successfully installed cython-3.0.9
Best Miruna
That will happen if RBPamp is not installed successfully (the params come from the motifs2 file). You're not getting any errors when running the dockerfile?
On Mon, Mar 25, 2024 at 9:08 AM mirunabarbu1 @.***> wrote:
Hello,
I have installed both numpy directly in the docker and re-ran the function. I now get the following error (I am trying a variant that you have confirmed to be in your list in a previous issue on GitHub). Would you happen to know what the issue is?
Enter 3'UTR single nucleotide variant chromosome number: 1 Enter variant position in 1-based coordinates: 90916206 Enter hg38 reference base (capitalized) at variant position: C Enter variant base (capitalized): T Enter a name for this variant (no spaces): test gzip: clinvar_for_script.bed already exists; do you wish to overwrite (y or n)? n not overwritten gzip: motifs2 already exists; do you wish to overwrite (y or n)? n not overwritten [1] "incorporating eCLIP" [1] "incorporating eQTLS" [1] "incorporating GWAS" [1] "incorporating microRNAs" [1] "incorporating RBP motifs" processing SRSF9.txt Error in py_run_file_impl(file, local, convert) : ModuleNotFoundError: No module named 'RBPamp.params' Run reticulate::py_last_error() for details.
Just to note, I am running this within the activated RBPamp conda environment.
Best Miruna
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2017973407, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDFQQK7BAVTNANEAHC3Y2AOVTAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJXHE3TGNBQG4 . You are receiving this because you commented.Message ID: @.***>
Hello
No, the dockerfile is installed with no errors as far as I know. I am just re-building the container using the dockerfile you specified here on 11 February, as I saw the user had the same error as I did. Not sure how else to de-bug.
Is the motifs2 folder there in inst->extdata?
On Mon, Mar 25, 2024 at 9:42 AM mirunabarbu1 @.***> wrote:
Hello
No, the dockerfile is installed with no errors as far as I know. I am just re-building the container using the dockerfile you specified here on 11 February, as I saw the user had the same error as I did. Not sure how else to de-bug.
5 https://github.com/RomoL2/RegVar/issues/5
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2018035873, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDF5TVZCR5GDRSTXLR3Y2ASVPAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYGAZTKOBXGM . You are receiving this because you commented.Message ID: @.***>
This is the dockerfile that is working on my end, just to confirm you're using the same:
FROM continuumio/miniconda3:23.3.1-0 MAINTAINER Lindsay Romo @.***>
LABEL \ version="0.0.1" \ description="Image for RegVar(https://github.com/RomoL2/RegVar)"
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \ build-essential \ curl \ git \ gcc-4.8\ git-lfs \ libcurl4-openssl-dev \ libfontconfig1-dev \ libfreetype6-dev \ libfribidi-dev \ libharfbuzz-dev \ libjpeg-dev \ libpng-dev \ libssl-dev \ libtiff5-dev \ libxml2-dev \ python3 \ bedtools \ python3-pip \ r-base \ vim \ wget \ && apt-get clean \ && rm -rf /var/lib/apt/lists/*
RUN conda install -n base -c conda-forge mamba
RUN R -e "install.packages('devtools', dependencies=TRUE)" RUN R -e "devtools::install_github('RomoL2/RegVar')"
WORKDIR /usr/local/lib/R/site-library/RegVar RUN rm -r extdata \ && wget https://zenodo.org/record/10646785/files/extdata.tar.gz \ && tar -xf extdata.tar.gz \ && rm extdata.tar.gz \ && cd extdata \ && wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz \ && gunzip hg38.fa.gz
WORKDIR /usr/local/lib/R/site-library/RegVar/extdata/RBPamp RUN mamba create --name RBPamp --file requirements.txt -c conda-forge --yes
RUN /opt/conda/envs/RBPamp/bin/pip install future-fstrings --force-reinstall \ && export CC=gcc \ && /opt/conda/envs/RBPamp/bin/python setup.py build \ && /opt/conda/envs/RBPamp/bin/python setup.py install
WORKDIR /
On Mon, Mar 25, 2024 at 9:57 AM Lindsay Romo @.***> wrote:
Is the motifs2 folder there in inst->extdata?
On Mon, Mar 25, 2024 at 9:42 AM mirunabarbu1 @.***> wrote:
Hello
No, the dockerfile is installed with no errors as far as I know. I am just re-building the container using the dockerfile you specified here on 11 February, as I saw the user had the same error as I did. Not sure how else to de-bug.
5 https://github.com/RomoL2/RegVar/issues/5
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2018035873, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDF5TVZCR5GDRSTXLR3Y2ASVPAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYGAZTKOBXGM . You are receiving this because you commented.Message ID: @.***>
Hi, yes, here is what is in the folder - all files non-empty:
(RBPamp) root@7130d355cf37:/usr/local/lib/R/site-library/RegVar/extdata/motifs2# wc -l ./* 76 ./A1CF.txt 76 ./BOLL.txt 76 ./CELF1.txt 46 ./CNOT4.txt 46 ./CPEB1.txt 76 ./DAZ3.txt 57 ./DAZAP1.txt 61 ./EIF4G2.txt 71 ./ELAVL4.txt 61 ./ESRP1.txt 61 ./EWSR1.txt 76 ./FUBP1.txt 76 ./FUBP3.txt 15 ./FUS.txt 76 ./HNRNPA0.txt 76 ./HNRNPA1.txt 43 ./HNRNPA2B1.txt 31 ./HNRNPC.txt 61 ./HNRNPCL1.txt 76 ./HNRNPD.txt 76 ./HNRNPDL.txt 61 ./HNRNPF.txt 76 ./HNRNPH2.txt 76 ./HNRNPK.txt 76 ./HNRNPL.txt 61 ./IGF2BP1.txt 76 ./IGF2BP2.txt 76 ./ILF2.txt 76 ./KHDRBS2.txt 76 ./KHDRBS3.txt 76 ./KHSRP.txt 76 ./MBNL1.txt 76 ./MSI1.txt 61 ./NOVA1.txt 61 ./NUPL2.txt 57 ./PABPN1L.txt 76 ./PCBP1.txt 61 ./PCBP2.txt 76 ./PCBP4.txt 61 ./PRR3.txt 76 ./PTBP3.txt 76 ./PUF60.txt 76 ./PUM1.txt 76 ./PUM2.txt 46 ./RALY.txt 16 ./RBFOX2.txt 16 ./RBFOX3.txt 76 ./RBM15B.txt 46 ./RBM22.txt 43 ./RBM23.txt 46 ./RBM24.txt 61 ./RBM25.txt 76 ./RBM4.txt 76 ./RBM41.txt 76 ./RBM45.txt 61 ./RBM47.txt 61 ./RBM4B.txt 76 ./RBM6.txt 61 ./RBMS2.txt 46 ./RBMS3.txt 76 ./RC3H1.txt 61 ./SF1.txt 76 ./SFPQ.txt 46 ./SNRPA.txt 31 ./SRSF10.txt 15 ./SRSF11.txt 46 ./SRSF2.txt 15 ./SRSF4.txt 31 ./SRSF5.txt 76 ./SRSF8.txt 31 ./SRSF9.txt 43 ./TAF15.txt 46 ./TARDBP.txt 76 ./TIA1.txt 76 ./TRA2A.txt 76 ./TRNAU1AP.txt 61 ./UNK.txt 46 ./ZCRB1.txt 61 ./ZFP36.txt 76 ./ZNF326.txt 4900 total
Ok, let me try reinstalling the docker on my end and confirm it's still working for me.
On Mon, Mar 25, 2024 at 10:00 AM mirunabarbu1 @.***> wrote:
Hi, yes, here is what is in the folder - all files non-empty:
(RBPamp) @.**:/usr/local/lib/R/site-library/RegVar/extdata/motifs2# wc -l ./ 76 ./A1CF.txt 76 ./BOLL.txt 76 ./CELF1.txt 46 ./CNOT4.txt 46 ./CPEB1.txt 76 ./DAZ3.txt 57 ./DAZAP1.txt 61 ./EIF4G2.txt 71 ./ELAVL4.txt 61 ./ESRP1.txt 61 ./EWSR1.txt 76 ./FUBP1.txt 76 ./FUBP3.txt 15 ./FUS.txt 76 ./HNRNPA0.txt 76 ./HNRNPA1.txt 43 ./HNRNPA2B1.txt 31 ./HNRNPC.txt 61 ./HNRNPCL1.txt 76 ./HNRNPD.txt 76 ./HNRNPDL.txt 61 ./HNRNPF.txt 76 ./HNRNPH2.txt 76 ./HNRNPK.txt 76 ./HNRNPL.txt 61 ./IGF2BP1.txt 76 ./IGF2BP2.txt 76 ./ILF2.txt 76 ./KHDRBS2.txt 76 ./KHDRBS3.txt 76 ./KHSRP.txt 76 ./MBNL1.txt 76 ./MSI1.txt 61 ./NOVA1.txt 61 ./NUPL2.txt 57 ./PABPN1L.txt 76 ./PCBP1.txt 61 ./PCBP2.txt 76 ./PCBP4.txt 61 ./PRR3.txt 76 ./PTBP3.txt 76 ./PUF60.txt 76 ./PUM1.txt 76 ./PUM2.txt 46 ./RALY.txt 16 ./RBFOX2.txt 16 ./RBFOX3.txt 76 ./RBM15B.txt 46 ./RBM22.txt 43 ./RBM23.txt 46 ./RBM24.txt 61 ./RBM25.txt 76 ./RBM4.txt 76 ./RBM41.txt 76 ./RBM45.txt 61 ./RBM47.txt 61 ./RBM4B.txt 76 ./RBM6.txt 61 ./RBMS2.txt 46 ./RBMS3.txt 76 ./RC3H1.txt 61 ./SF1.txt 76 ./SFPQ.txt 46 ./SNRPA.txt 31 ./SRSF10.txt 15 ./SRSF11.txt 46 ./SRSF2.txt 15 ./SRSF4.txt 31 ./SRSF5.txt 76 ./SRSF8.txt 31 ./SRSF9.txt 43 ./TAF15.txt 46 ./TARDBP.txt 76 ./TIA1.txt 76 ./TRA2A.txt 76 ./TRNAU1AP.txt 61 ./UNK.txt 46 ./ZCRB1.txt 61 ./ZFP36.txt 76 ./ZNF326.txt 4900 total
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2018072981, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDDFX6RZXM5T5IPN273Y2AUZBAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYGA3TEOJYGE . You are receiving this because you commented.Message ID: @.***>
Yes, confirming I have the same contents in my dockerfile. I have re-built the container and re-ran the code above - it now runs the motifs2 files , but I now get this error:
Error in reticulate::use_python(path_to_python, required = TRUE) : object 'path_to_python' not found
There is something going wrong with your RBPamp. For some reason it is not installed correctly. Will confirm shortly whether it installs on my computer. You could try installing outside of docker per the installation instructions.
Ok - thank you. I am unfortunately not able to install this outside of docker as I am working within an HPC and do not have installation permissions.
Hello again
Just to update, after re-building the container I included these lines in R and RegVar worked for the CharacterizeVariants_single_input option:
reticulate::use_condaenv('RBPamp', required=TRUE)
path_to_python <- "/opt/conda/envs/RBPamp/bin/python"
reticulate::use_python(path_to_python, required = TRUE)
However, when I try to submit a VCF file, I get the same error stating the vcf file is not found. I'd like to try and see if the solution above works with a vcf file (as I would need to submit ~1,500 variants to RegVar), however it does not work. Would you have any thoughts on this?
Error in CharacterizeVariants("./grch38_3prime_regvar_input_final.vcf", :
object 'vcf' not found
Thank you
Oh good glad at least one function is working. I’ll look into the vcf issue this afternoon and get back to you.
On Mon, Mar 25, 2024 at 12:36 PM mirunabarbu1 @.***> wrote:
Hello again
Just to update, after re-building the container I included these lines in R and RegVar worked for the CharacterizeVariants_single_input option:
reticulate::use_condaenv('RBPamp', required=TRUE) path_to_python <- "/opt/conda/envs/RBPamp/bin/python" reticulate::use_python(path_to_python, required = TRUE)
However, when I try to submit a VCF file, I get the same error stating the vcf file is not found. I'd like to try and see if the solution above works with a vcf file (as I would need to submit ~1,500 variants to RegVar), however it does not work. Would you have any thoughts on this?
Error in CharacterizeVariants("./grch38_3prime_regvar_input_final.vcf", : object 'vcf' not found
Thank you
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2018427491, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDD3XWPU7QHDDZWJZVLY2BHCRAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYGQZDONBZGE . You are receiving this because you commented.Message ID: @.***>
Thank you! I've had a quick look as well and it looks like the CharacterizeVariants function within the dockerfile does not read in the filename at any point. Not sure if this is the issue.
Haha, yes, that sounds like an issue for sure. I'll start by looking into that. Maybe I just need to send you an updated R file for that function.
On Mon, Mar 25, 2024 at 1:00 PM mirunabarbu1 @.***> wrote:
Thank you! I've had a quick look as well and it looks like the CharacterizeVariants function within the dockerfile does not read in the filename at any point. Not sure if this is the issue.
CharacterizeVariants_docker.log https://github.com/RomoL2/RegVar/files/14747081/CharacterizeVariants_docker.log
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2018478074, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDHB5I65T4UOP5UF6N3Y2BJ4BAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYGQ3TQMBXGQ . You are receiving this because you commented.Message ID: @.***>
Thanks - I actually ran the function with some modified lines to read in the VCF file, but it would be great to have your updated function too.
It looks like it ran successfully, but out of my total ~1,500 variants I get back ~100. Would you be able to let me know why such few variants are returned? I see here: https://github.com/RomoL2/RegVar/issues/8 you provide some information on how you defined canonical 3'UTR coordinates: using PAS data from Lianoglou et al 2013. Could you let me know which specific dataset you used from this paper?
Any further information on variant missingness would be great. Just to note, I am investigating rare variants (AF < 1%) if that is helpful.
Thank you!
that is a lower yield than I would expect if you're using the same coordinates. Did you use the updated R file that I posted after issue 8? You can do a manual double check- if you look at IGV, you can use the Mayr lab APA atlas (which has the same UTR definitions that i used)- it's in hg37 but i transformed to hg38 for my purposes. http://cbio.mskcc.org/leslielab/ApA/atlas If you find a variant that looks like it's in a UTR but wasn't reported in the output, send it to me and i can look into what happened (you shouldn't have that many lost variants since these are fairly permissive UTR definitions).
oh and I used the combined dataset from that paper (all cells/tissues) to define the UTRs.
Ok, thanks for the further information. The link doesn't work for me (it loads for a while and then says the site can't be reached).
I'll try and run the function line by line and see where the drop-out occurs. After selecting for single nucleotide variants there are 1472 remaining variants - but unsure where the drop-out occurs from then onwards.
Which R file are you referring to? I ran the docker version that you pasted above, which I presume is the latest version.
Best Miruna
oh yes, the docker should pull the latest from github. If you want to send me your vcf i am also happy to run it and figure out where the drop out happens and fix the code as necessary. definitely don't want that many lost variants.
(and I think the Mayr lab server may be down because the site previously worked but is currently not working for me either).
Hello
I did some additional testing & counted the variant drop-out at each stage. In short, the following happens:
N = 1472 single nucleotide variants N = 1198 after incorporating microRNAs N = 96 after running ClinVar bit
I've left out the ClinVar bit and the entire code runs with 1198 remaining variants at the end, with all other information incorporated. The ClinVar intersected variants are also N=96, so the issue is definitely occurring at this stage. I am actually using ClinVar through another source so I am happy to leave this out of the RegVar run - would you think this is fine?
print("incorporating ClinVar")
vcf_UTR[, `:=`(tmp_key, seq(1:nrow(vcf_UTR)))]
data.table::fwrite(vcf_UTR[, c("chrom", "chromStart", "chromEnd",
"tmp_key")], "vcf_UTR_tmp.bed", col.names = FALSE, row.names = FALSE,
quote = FALSE, sep = "\t")
system("bedtools intersect -a vcf_UTR_tmp.bed -b clinvar_for_script.bed -wa -wb > vcf_UTR_clinvar.bed")
suppressWarnings(tmp <- data.table::fread("vcf_UTR_clinvar.bed",
header = F, sep = "\t"))
# N tmp ClinVar = 96
if (nrow(tmp) > 0) {
names(tmp) <- c("chrom", "chromStart", "chromEnd", "tmp_key",
"chr2", "window_start", "window_stop", "info")
tmp <- tmp[, c("tmp_key", "info")]
data.table::setkey(tmp, tmp_key)
data.table::setkey(vcf_UTR, tmp_key)
vcf_UTR <- vcf_UTR[tmp]
for (n in 1:length(unique(tmp$tmp_key))) {
matches <- tmp[tmp_key == tmp$tmp_key[n], ]
vcf_UTR[tmp_key == tmp$tmp_key[n], `:=`(clinvar_info,
paste(matches$info, collapse = "|"))]
}
vcf_UTR[, `:=`(tmp_key, NULL)]
vcf_UTR <- unique(vcf_UTR)
} else {
vcf_UTR$clinvar_info <- NA
}
Additionally, what are your thoughts on the drop out from 1472 to 1198 after incorporating microRNAs - I am hoping this is an expected number of variants to be dropped out.
Best wishes and thank you for all the help so far! Miruna
There shouldn’t be any variants lost at either stage which means I need to fix the code. I’ll let you know once I have done so, and thanks for noticing that!
Lindsay
On Mon, Mar 25, 2024 at 4:47 PM mirunabarbu1 @.***> wrote:
Hello
I did some additional testing & counted the variant drop-out at each stage. In short, the following happens:
N = 1472 single nucleotide variants N = 1998 after incorporating microRNAs N = 96 after running ClinVar bit
I've left out the ClinVar bit and the entire code runs with 1998 remaining variants at the end, with all other information incorporated. The ClinVar intersected variants are also N=96, so the issue is definitely occurring at this stage. I am actually using ClinVar through another source so I am happy to leave this out of the RegVar run - would you think this is fine?
print("incorporating ClinVar") vcf_UTR[, :=`(tmp_key, seq(1:nrow(vcf_UTR)))] data.table::fwrite(vcf_UTR[, c("chrom", "chromStart", "chromEnd", "tmp_key")], "vcf_UTR_tmp.bed", col.names = FALSE, row.names = FALSE, quote = FALSE, sep = "\t") system("bedtools intersect -a vcf_UTR_tmp.bed -b clinvar_for_script.bed -wa -wb > vcf_UTR_clinvar.bed") suppressWarnings(tmp <- data.table::fread("vcf_UTR_clinvar.bed", header = F, sep = "\t")) N tmp ClinVar = 96
if (nrow(tmp) > 0) { names(tmp) <- c("chrom", "chromStart", "chromEnd", "tmp_key", "chr2", "window_start", "window_stop", "info") tmp <- tmp[, c("tmp_key", "info")] data.table::setkey(tmp, tmp_key) data.table::setkey(vcf_UTR, tmp_key) vcf_UTR <- vcf_UTR[tmp] for (n in 1:length(unique(tmp$tmp_key))) { matches <- tmp[tmp_key == tmp$tmp_key[n], ] vcf_UTR[tmp_key == tmp$tmp_key[n], :=(clinvar_info, paste(matches$info, collapse = "|"))] } vcf_UTR[, :=(tmp_key, NULL)] vcf_UTR <- unique(vcf_UTR) } else { vcf_UTR$clinvar_info <- NA } `
Additionally, what are your thoughts on the drop out from 1472 to 1198 after incorporating microRNAs - I am hoping this is an expected number of variants to be dropped out.
Best wishes and thank you for all the help so far! Miruna
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2018885530, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDEKRRSJT4DJHO45SGLY2CEN5AVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYHA4DKNJTGA . You are receiving this because you commented.Message ID: @.***>
Ok i found the error and I think it's fixed! This is probably tmi, but of course it was a dumb mistake with datable merging (accidentally had dt_new<-dt1[dt2] instead of dt_new<-dt2[dt1], which caused all variants not near a ClinVar variant or in a miR site to drop out). Thank you for noticing this! Can you rerun with the new R file (uploaded on github) and confirm it is now working appropriately? I also noticed that the vcf read in was commented out from the last time I was debugging, so that's why it wasn't reading in your vcf. That should be working now as well.
Lindsay
Hello! Thank you very much for fixing this so quickly - I've just re-run this and all my variants are returned (1472), so all good!
Just a final question on the code - when running line by line, I noticed in the loop on lines 218-235 (https://github.com/RomoL2/RegVar/blob/main/R/CharacterizeVariants.R#L218) that the triples subsetting is not included in the scenario where the max allele count is 4 (the first "else" scenario). It happened that my variants had a max allele count of 4 and so was getting an error as the last line in the "max 4" scenario includes the "triples_expanded" variants too:
intersection <- rbind(singles, triples_expanded, quads_expanded)
# expand triple and quadruple alleles
if (max(intersection$allele_count)==3) {
triples <- intersection[allele_count == 3, ]
triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)]
triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)])
data.table::setkeyv(triples_expanded, c("chrom", "chromStart"))
intersection <- rbind(singles, triples_expanded)
} else if (max(intersection$allele_count)==4) {
quads <- intersection[allele_count == 4, ]
quads[, c("name_1", "name_2", "name_3") := data.table::tstrsplit(name, ",", fixed = TRUE)]
quads_expanded <- rbind(quads[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
quads[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)],
quads[, .(chrom, chromStart, chromEnd, name = name_3, score, strand)])
data.table::setkeyv(quads_expanded, c("chrom", "chromStart"))
intersection <- rbind(singles, triples_expanded, quads_expanded)
} else {
intersection <- singles
}
variants<-intersection
I included the triples code in the "max allele count=4" scenario (see below) as otherwise these variants would be missed - just wanted to check if this is correct:
# expand triple and quadruple alleles
if (max(intersection$allele_count)==3) {
triples <- intersection[allele_count == 3, ]
triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)]
triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)])
data.table::setkeyv(triples_expanded, c("chrom", "chromStart"))
intersection <- rbind(singles, triples_expanded)
} else if (max(intersection$allele_count)==4) {
triples <- intersection[allele_count == 3, ]
triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)]
triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)])
data.table::setkeyv(triples_expanded, c("chrom", "chromStart"))
quads <- intersection[allele_count == 4, ]
quads[, c("name_1", "name_2", "name_3") := data.table::tstrsplit(name, ",", fixed = TRUE)]
quads_expanded <- rbind(quads[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
quads[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)],
quads[, .(chrom, chromStart, chromEnd, name = name_3, score, strand)])
data.table::setkeyv(quads_expanded, c("chrom", "chromStart"))
intersection <- rbind(singles, triples_expanded, quads_expanded)
} else {
intersection <- singles
}
Thanks again for being so hands-on with the code!
Will check tonight and get back to you!
Glad it’s otherwise working.
Lindsay
On Tue, Mar 26, 2024 at 6:58 AM mirunabarbu1 @.***> wrote:
Hello! Thank you very much for fixing this so quickly - I've just re-run this and all my variants are returned (1472), so all good!
Just a final question on the code - when running line by line, I noticed in the loop on lines 218-235 ( https://github.com/RomoL2/RegVar/blob/main/R/CharacterizeVariants.R#L218) that the triples subsetting is not included in the scenario where the max allele count is 4 (the first "else" scenario). It happened that my variants had a max allele count of 4 and so was getting an error as the last line in the "max 4" scenario includes the "triples_expanded" variants too:
intersection <- rbind(singles, triples_expanded, quads_expanded)
expand triple and quadruple alleles
if (max(intersection$allele_count)==3) { triples <- intersection[allele_count == 3, ] triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)] triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)]) data.table::setkeyv(triples_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded) } else if (max(intersection$allele_count)==4) { quads <- intersection[allele_count == 4, ] quads[, c("name_1", "name_2", "name_3") := data.table::tstrsplit(name, ",", fixed = TRUE)] quads_expanded <- rbind(quads[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], quads[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)], quads[, .(chrom, chromStart, chromEnd, name = name_3, score, strand)]) data.table::setkeyv(quads_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded, quads_expanded) } else { intersection <- singles } variants<-intersection
I included the triples code in the "max allele count=4" scenario (see below) as otherwise these variants would be missed - just wanted to check if this is correct:
expand triple and quadruple alleles
if (max(intersection$allele_count)==3) { triples <- intersection[allele_count == 3, ] triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)] triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)]) data.table::setkeyv(triples_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded) } else if (max(intersection$allele_count)==4) {
triples <- intersection[allele_count == 3, ] triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)] triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)]) data.table::setkeyv(triples_expanded, c("chrom", "chromStart")) quads <- intersection[allele_count == 4, ] quads[, c("name_1", "name_2", "name_3") := data.table::tstrsplit(name, ",", fixed = TRUE)] quads_expanded <- rbind(quads[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], quads[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)], quads[, .(chrom, chromStart, chromEnd, name = name_3, score, strand)]) data.table::setkeyv(quads_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded, quads_expanded)
} else { intersection <- singles }
Thanks again for being so hands-on with the code!
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2020120093, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDDIPCPBGTNKA4EM2RTY2FBEBAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRQGEZDAMBZGM . You are receiving this because you commented.Message ID: @.***>
I’m sorry- I got very sick but I promise I haven’t forgotten this. Will get to it asap.
Lindsay
On Tue, Mar 26, 2024 at 1:25 PM Lindsay Romo @.***> wrote:
Will check tonight and get back to you!
Glad it’s otherwise working.
Lindsay
On Tue, Mar 26, 2024 at 6:58 AM mirunabarbu1 @.***> wrote:
Hello! Thank you very much for fixing this so quickly - I've just re-run this and all my variants are returned (1472), so all good!
Just a final question on the code - when running line by line, I noticed in the loop on lines 218-235 ( https://github.com/RomoL2/RegVar/blob/main/R/CharacterizeVariants.R#L218) that the triples subsetting is not included in the scenario where the max allele count is 4 (the first "else" scenario). It happened that my variants had a max allele count of 4 and so was getting an error as the last line in the "max 4" scenario includes the "triples_expanded" variants too:
intersection <- rbind(singles, triples_expanded, quads_expanded)
expand triple and quadruple alleles
if (max(intersection$allele_count)==3) { triples <- intersection[allele_count == 3, ] triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)] triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)]) data.table::setkeyv(triples_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded) } else if (max(intersection$allele_count)==4) { quads <- intersection[allele_count == 4, ] quads[, c("name_1", "name_2", "name_3") := data.table::tstrsplit(name, ",", fixed = TRUE)] quads_expanded <- rbind(quads[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], quads[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)], quads[, .(chrom, chromStart, chromEnd, name = name_3, score, strand)]) data.table::setkeyv(quads_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded, quads_expanded) } else { intersection <- singles } variants<-intersection
I included the triples code in the "max allele count=4" scenario (see below) as otherwise these variants would be missed - just wanted to check if this is correct:
expand triple and quadruple alleles
if (max(intersection$allele_count)==3) { triples <- intersection[allele_count == 3, ] triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)] triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)]) data.table::setkeyv(triples_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded) } else if (max(intersection$allele_count)==4) {
triples <- intersection[allele_count == 3, ] triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)] triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)]) data.table::setkeyv(triples_expanded, c("chrom", "chromStart")) quads <- intersection[allele_count == 4, ] quads[, c("name_1", "name_2", "name_3") := data.table::tstrsplit(name, ",", fixed = TRUE)] quads_expanded <- rbind(quads[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], quads[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)], quads[, .(chrom, chromStart, chromEnd, name = name_3, score, strand)]) data.table::setkeyv(quads_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded, quads_expanded)
} else { intersection <- singles }
Thanks again for being so hands-on with the code!
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2020120093, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDDIPCPBGTNKA4EM2RTY2FBEBAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRQGEZDAMBZGM . You are receiving this because you commented.Message ID: @.***>
I ran into the following error while running the CharacterizeVariants. Hoping to get a fix for it. Thanks ... processing EIF4G2.txt processing RBM41.txt [1] "fetching CADD scores for gnomAD variants" [1] "incorporating ClinVar" [1] "incorporating ClinVar" [1] "incorporating APA info" [1] "incorporating conservation scores" [1] "predicting GWAS or eQTLs" Error: variable 'region' was fitted with type "numeric" but type "logical" was supplied In addition: Warning message: In CharacterizeVariants("regvar.DenovoCNN-Strelka.testdata.vcf", : some variants are not in canonical UTR coordinates
Oh interesting. I know where the problem is but I am currently on vacation- if you send me your input file I can check it out early next week (in case you want to try to do it yourself- the issue is with the glm- one of the columns is expecting numbers but doesn’t have numerical data for some reason).
Lindsay
On Tue, May 21, 2024 at 9:03 AM Jude Felix TENYWA @.***> wrote:
I ran into the following error while running the CharacterizeVariants. Hoping to get a fix for it. Thanks ... processing EIF4G2.txt processing RBM41.txt [1] "fetching CADD scores for gnomAD variants" [1] "incorporating ClinVar" [1] "incorporating ClinVar" [1] "incorporating APA info" [1] "incorporating conservation scores" [1] "predicting GWAS or eQTLs" Error: variable 'region' was fitted with type "numeric" but type "logical" was supplied In addition: Warning message: In CharacterizeVariants("regvar.DenovoCNN-Strelka.testdata.vcf", : some variants are not in canonical UTR coordinates
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2122589080, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDD2VVO6ATPIRO5UHPDZDNAZ7AVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRSGU4DSMBYGA . You are receiving this because you commented.Message ID: @.***>
I'm working with a dataset containing genetic variants from patients, which unfortunately I cannot share due to privacy concerns. I'll keep you updated on any progress I make. In the meantime, I'd appreciate any insights you might have on resolving this issue or potential workarounds.
Thank you Jude
Ok- I would look at the code for the function and run it until the glm section for predicting eqtl or gwas variants (you can also just comment that out if that’s not something you care about), and then look at the vcf_utr variable I think it’s called and see if one of the columns that should be numbers is something else. I can send you later what the data should look like at that point. Does that make sense?
Lindsay
On Tue, May 21, 2024 at 9:49 AM Jude Felix TENYWA @.***> wrote:
I'm working with a dataset containing genetic variants from patients, which unfortunately I cannot share due to privacy concerns. I'll keep you updated on any progress I make. In the meantime, I'd appreciate any insights you might have on resolving this issue or potential workarounds.
Thank you Jude
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2122684069, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDFQZN4I7C7Y3FKISRTZDNGGXAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRSGY4DIMBWHE . You are receiving this because you commented.Message ID: @.***>
Thank you for providing the workaround guidance. I'll continue investigating the root cause of the issue and hope to get it fixed.
Further in my analysis, I attempted to run the tool on the ClinVar dataset clinvar_20240221.vcf. However, I encountered the following error:
gzip: motifs2 already exists; do you wish to overwrite (y or n)? y gzip: motifs2: Is a directory [1] "incorporating eCLIP" [1] "incorporating eQTLS" [1] "incorporating GWAS" [1] "incorporating microRNAs" [1] "incorporating RBP motifs" Error in rbind(singles, triples_expanded, quads_expanded) : object 'triples_expanded' not found In addition: Warning messages: 1: In CharacterizeVariants("regvar.clinvar_20240221.vcf", "/app/databases/clinvar/hg38/", : some variants are not single nucleotide 2: In CharacterizeVariants("regvar.clinvar_20240221.vcf", "/app/databases/clinvar/hg38/", : some variants are not in canonical UTR coordinates
Ok- this looks like a bug I may need to fix- triples are coordinates with two variants. It should still work if there are none of those but for some reason is not. Let me take a look at the code and get back to you.
Lindsay
On Tue, May 21, 2024 at 11:13 AM Jude Felix TENYWA @.***> wrote:
Thank you for providing the workaround guidance. I'll continue investigating the root cause of the issue and hope to get it fixed.
Further in my analysis, I attempted to run the tool on the ClinVar dataset clinvar_20240221.vcf. However, I encountered the following error:
gzip: motifs2 already exists; do you wish to overwrite (y or n)? y gzip: motifs2: Is a directory [1] "incorporating eCLIP" [1] "incorporating eQTLS" [1] "incorporating GWAS" [1] "incorporating microRNAs" [1] "incorporating RBP motifs" Error in rbind(singles, triples_expanded, quads_expanded) : object 'triples_expanded' not found In addition: Warning messages: 1: In CharacterizeVariants("regvar.clinvar_20240221.vcf", "/app/databases/clinvar/hg38/", : some variants are not single nucleotide 2: In CharacterizeVariants("regvar.clinvar_20240221.vcf", "/app/databases/clinvar/hg38/", : some variants are not in canonical UTR coordinates
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2122862307, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDA4CS2XKAS6WKVT5HDZDNQBLAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRSHA3DEMZQG4 . You are receiving this because you commented.Message ID: @.***>
So the problem is with lines 218-224. It’s not obvious to me why you’re getting that error from just looking at it- I would need my computer to actually run it. Which I can do once I’m home. Are there coordinates in that dataset with two possible snvs? That’s where the hitch is happening:
On Tue, May 21, 2024 at 1:38 PM Lindsay Romo @.***> wrote:
Ok- this looks like a bug I may need to fix- triples are coordinates with two variants. It should still work if there are none of those but for some reason is not. Let me take a look at the code and get back to you.
Lindsay
On Tue, May 21, 2024 at 11:13 AM Jude Felix TENYWA < @.***> wrote:
Thank you for providing the workaround guidance. I'll continue investigating the root cause of the issue and hope to get it fixed.
Further in my analysis, I attempted to run the tool on the ClinVar dataset clinvar_20240221.vcf. However, I encountered the following error:
gzip: motifs2 already exists; do you wish to overwrite (y or n)? y gzip: motifs2: Is a directory [1] "incorporating eCLIP" [1] "incorporating eQTLS" [1] "incorporating GWAS" [1] "incorporating microRNAs" [1] "incorporating RBP motifs" Error in rbind(singles, triples_expanded, quads_expanded) : object 'triples_expanded' not found In addition: Warning messages: 1: In CharacterizeVariants("regvar.clinvar_20240221.vcf", "/app/databases/clinvar/hg38/", : some variants are not single nucleotide 2: In CharacterizeVariants("regvar.clinvar_20240221.vcf", "/app/databases/clinvar/hg38/", : some variants are not in canonical UTR coordinates
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2122862307, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDA4CS2XKAS6WKVT5HDZDNQBLAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRSHA3DEMZQG4 . You are receiving this because you commented.Message ID: @.***>
I proceeded to split my Clinvar VCF to ensure the variants within a file had unique coordinates as a workaround for the previous error. the error was resolved this way, however, I encountered an error with one of the split VCFS (attached for your reference), which I hope you can also address in your upcoming patch.
... [1] "incorporating ClinVar" [1] "incorporating ClinVar" [1] "incorporating APA info" [1] "incorporating conservation scores" [1] "predicting GWAS or eQTLs" Error in grepl("consFam", miR_info, perl = TRUE) : object 'miR_info' not found In addition: Warning messages: 1: In CharacterizeVariants("output_file8.vcf", "/app/data/patient/", : some variants are not single nucleotide 2: In CharacterizeVariants("output_file8.vcf", "/app/data/patient/", : some variants are not in canonical UTR coordinates 3: In rm(miR_predictions) : object 'miR_predictions' not found
K I’ll work on this as soon as I get back, which will be early next week. I’ll let you know when things are fixed.
On Wed, May 22, 2024 at 8:58 AM Jude Felix TENYWA @.***> wrote:
I proceeded to split my Clinvar VCF to ensure they had unique positions within them as a workaround for the previous error. However, I encountered an error with one of the split VCFS (attached for your reference), which I hope you can address in your upcoming patch.
... processing EIF4G2.txt processing RBM41.txt [1] "fetching CADD scores for gnomAD variants"
[1] "incorporating ClinVar" [1] "incorporating ClinVar" [1] "incorporating APA info" [1] "incorporating conservation scores" [1] "predicting GWAS or eQTLs" Error in grepl("consFam", miR_info, perl = TRUE) : object 'miR_info' not found In addition: Warning messages: 1: In CharacterizeVariants("output_file8.vcf", "/app/data/patient/", : some variants are not single nucleotide 2: In CharacterizeVariants("output_file8.vcf", "/app/data/patient/", : some variants are not in canonical UTR coordinates 3: In rm(miR_predictions) : object 'miR_predictions' not found
output_file8.vcf.gz https://github.com/RomoL2/RegVar/files/15403535/output_file8.vcf.gz
— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2124736773, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDERBRKUJNC3N3ORFRDZDSI6VAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRUG4ZTMNZXGM . You are receiving this because you commented.Message ID: @.***>
Ok, the problem with the multiple variants per coordinate is fixed now; working on the other two issues.
are the vcf coordinates in the vcf you shared hg38? I am only getting 1/22 variants left after removing non-SNVs and intersecting with my UTR coordinates (most of the variants in the vcf you shared are indels rather than SNVs; RegVar only works for single nucleotide substitutions). The only variant left is chr14 53949882 53949883 C>T. I did fix the issue that was keeping this variant from running (the "object 'miR_info' not found" error), however, so that one variant should run now from the vcf file you sent me.
are the vcf coordinates in the vcf you shared hg38? I am only getting 1/22 variants left after removing non-SNVs and intersecting with my UTR coordinates (most of the variants in the vcf you shared are indels rather than SNVs; RegVar only works for single nucleotide substitutions). The only variant left is chr14 53949882 53949883 C>T. I did fix the issue that was keeping this variant from running (the "object 'miR_info' not found" error), however, so that one variant should run now from the vcf file you sent me.
Thank you for the fixes. They are hg38 coordinates in the vcf. I had indeed mixed the vcf to have snvs and non snvs for my testing.
Hello
I hope you are well. I am having an issue utilising RegVar as part of the Docker image here. I downloaded the dockerfile on 13 March 2024 and built the container with Docker, then built the image with Singularity. I also downloaded all files in /extdata. I tried running the CharacterizeVariants function (supplying a VCF file). I've also tried pointing to the specific input and output paths.
CharacterizeVariants
I also tried running the single input function:
CharacterizeVariants_single_input
I've checked my extdata directory and I've got all files in there.
Please let me know if you need further information.
Could you also let me know if this works on variants on genome build GRCh37, or is only applicable to GRCh38? Thank you!
Best wishes Miruna