RomoL2 / RegVar

Characterize 3'UTR variants
8 stars 2 forks source link

Issue using CharacterizeVariants functions #9

Open mirunabarbu1 opened 7 months ago

mirunabarbu1 commented 7 months ago

Hello

I hope you are well. I am having an issue utilising RegVar as part of the Docker image here. I downloaded the dockerfile on 13 March 2024 and built the container with Docker, then built the image with Singularity. I also downloaded all files in /extdata. I tried running the CharacterizeVariants function (supplying a VCF file). I've also tried pointing to the specific input and output paths.

CharacterizeVariants

CharacterizeVariants("grch38_variants.vcf", "~/", "~/") gzip: clinvar_for_script.bed already exists; do you wish to overwrite (y or n)? n not overwritten gzip: motifs2 already exists; do you wish to overwrite (y or n)? n not overwritten Error in CharacterizeVariants("grch38_variants.vcf", "~/", : object 'vcf' not found

I also tried running the single input function:

CharacterizeVariants_single_input

CharacterizeVariants_single_input("./") Enter 3'UTR single nucleotide variant chromosome number: X Enter variant position in 1-based coordinates: 154021596 Enter hg38 reference base (capitalized) at variant position: G Enter variant base (capitalized): A Enter a name for this variant (no spaces): testvar1 gzip: clinvar_for_script.bed already exists; do you wish to overwrite (y or n)? n not overwritten gzip: motifs2 already exists; do you wish to overwrite (y or n)? n not overwritten [1] "incorporating eCLIP" [1] "incorporating eQTLS" [1] "incorporating GWAS" [1] "incorporating microRNAs" [1] "incorporating RBP motifs" Warning: the index file is older than the FASTA file. Error in py_run_file_impl(file, local, convert) : ModuleNotFoundError: No module named 'numpy' Run reticulate::py_last_error() for details. In addition: Warning message: In rm(miR_predictions) : object 'miR_predictions' not found

I've checked my extdata directory and I've got all files in there.

Please let me know if you need further information.

Could you also let me know if this works on variants on genome build GRCh37, or is only applicable to GRCh38? Thank you!

Best wishes Miruna

RomoL2 commented 7 months ago

Hello, Did you use the dockerfile from the GitHub repository? It looks like for some reason numpy is not installed in your image for the single input. For the vcf, it is not reading in your vcf file for some reason- it does only work with hg38.

Lindsay

On Thu, Mar 14, 2024 at 1:20 PM mirunabarbu1 @.***> wrote:

Hello

I hope you are well. I am having an issue utilising RegVar as part of the Docker image here. I downloaded the dockerfile on 13 March 2024 and built the container with Docker, then built the image with Singularity. I also downloaded all files in /extdata. I tried running the CharacterizeVariants function (supplying a VCF file). I've also tried pointing to the specific input and output paths.

CharacterizeVariants

CharacterizeVariants("grch37_variants.vcf", "/", "/") gzip: clinvar_for_script.bed already exists; do you wish to overwrite (y or n)? n not overwritten gzip: motifs2 already exists; do you wish to overwrite (y or n)? n not overwritten Error in CharacterizeVariants("grch37_variants.vcf", "~/", : object 'vcf' not found

I also tried running the single input function:

CharacterizeVariants_single_input

CharacterizeVariants_single_input("./") Enter 3'UTR single nucleotide variant chromosome number: X Enter variant position in 1-based coordinates: 154021596 Enter hg38 reference base (capitalized) at variant position: G Enter variant base (capitalized): A Enter a name for this variant (no spaces): testvar1 gzip: clinvar_for_script.bed already exists; do you wish to overwrite (y or n)? n not overwritten gzip: motifs2 already exists; do you wish to overwrite (y or n)? n not overwritten [1] "incorporating eCLIP" [1] "incorporating eQTLS" [1] "incorporating GWAS" [1] "incorporating microRNAs" [1] "incorporating RBP motifs" Warning: the index file is older than the FASTA file. Error in py_run_file_impl(file, local, convert) : ModuleNotFoundError: No module named 'numpy' Run reticulate::py_last_error() for details. In addition: Warning message: In rm(miR_predictions) : object 'miR_predictions' not found

I've checked my extdata directory and I've got all files in there.

Please let me know if you need further information.

Could you also let me know if this works on variants on genome build GRCh37, or is only applicable to GRCh38? Thank you!

Best wishes Miruna

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDG2JY5IQLO7TQ6TFTTYYHL3RAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43ASLTON2WKOZSGE4DMOBYGU4DIOI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

mirunabarbu1 commented 7 months ago

Hello! Thank you for the quick reply!

Yes, I downloaded the dockerfile form the GitHub repo and built it based on this (this was yesterday - mentioning this as I've seen you've last made changed to this file last month). I'm not sure why numpy wouldn't be installed if it's part of the dockerfile.

For the VCF file, this is what the header looks like (I am trialling this on a single chromosome at the moment). the chr, pos, ref, and alt columns are filled with variant information, however everything else is a ".". Would this be an issue?

fileformat=VCFv4.2

contig=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

Thank you for clarifying on the GRCh38 question.

Best wishes Miruna

RomoL2 commented 7 months ago

That shouldn't be an issue for the VCF; I'm not sure why it's not loading correctly with datatable. Can you confirm that the dockerfile loaded without issues? You could try directly installing numpy in the docker and rerunning the single variant function.

Lindsay

On Thu, Mar 14, 2024 at 1:35 PM mirunabarbu1 @.***> wrote:

Hello! Thank you for the quick reply!

Yes, I downloaded the dockerfile form the GitHub repo and built it based on this (this was yesterday - mentioning this as I've seen you've last made changed to this file last month). I'm not sure why numpy wouldn't be installed if it's part of the dockerfile.

For the VCF file, this is what the header looks like (I am trialling this on a single chromosome at the moment). the chr, pos, ref, and alt columns are filled with variant information, however everything else is a ".". Would this be an issue?

fileformat=VCFv4.2

contig=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

Thank you for clarifying on the GRCh38 question.

Best wishes Miruna

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-1997988953, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDH4VAOSO5T54UOOROTYYHNXVAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOJXHE4DQOJVGM . You are receiving this because you commented.Message ID: @.***>

mirunabarbu1 commented 6 months ago

Hello,

I have installed both numpy directly in the docker and re-ran the function. I now get the following error (I am trying a variant that you have confirmed to be in your list in a previous issue on GitHub). Would you happen to know what the issue is?

Enter 3'UTR single nucleotide variant chromosome number: 1 Enter variant position in 1-based coordinates: 90916206 Enter hg38 reference base (capitalized) at variant position: C Enter variant base (capitalized): T Enter a name for this variant (no spaces): test gzip: clinvar_for_script.bed already exists; do you wish to overwrite (y or n)? n not overwritten gzip: motifs2 already exists; do you wish to overwrite (y or n)? n not overwritten [1] "incorporating eCLIP" [1] "incorporating eQTLS" [1] "incorporating GWAS" [1] "incorporating microRNAs" [1] "incorporating RBP motifs" processing SRSF9.txt Error in py_run_file_impl(file, local, convert) : ModuleNotFoundError: No module named 'RBPamp.params' Run reticulate::py_last_error() for details.

reticulate::py_last_error()

── Python Exception Message ──────────────────────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last): File "RBPamp_aff_local.py", line 44, in params = load_model(user_motif_location + curr_RBP) File "RBPamp_aff_local.py", line 23, in load_model from RBPamp.params import ModelSetParams File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 119, in _find_and_load_hook return _run_hook(name, _hook) File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 93, in _run_hook module = hook() File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 117, in _hook return _find_andload(name, import) ModuleNotFoundError: No module named 'RBPamp.params'

── R Traceback ───────────────────────────────────────────────────────────────────────────────────────────────────────────── ▆

  1. └─RegVar::CharacterizeVariants_single_input("./")
  2. └─reticulate::source_python("RBPamp_aff_local.py")
  3. └─reticulate::py_run_file(file, local = FALSE, convert = convert)
  4. └─reticulate:::py_run_file_impl(file, local, convert)

Just to note, I am running this within the activated RBPamp conda environment.

To add, as I've seen this in previous issues: I've updated cython within the RBPamp conda environment:

Collecting cython Downloading Cython-3.0.9-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB) Downloading Cython-3.0.9-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 3.5 MB/s eta 0:00:00 Installing collected packages: cython Attempting uninstall: cython Found existing installation: Cython 0.29.32 Uninstalling Cython-0.29.32: Successfully uninstalled Cython-0.29.32 Successfully installed cython-3.0.9

Best Miruna

RomoL2 commented 6 months ago

That will happen if RBPamp is not installed successfully (the params come from the motifs2 file). You're not getting any errors when running the dockerfile?

On Mon, Mar 25, 2024 at 9:08 AM mirunabarbu1 @.***> wrote:

Hello,

I have installed both numpy directly in the docker and re-ran the function. I now get the following error (I am trying a variant that you have confirmed to be in your list in a previous issue on GitHub). Would you happen to know what the issue is?

Enter 3'UTR single nucleotide variant chromosome number: 1 Enter variant position in 1-based coordinates: 90916206 Enter hg38 reference base (capitalized) at variant position: C Enter variant base (capitalized): T Enter a name for this variant (no spaces): test gzip: clinvar_for_script.bed already exists; do you wish to overwrite (y or n)? n not overwritten gzip: motifs2 already exists; do you wish to overwrite (y or n)? n not overwritten [1] "incorporating eCLIP" [1] "incorporating eQTLS" [1] "incorporating GWAS" [1] "incorporating microRNAs" [1] "incorporating RBP motifs" processing SRSF9.txt Error in py_run_file_impl(file, local, convert) : ModuleNotFoundError: No module named 'RBPamp.params' Run reticulate::py_last_error() for details.

Just to note, I am running this within the activated RBPamp conda environment.

Best Miruna

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2017973407, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDFQQK7BAVTNANEAHC3Y2AOVTAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJXHE3TGNBQG4 . You are receiving this because you commented.Message ID: @.***>

mirunabarbu1 commented 6 months ago

Hello

No, the dockerfile is installed with no errors as far as I know. I am just re-building the container using the dockerfile you specified here on 11 February, as I saw the user had the same error as I did. Not sure how else to de-bug.

https://github.com/RomoL2/RegVar/issues/5

RomoL2 commented 6 months ago

Is the motifs2 folder there in inst->extdata?

On Mon, Mar 25, 2024 at 9:42 AM mirunabarbu1 @.***> wrote:

Hello

No, the dockerfile is installed with no errors as far as I know. I am just re-building the container using the dockerfile you specified here on 11 February, as I saw the user had the same error as I did. Not sure how else to de-bug.

5 https://github.com/RomoL2/RegVar/issues/5

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2018035873, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDF5TVZCR5GDRSTXLR3Y2ASVPAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYGAZTKOBXGM . You are receiving this because you commented.Message ID: @.***>

RomoL2 commented 6 months ago

This is the dockerfile that is working on my end, just to confirm you're using the same:

FROM continuumio/miniconda3:23.3.1-0 MAINTAINER Lindsay Romo @.***>

LABEL \ version="0.0.1" \ description="Image for RegVar(https://github.com/RomoL2/RegVar)"

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \ build-essential \ curl \ git \ gcc-4.8\ git-lfs \ libcurl4-openssl-dev \ libfontconfig1-dev \ libfreetype6-dev \ libfribidi-dev \ libharfbuzz-dev \ libjpeg-dev \ libpng-dev \ libssl-dev \ libtiff5-dev \ libxml2-dev \ python3 \ bedtools \ python3-pip \ r-base \ vim \ wget \ && apt-get clean \ && rm -rf /var/lib/apt/lists/*

RUN conda install -n base -c conda-forge mamba

RUN R -e "install.packages('devtools', dependencies=TRUE)" RUN R -e "devtools::install_github('RomoL2/RegVar')"

WORKDIR /usr/local/lib/R/site-library/RegVar RUN rm -r extdata \ && wget https://zenodo.org/record/10646785/files/extdata.tar.gz \ && tar -xf extdata.tar.gz \ && rm extdata.tar.gz \ && cd extdata \ && wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz \ && gunzip hg38.fa.gz

install RBPamp

WORKDIR /usr/local/lib/R/site-library/RegVar/extdata/RBPamp RUN mamba create --name RBPamp --file requirements.txt -c conda-forge --yes

RUN /opt/conda/envs/RBPamp/bin/pip install future-fstrings --force-reinstall \ && export CC=gcc \ && /opt/conda/envs/RBPamp/bin/python setup.py build \ && /opt/conda/envs/RBPamp/bin/python setup.py install

WORKDIR /

On Mon, Mar 25, 2024 at 9:57 AM Lindsay Romo @.***> wrote:

Is the motifs2 folder there in inst->extdata?

On Mon, Mar 25, 2024 at 9:42 AM mirunabarbu1 @.***> wrote:

Hello

No, the dockerfile is installed with no errors as far as I know. I am just re-building the container using the dockerfile you specified here on 11 February, as I saw the user had the same error as I did. Not sure how else to de-bug.

5 https://github.com/RomoL2/RegVar/issues/5

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2018035873, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDF5TVZCR5GDRSTXLR3Y2ASVPAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYGAZTKOBXGM . You are receiving this because you commented.Message ID: @.***>

mirunabarbu1 commented 6 months ago

Hi, yes, here is what is in the folder - all files non-empty:

(RBPamp) root@7130d355cf37:/usr/local/lib/R/site-library/RegVar/extdata/motifs2# wc -l ./* 76 ./A1CF.txt 76 ./BOLL.txt 76 ./CELF1.txt 46 ./CNOT4.txt 46 ./CPEB1.txt 76 ./DAZ3.txt 57 ./DAZAP1.txt 61 ./EIF4G2.txt 71 ./ELAVL4.txt 61 ./ESRP1.txt 61 ./EWSR1.txt 76 ./FUBP1.txt 76 ./FUBP3.txt 15 ./FUS.txt 76 ./HNRNPA0.txt 76 ./HNRNPA1.txt 43 ./HNRNPA2B1.txt 31 ./HNRNPC.txt 61 ./HNRNPCL1.txt 76 ./HNRNPD.txt 76 ./HNRNPDL.txt 61 ./HNRNPF.txt 76 ./HNRNPH2.txt 76 ./HNRNPK.txt 76 ./HNRNPL.txt 61 ./IGF2BP1.txt 76 ./IGF2BP2.txt 76 ./ILF2.txt 76 ./KHDRBS2.txt 76 ./KHDRBS3.txt 76 ./KHSRP.txt 76 ./MBNL1.txt 76 ./MSI1.txt 61 ./NOVA1.txt 61 ./NUPL2.txt 57 ./PABPN1L.txt 76 ./PCBP1.txt 61 ./PCBP2.txt 76 ./PCBP4.txt 61 ./PRR3.txt 76 ./PTBP3.txt 76 ./PUF60.txt 76 ./PUM1.txt 76 ./PUM2.txt 46 ./RALY.txt 16 ./RBFOX2.txt 16 ./RBFOX3.txt 76 ./RBM15B.txt 46 ./RBM22.txt 43 ./RBM23.txt 46 ./RBM24.txt 61 ./RBM25.txt 76 ./RBM4.txt 76 ./RBM41.txt 76 ./RBM45.txt 61 ./RBM47.txt 61 ./RBM4B.txt 76 ./RBM6.txt 61 ./RBMS2.txt 46 ./RBMS3.txt 76 ./RC3H1.txt 61 ./SF1.txt 76 ./SFPQ.txt 46 ./SNRPA.txt 31 ./SRSF10.txt 15 ./SRSF11.txt 46 ./SRSF2.txt 15 ./SRSF4.txt 31 ./SRSF5.txt 76 ./SRSF8.txt 31 ./SRSF9.txt 43 ./TAF15.txt 46 ./TARDBP.txt 76 ./TIA1.txt 76 ./TRA2A.txt 76 ./TRNAU1AP.txt 61 ./UNK.txt 46 ./ZCRB1.txt 61 ./ZFP36.txt 76 ./ZNF326.txt 4900 total

RomoL2 commented 6 months ago

Ok, let me try reinstalling the docker on my end and confirm it's still working for me.

On Mon, Mar 25, 2024 at 10:00 AM mirunabarbu1 @.***> wrote:

Hi, yes, here is what is in the folder - all files non-empty:

(RBPamp) @.**:/usr/local/lib/R/site-library/RegVar/extdata/motifs2# wc -l ./ 76 ./A1CF.txt 76 ./BOLL.txt 76 ./CELF1.txt 46 ./CNOT4.txt 46 ./CPEB1.txt 76 ./DAZ3.txt 57 ./DAZAP1.txt 61 ./EIF4G2.txt 71 ./ELAVL4.txt 61 ./ESRP1.txt 61 ./EWSR1.txt 76 ./FUBP1.txt 76 ./FUBP3.txt 15 ./FUS.txt 76 ./HNRNPA0.txt 76 ./HNRNPA1.txt 43 ./HNRNPA2B1.txt 31 ./HNRNPC.txt 61 ./HNRNPCL1.txt 76 ./HNRNPD.txt 76 ./HNRNPDL.txt 61 ./HNRNPF.txt 76 ./HNRNPH2.txt 76 ./HNRNPK.txt 76 ./HNRNPL.txt 61 ./IGF2BP1.txt 76 ./IGF2BP2.txt 76 ./ILF2.txt 76 ./KHDRBS2.txt 76 ./KHDRBS3.txt 76 ./KHSRP.txt 76 ./MBNL1.txt 76 ./MSI1.txt 61 ./NOVA1.txt 61 ./NUPL2.txt 57 ./PABPN1L.txt 76 ./PCBP1.txt 61 ./PCBP2.txt 76 ./PCBP4.txt 61 ./PRR3.txt 76 ./PTBP3.txt 76 ./PUF60.txt 76 ./PUM1.txt 76 ./PUM2.txt 46 ./RALY.txt 16 ./RBFOX2.txt 16 ./RBFOX3.txt 76 ./RBM15B.txt 46 ./RBM22.txt 43 ./RBM23.txt 46 ./RBM24.txt 61 ./RBM25.txt 76 ./RBM4.txt 76 ./RBM41.txt 76 ./RBM45.txt 61 ./RBM47.txt 61 ./RBM4B.txt 76 ./RBM6.txt 61 ./RBMS2.txt 46 ./RBMS3.txt 76 ./RC3H1.txt 61 ./SF1.txt 76 ./SFPQ.txt 46 ./SNRPA.txt 31 ./SRSF10.txt 15 ./SRSF11.txt 46 ./SRSF2.txt 15 ./SRSF4.txt 31 ./SRSF5.txt 76 ./SRSF8.txt 31 ./SRSF9.txt 43 ./TAF15.txt 46 ./TARDBP.txt 76 ./TIA1.txt 76 ./TRA2A.txt 76 ./TRNAU1AP.txt 61 ./UNK.txt 46 ./ZCRB1.txt 61 ./ZFP36.txt 76 ./ZNF326.txt 4900 total

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2018072981, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDDFX6RZXM5T5IPN273Y2AUZBAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYGA3TEOJYGE . You are receiving this because you commented.Message ID: @.***>

mirunabarbu1 commented 6 months ago

Yes, confirming I have the same contents in my dockerfile. I have re-built the container and re-ran the code above - it now runs the motifs2 files , but I now get this error:

Error in reticulate::use_python(path_to_python, required = TRUE) : object 'path_to_python' not found

RomoL2 commented 6 months ago

There is something going wrong with your RBPamp. For some reason it is not installed correctly. Will confirm shortly whether it installs on my computer. You could try installing outside of docker per the installation instructions.

mirunabarbu1 commented 6 months ago

Ok - thank you. I am unfortunately not able to install this outside of docker as I am working within an HPC and do not have installation permissions.

mirunabarbu1 commented 6 months ago

Hello again

Just to update, after re-building the container I included these lines in R and RegVar worked for the CharacterizeVariants_single_input option:

reticulate::use_condaenv('RBPamp', required=TRUE)
path_to_python <- "/opt/conda/envs/RBPamp/bin/python"
reticulate::use_python(path_to_python, required = TRUE)

However, when I try to submit a VCF file, I get the same error stating the vcf file is not found. I'd like to try and see if the solution above works with a vcf file (as I would need to submit ~1,500 variants to RegVar), however it does not work. Would you have any thoughts on this?

Error in CharacterizeVariants("./grch38_3prime_regvar_input_final.vcf",  : 
  object 'vcf' not found

Thank you

RomoL2 commented 6 months ago

Oh good glad at least one function is working. I’ll look into the vcf issue this afternoon and get back to you.

On Mon, Mar 25, 2024 at 12:36 PM mirunabarbu1 @.***> wrote:

Hello again

Just to update, after re-building the container I included these lines in R and RegVar worked for the CharacterizeVariants_single_input option:

reticulate::use_condaenv('RBPamp', required=TRUE) path_to_python <- "/opt/conda/envs/RBPamp/bin/python" reticulate::use_python(path_to_python, required = TRUE)

However, when I try to submit a VCF file, I get the same error stating the vcf file is not found. I'd like to try and see if the solution above works with a vcf file (as I would need to submit ~1,500 variants to RegVar), however it does not work. Would you have any thoughts on this?

Error in CharacterizeVariants("./grch38_3prime_regvar_input_final.vcf", : object 'vcf' not found

Thank you

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2018427491, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDD3XWPU7QHDDZWJZVLY2BHCRAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYGQZDONBZGE . You are receiving this because you commented.Message ID: @.***>

mirunabarbu1 commented 6 months ago

Thank you! I've had a quick look as well and it looks like the CharacterizeVariants function within the dockerfile does not read in the filename at any point. Not sure if this is the issue.

CharacterizeVariants_docker.log

RomoL2 commented 6 months ago

Haha, yes, that sounds like an issue for sure. I'll start by looking into that. Maybe I just need to send you an updated R file for that function.

On Mon, Mar 25, 2024 at 1:00 PM mirunabarbu1 @.***> wrote:

Thank you! I've had a quick look as well and it looks like the CharacterizeVariants function within the dockerfile does not read in the filename at any point. Not sure if this is the issue.

CharacterizeVariants_docker.log https://github.com/RomoL2/RegVar/files/14747081/CharacterizeVariants_docker.log

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2018478074, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDHB5I65T4UOP5UF6N3Y2BJ4BAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYGQ3TQMBXGQ . You are receiving this because you commented.Message ID: @.***>

mirunabarbu1 commented 6 months ago

Thanks - I actually ran the function with some modified lines to read in the VCF file, but it would be great to have your updated function too.

It looks like it ran successfully, but out of my total ~1,500 variants I get back ~100. Would you be able to let me know why such few variants are returned? I see here: https://github.com/RomoL2/RegVar/issues/8 you provide some information on how you defined canonical 3'UTR coordinates: using PAS data from Lianoglou et al 2013. Could you let me know which specific dataset you used from this paper?

Any further information on variant missingness would be great. Just to note, I am investigating rare variants (AF < 1%) if that is helpful.

Thank you!

RomoL2 commented 6 months ago

that is a lower yield than I would expect if you're using the same coordinates. Did you use the updated R file that I posted after issue 8? You can do a manual double check- if you look at IGV, you can use the Mayr lab APA atlas (which has the same UTR definitions that i used)- it's in hg37 but i transformed to hg38 for my purposes. http://cbio.mskcc.org/leslielab/ApA/atlas If you find a variant that looks like it's in a UTR but wasn't reported in the output, send it to me and i can look into what happened (you shouldn't have that many lost variants since these are fairly permissive UTR definitions).

RomoL2 commented 6 months ago

oh and I used the combined dataset from that paper (all cells/tissues) to define the UTRs.

mirunabarbu1 commented 6 months ago

Ok, thanks for the further information. The link doesn't work for me (it loads for a while and then says the site can't be reached).

I'll try and run the function line by line and see where the drop-out occurs. After selecting for single nucleotide variants there are 1472 remaining variants - but unsure where the drop-out occurs from then onwards.

Which R file are you referring to? I ran the docker version that you pasted above, which I presume is the latest version.

Best Miruna

RomoL2 commented 6 months ago

oh yes, the docker should pull the latest from github. If you want to send me your vcf i am also happy to run it and figure out where the drop out happens and fix the code as necessary. definitely don't want that many lost variants.

RomoL2 commented 6 months ago

(and I think the Mayr lab server may be down because the site previously worked but is currently not working for me either).

mirunabarbu1 commented 6 months ago

Hello

I did some additional testing & counted the variant drop-out at each stage. In short, the following happens:

N = 1472 single nucleotide variants N = 1198 after incorporating microRNAs N = 96 after running ClinVar bit

I've left out the ClinVar bit and the entire code runs with 1198 remaining variants at the end, with all other information incorporated. The ClinVar intersected variants are also N=96, so the issue is definitely occurring at this stage. I am actually using ClinVar through another source so I am happy to leave this out of the RegVar run - would you think this is fine?

print("incorporating ClinVar")
  vcf_UTR[, `:=`(tmp_key, seq(1:nrow(vcf_UTR)))]
  data.table::fwrite(vcf_UTR[, c("chrom", "chromStart", "chromEnd", 
                                 "tmp_key")], "vcf_UTR_tmp.bed", col.names = FALSE, row.names = FALSE, 
                     quote = FALSE, sep = "\t")
  system("bedtools intersect -a vcf_UTR_tmp.bed -b clinvar_for_script.bed -wa -wb > vcf_UTR_clinvar.bed")
  suppressWarnings(tmp <- data.table::fread("vcf_UTR_clinvar.bed", 
                                            header = F, sep = "\t"))

  # N tmp ClinVar = 96

  if (nrow(tmp) > 0) {
    names(tmp) <- c("chrom", "chromStart", "chromEnd", "tmp_key", 
                    "chr2", "window_start", "window_stop", "info")
    tmp <- tmp[, c("tmp_key", "info")]
    data.table::setkey(tmp, tmp_key)
    data.table::setkey(vcf_UTR, tmp_key)
    vcf_UTR <- vcf_UTR[tmp]
    for (n in 1:length(unique(tmp$tmp_key))) {
      matches <- tmp[tmp_key == tmp$tmp_key[n], ]
      vcf_UTR[tmp_key == tmp$tmp_key[n], `:=`(clinvar_info, 
                                              paste(matches$info, collapse = "|"))]
    }
    vcf_UTR[, `:=`(tmp_key, NULL)]
    vcf_UTR <- unique(vcf_UTR)
  } else {
    vcf_UTR$clinvar_info <- NA
  }

Additionally, what are your thoughts on the drop out from 1472 to 1198 after incorporating microRNAs - I am hoping this is an expected number of variants to be dropped out.

Best wishes and thank you for all the help so far! Miruna

RomoL2 commented 6 months ago

There shouldn’t be any variants lost at either stage which means I need to fix the code. I’ll let you know once I have done so, and thanks for noticing that!

Lindsay

On Mon, Mar 25, 2024 at 4:47 PM mirunabarbu1 @.***> wrote:

Hello

I did some additional testing & counted the variant drop-out at each stage. In short, the following happens:

N = 1472 single nucleotide variants N = 1998 after incorporating microRNAs N = 96 after running ClinVar bit

I've left out the ClinVar bit and the entire code runs with 1998 remaining variants at the end, with all other information incorporated. The ClinVar intersected variants are also N=96, so the issue is definitely occurring at this stage. I am actually using ClinVar through another source so I am happy to leave this out of the RegVar run - would you think this is fine?

print("incorporating ClinVar") vcf_UTR[, :=`(tmp_key, seq(1:nrow(vcf_UTR)))] data.table::fwrite(vcf_UTR[, c("chrom", "chromStart", "chromEnd", "tmp_key")], "vcf_UTR_tmp.bed", col.names = FALSE, row.names = FALSE, quote = FALSE, sep = "\t") system("bedtools intersect -a vcf_UTR_tmp.bed -b clinvar_for_script.bed -wa -wb > vcf_UTR_clinvar.bed") suppressWarnings(tmp <- data.table::fread("vcf_UTR_clinvar.bed", header = F, sep = "\t")) N tmp ClinVar = 96

if (nrow(tmp) > 0) { names(tmp) <- c("chrom", "chromStart", "chromEnd", "tmp_key", "chr2", "window_start", "window_stop", "info") tmp <- tmp[, c("tmp_key", "info")] data.table::setkey(tmp, tmp_key) data.table::setkey(vcf_UTR, tmp_key) vcf_UTR <- vcf_UTR[tmp] for (n in 1:length(unique(tmp$tmp_key))) { matches <- tmp[tmp_key == tmp$tmp_key[n], ] vcf_UTR[tmp_key == tmp$tmp_key[n], :=(clinvar_info, paste(matches$info, collapse = "|"))] } vcf_UTR[, :=(tmp_key, NULL)] vcf_UTR <- unique(vcf_UTR) } else { vcf_UTR$clinvar_info <- NA } `

Additionally, what are your thoughts on the drop out from 1472 to 1198 after incorporating microRNAs - I am hoping this is an expected number of variants to be dropped out.

Best wishes and thank you for all the help so far! Miruna

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2018885530, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDEKRRSJT4DJHO45SGLY2CEN5AVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYHA4DKNJTGA . You are receiving this because you commented.Message ID: @.***>

RomoL2 commented 6 months ago

Ok i found the error and I think it's fixed! This is probably tmi, but of course it was a dumb mistake with datable merging (accidentally had dt_new<-dt1[dt2] instead of dt_new<-dt2[dt1], which caused all variants not near a ClinVar variant or in a miR site to drop out). Thank you for noticing this! Can you rerun with the new R file (uploaded on github) and confirm it is now working appropriately? I also noticed that the vcf read in was commented out from the last time I was debugging, so that's why it wasn't reading in your vcf. That should be working now as well.

Lindsay

mirunabarbu1 commented 6 months ago

Hello! Thank you very much for fixing this so quickly - I've just re-run this and all my variants are returned (1472), so all good!

Just a final question on the code - when running line by line, I noticed in the loop on lines 218-235 (https://github.com/RomoL2/RegVar/blob/main/R/CharacterizeVariants.R#L218) that the triples subsetting is not included in the scenario where the max allele count is 4 (the first "else" scenario). It happened that my variants had a max allele count of 4 and so was getting an error as the last line in the "max 4" scenario includes the "triples_expanded" variants too:

intersection <- rbind(singles, triples_expanded, quads_expanded)

# expand triple and quadruple alleles
  if (max(intersection$allele_count)==3) {
    triples <- intersection[allele_count == 3, ]
    triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)]
    triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
                              triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)])
    data.table::setkeyv(triples_expanded, c("chrom", "chromStart"))
    intersection <- rbind(singles, triples_expanded)
  } else if (max(intersection$allele_count)==4) {
    quads <- intersection[allele_count == 4, ]
    quads[, c("name_1", "name_2", "name_3") := data.table::tstrsplit(name, ",", fixed = TRUE)]
    quads_expanded <- rbind(quads[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
                            quads[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)],
                            quads[, .(chrom, chromStart, chromEnd, name = name_3, score, strand)])
    data.table::setkeyv(quads_expanded, c("chrom", "chromStart"))
    intersection <- rbind(singles, triples_expanded, quads_expanded)
  } else {
    intersection <- singles
  }
  variants<-intersection

I included the triples code in the "max allele count=4" scenario (see below) as otherwise these variants would be missed - just wanted to check if this is correct:

# expand triple and quadruple alleles
  if (max(intersection$allele_count)==3) {
    triples <- intersection[allele_count == 3, ]
    triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)]
    triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
                              triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)])
    data.table::setkeyv(triples_expanded, c("chrom", "chromStart"))
    intersection <- rbind(singles, triples_expanded)
  } else if (max(intersection$allele_count)==4) {

    triples <- intersection[allele_count == 3, ]
    triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)]
    triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
                              triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)])
    data.table::setkeyv(triples_expanded, c("chrom", "chromStart"))

    quads <- intersection[allele_count == 4, ]
    quads[, c("name_1", "name_2", "name_3") := data.table::tstrsplit(name, ",", fixed = TRUE)]
    quads_expanded <- rbind(quads[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
                            quads[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)],
                            quads[, .(chrom, chromStart, chromEnd, name = name_3, score, strand)])
    data.table::setkeyv(quads_expanded, c("chrom", "chromStart"))
    intersection <- rbind(singles, triples_expanded, quads_expanded)
  } else {
    intersection <- singles
  }

Thanks again for being so hands-on with the code!

RomoL2 commented 6 months ago

Will check tonight and get back to you!

Glad it’s otherwise working.

Lindsay

On Tue, Mar 26, 2024 at 6:58 AM mirunabarbu1 @.***> wrote:

Hello! Thank you very much for fixing this so quickly - I've just re-run this and all my variants are returned (1472), so all good!

Just a final question on the code - when running line by line, I noticed in the loop on lines 218-235 ( https://github.com/RomoL2/RegVar/blob/main/R/CharacterizeVariants.R#L218) that the triples subsetting is not included in the scenario where the max allele count is 4 (the first "else" scenario). It happened that my variants had a max allele count of 4 and so was getting an error as the last line in the "max 4" scenario includes the "triples_expanded" variants too:

intersection <- rbind(singles, triples_expanded, quads_expanded)

expand triple and quadruple alleles

if (max(intersection$allele_count)==3) { triples <- intersection[allele_count == 3, ] triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)] triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)]) data.table::setkeyv(triples_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded) } else if (max(intersection$allele_count)==4) { quads <- intersection[allele_count == 4, ] quads[, c("name_1", "name_2", "name_3") := data.table::tstrsplit(name, ",", fixed = TRUE)] quads_expanded <- rbind(quads[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], quads[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)], quads[, .(chrom, chromStart, chromEnd, name = name_3, score, strand)]) data.table::setkeyv(quads_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded, quads_expanded) } else { intersection <- singles } variants<-intersection

I included the triples code in the "max allele count=4" scenario (see below) as otherwise these variants would be missed - just wanted to check if this is correct:

expand triple and quadruple alleles

if (max(intersection$allele_count)==3) { triples <- intersection[allele_count == 3, ] triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)] triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)]) data.table::setkeyv(triples_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded) } else if (max(intersection$allele_count)==4) {

triples <- intersection[allele_count == 3, ]
triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)]
triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
                          triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)])
data.table::setkeyv(triples_expanded, c("chrom", "chromStart"))

quads <- intersection[allele_count == 4, ]
quads[, c("name_1", "name_2", "name_3") := data.table::tstrsplit(name, ",", fixed = TRUE)]
quads_expanded <- rbind(quads[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
                        quads[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)],
                        quads[, .(chrom, chromStart, chromEnd, name = name_3, score, strand)])
data.table::setkeyv(quads_expanded, c("chrom", "chromStart"))
intersection <- rbind(singles, triples_expanded, quads_expanded)

} else { intersection <- singles }

Thanks again for being so hands-on with the code!

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2020120093, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDDIPCPBGTNKA4EM2RTY2FBEBAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRQGEZDAMBZGM . You are receiving this because you commented.Message ID: @.***>

RomoL2 commented 6 months ago

I’m sorry- I got very sick but I promise I haven’t forgotten this. Will get to it asap.

Lindsay

On Tue, Mar 26, 2024 at 1:25 PM Lindsay Romo @.***> wrote:

Will check tonight and get back to you!

Glad it’s otherwise working.

Lindsay

On Tue, Mar 26, 2024 at 6:58 AM mirunabarbu1 @.***> wrote:

Hello! Thank you very much for fixing this so quickly - I've just re-run this and all my variants are returned (1472), so all good!

Just a final question on the code - when running line by line, I noticed in the loop on lines 218-235 ( https://github.com/RomoL2/RegVar/blob/main/R/CharacterizeVariants.R#L218) that the triples subsetting is not included in the scenario where the max allele count is 4 (the first "else" scenario). It happened that my variants had a max allele count of 4 and so was getting an error as the last line in the "max 4" scenario includes the "triples_expanded" variants too:

intersection <- rbind(singles, triples_expanded, quads_expanded)

expand triple and quadruple alleles

if (max(intersection$allele_count)==3) { triples <- intersection[allele_count == 3, ] triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)] triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)]) data.table::setkeyv(triples_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded) } else if (max(intersection$allele_count)==4) { quads <- intersection[allele_count == 4, ] quads[, c("name_1", "name_2", "name_3") := data.table::tstrsplit(name, ",", fixed = TRUE)] quads_expanded <- rbind(quads[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], quads[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)], quads[, .(chrom, chromStart, chromEnd, name = name_3, score, strand)]) data.table::setkeyv(quads_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded, quads_expanded) } else { intersection <- singles } variants<-intersection

I included the triples code in the "max allele count=4" scenario (see below) as otherwise these variants would be missed - just wanted to check if this is correct:

expand triple and quadruple alleles

if (max(intersection$allele_count)==3) { triples <- intersection[allele_count == 3, ] triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)] triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)], triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)]) data.table::setkeyv(triples_expanded, c("chrom", "chromStart")) intersection <- rbind(singles, triples_expanded) } else if (max(intersection$allele_count)==4) {

triples <- intersection[allele_count == 3, ]
triples[, c("name_1", "name_2") := data.table::tstrsplit(name, ",", fixed = TRUE)]
triples_expanded <- rbind(triples[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
                          triples[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)])
data.table::setkeyv(triples_expanded, c("chrom", "chromStart"))

quads <- intersection[allele_count == 4, ]
quads[, c("name_1", "name_2", "name_3") := data.table::tstrsplit(name, ",", fixed = TRUE)]
quads_expanded <- rbind(quads[, .(chrom, chromStart, chromEnd, name = name_1, score, strand)],
                        quads[, .(chrom, chromStart, chromEnd, name = name_2, score, strand)],
                        quads[, .(chrom, chromStart, chromEnd, name = name_3, score, strand)])
data.table::setkeyv(quads_expanded, c("chrom", "chromStart"))
intersection <- rbind(singles, triples_expanded, quads_expanded)

} else { intersection <- singles }

Thanks again for being so hands-on with the code!

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2020120093, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDDIPCPBGTNKA4EM2RTY2FBEBAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRQGEZDAMBZGM . You are receiving this because you commented.Message ID: @.***>

judefelixt commented 5 months ago

I ran into the following error while running the CharacterizeVariants. Hoping to get a fix for it. Thanks ... processing EIF4G2.txt processing RBM41.txt [1] "fetching CADD scores for gnomAD variants" [1] "incorporating ClinVar" [1] "incorporating ClinVar" [1] "incorporating APA info" [1] "incorporating conservation scores" [1] "predicting GWAS or eQTLs" Error: variable 'region' was fitted with type "numeric" but type "logical" was supplied In addition: Warning message: In CharacterizeVariants("regvar.DenovoCNN-Strelka.testdata.vcf", : some variants are not in canonical UTR coordinates

RomoL2 commented 5 months ago

Oh interesting. I know where the problem is but I am currently on vacation- if you send me your input file I can check it out early next week (in case you want to try to do it yourself- the issue is with the glm- one of the columns is expecting numbers but doesn’t have numerical data for some reason).

Lindsay

On Tue, May 21, 2024 at 9:03 AM Jude Felix TENYWA @.***> wrote:

I ran into the following error while running the CharacterizeVariants. Hoping to get a fix for it. Thanks ... processing EIF4G2.txt processing RBM41.txt [1] "fetching CADD scores for gnomAD variants" [1] "incorporating ClinVar" [1] "incorporating ClinVar" [1] "incorporating APA info" [1] "incorporating conservation scores" [1] "predicting GWAS or eQTLs" Error: variable 'region' was fitted with type "numeric" but type "logical" was supplied In addition: Warning message: In CharacterizeVariants("regvar.DenovoCNN-Strelka.testdata.vcf", : some variants are not in canonical UTR coordinates

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2122589080, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDD2VVO6ATPIRO5UHPDZDNAZ7AVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRSGU4DSMBYGA . You are receiving this because you commented.Message ID: @.***>

judefelixt commented 5 months ago

I'm working with a dataset containing genetic variants from patients, which unfortunately I cannot share due to privacy concerns. I'll keep you updated on any progress I make. In the meantime, I'd appreciate any insights you might have on resolving this issue or potential workarounds.

Thank you Jude

RomoL2 commented 5 months ago

Ok- I would look at the code for the function and run it until the glm section for predicting eqtl or gwas variants (you can also just comment that out if that’s not something you care about), and then look at the vcf_utr variable I think it’s called and see if one of the columns that should be numbers is something else. I can send you later what the data should look like at that point. Does that make sense?

Lindsay

On Tue, May 21, 2024 at 9:49 AM Jude Felix TENYWA @.***> wrote:

I'm working with a dataset containing genetic variants from patients, which unfortunately I cannot share due to privacy concerns. I'll keep you updated on any progress I make. In the meantime, I'd appreciate any insights you might have on resolving this issue or potential workarounds.

Thank you Jude

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2122684069, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDFQZN4I7C7Y3FKISRTZDNGGXAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRSGY4DIMBWHE . You are receiving this because you commented.Message ID: @.***>

judefelixt commented 5 months ago

Thank you for providing the workaround guidance. I'll continue investigating the root cause of the issue and hope to get it fixed.

Further in my analysis, I attempted to run the tool on the ClinVar dataset clinvar_20240221.vcf. However, I encountered the following error:

gzip: motifs2 already exists; do you wish to overwrite (y or n)? y gzip: motifs2: Is a directory [1] "incorporating eCLIP" [1] "incorporating eQTLS" [1] "incorporating GWAS" [1] "incorporating microRNAs" [1] "incorporating RBP motifs" Error in rbind(singles, triples_expanded, quads_expanded) : object 'triples_expanded' not found In addition: Warning messages: 1: In CharacterizeVariants("regvar.clinvar_20240221.vcf", "/app/databases/clinvar/hg38/", : some variants are not single nucleotide 2: In CharacterizeVariants("regvar.clinvar_20240221.vcf", "/app/databases/clinvar/hg38/", : some variants are not in canonical UTR coordinates

RomoL2 commented 5 months ago

Ok- this looks like a bug I may need to fix- triples are coordinates with two variants. It should still work if there are none of those but for some reason is not. Let me take a look at the code and get back to you.

Lindsay

On Tue, May 21, 2024 at 11:13 AM Jude Felix TENYWA @.***> wrote:

Thank you for providing the workaround guidance. I'll continue investigating the root cause of the issue and hope to get it fixed.

Further in my analysis, I attempted to run the tool on the ClinVar dataset clinvar_20240221.vcf. However, I encountered the following error:

gzip: motifs2 already exists; do you wish to overwrite (y or n)? y gzip: motifs2: Is a directory [1] "incorporating eCLIP" [1] "incorporating eQTLS" [1] "incorporating GWAS" [1] "incorporating microRNAs" [1] "incorporating RBP motifs" Error in rbind(singles, triples_expanded, quads_expanded) : object 'triples_expanded' not found In addition: Warning messages: 1: In CharacterizeVariants("regvar.clinvar_20240221.vcf", "/app/databases/clinvar/hg38/", : some variants are not single nucleotide 2: In CharacterizeVariants("regvar.clinvar_20240221.vcf", "/app/databases/clinvar/hg38/", : some variants are not in canonical UTR coordinates

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2122862307, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDA4CS2XKAS6WKVT5HDZDNQBLAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRSHA3DEMZQG4 . You are receiving this because you commented.Message ID: @.***>

RomoL2 commented 5 months ago

So the problem is with lines 218-224. It’s not obvious to me why you’re getting that error from just looking at it- I would need my computer to actually run it. Which I can do once I’m home. Are there coordinates in that dataset with two possible snvs? That’s where the hitch is happening:

On Tue, May 21, 2024 at 1:38 PM Lindsay Romo @.***> wrote:

Ok- this looks like a bug I may need to fix- triples are coordinates with two variants. It should still work if there are none of those but for some reason is not. Let me take a look at the code and get back to you.

Lindsay

On Tue, May 21, 2024 at 11:13 AM Jude Felix TENYWA < @.***> wrote:

Thank you for providing the workaround guidance. I'll continue investigating the root cause of the issue and hope to get it fixed.

Further in my analysis, I attempted to run the tool on the ClinVar dataset clinvar_20240221.vcf. However, I encountered the following error:

gzip: motifs2 already exists; do you wish to overwrite (y or n)? y gzip: motifs2: Is a directory [1] "incorporating eCLIP" [1] "incorporating eQTLS" [1] "incorporating GWAS" [1] "incorporating microRNAs" [1] "incorporating RBP motifs" Error in rbind(singles, triples_expanded, quads_expanded) : object 'triples_expanded' not found In addition: Warning messages: 1: In CharacterizeVariants("regvar.clinvar_20240221.vcf", "/app/databases/clinvar/hg38/", : some variants are not single nucleotide 2: In CharacterizeVariants("regvar.clinvar_20240221.vcf", "/app/databases/clinvar/hg38/", : some variants are not in canonical UTR coordinates

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2122862307, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDA4CS2XKAS6WKVT5HDZDNQBLAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRSHA3DEMZQG4 . You are receiving this because you commented.Message ID: @.***>

judefelixt commented 4 months ago

I proceeded to split my Clinvar VCF to ensure the variants within a file had unique coordinates as a workaround for the previous error. the error was resolved this way, however, I encountered an error with one of the split VCFS (attached for your reference), which I hope you can also address in your upcoming patch.

... [1] "incorporating ClinVar" [1] "incorporating ClinVar" [1] "incorporating APA info" [1] "incorporating conservation scores" [1] "predicting GWAS or eQTLs" Error in grepl("consFam", miR_info, perl = TRUE) : object 'miR_info' not found In addition: Warning messages: 1: In CharacterizeVariants("output_file8.vcf", "/app/data/patient/", : some variants are not single nucleotide 2: In CharacterizeVariants("output_file8.vcf", "/app/data/patient/", : some variants are not in canonical UTR coordinates 3: In rm(miR_predictions) : object 'miR_predictions' not found

output_file8.vcf.gz

RomoL2 commented 4 months ago

K I’ll work on this as soon as I get back, which will be early next week. I’ll let you know when things are fixed.

On Wed, May 22, 2024 at 8:58 AM Jude Felix TENYWA @.***> wrote:

I proceeded to split my Clinvar VCF to ensure they had unique positions within them as a workaround for the previous error. However, I encountered an error with one of the split VCFS (attached for your reference), which I hope you can address in your upcoming patch.

... processing EIF4G2.txt processing RBM41.txt [1] "fetching CADD scores for gnomAD variants"

[1] "incorporating ClinVar" [1] "incorporating ClinVar" [1] "incorporating APA info" [1] "incorporating conservation scores" [1] "predicting GWAS or eQTLs" Error in grepl("consFam", miR_info, perl = TRUE) : object 'miR_info' not found In addition: Warning messages: 1: In CharacterizeVariants("output_file8.vcf", "/app/data/patient/", : some variants are not single nucleotide 2: In CharacterizeVariants("output_file8.vcf", "/app/data/patient/", : some variants are not in canonical UTR coordinates 3: In rm(miR_predictions) : object 'miR_predictions' not found

output_file8.vcf.gz https://github.com/RomoL2/RegVar/files/15403535/output_file8.vcf.gz

— Reply to this email directly, view it on GitHub https://github.com/RomoL2/RegVar/issues/9#issuecomment-2124736773, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZOARDERBRKUJNC3N3ORFRDZDSI6VAVCNFSM6AAAAABEWQ3HKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRUG4ZTMNZXGM . You are receiving this because you commented.Message ID: @.***>

RomoL2 commented 4 months ago

Ok, the problem with the multiple variants per coordinate is fixed now; working on the other two issues.

RomoL2 commented 4 months ago

are the vcf coordinates in the vcf you shared hg38? I am only getting 1/22 variants left after removing non-SNVs and intersecting with my UTR coordinates (most of the variants in the vcf you shared are indels rather than SNVs; RegVar only works for single nucleotide substitutions). The only variant left is chr14 53949882 53949883 C>T. I did fix the issue that was keeping this variant from running (the "object 'miR_info' not found" error), however, so that one variant should run now from the vcf file you sent me.

judefelixt commented 4 months ago

are the vcf coordinates in the vcf you shared hg38? I am only getting 1/22 variants left after removing non-SNVs and intersecting with my UTR coordinates (most of the variants in the vcf you shared are indels rather than SNVs; RegVar only works for single nucleotide substitutions). The only variant left is chr14 53949882 53949883 C>T. I did fix the issue that was keeping this variant from running (the "object 'miR_info' not found" error), however, so that one variant should run now from the vcf file you sent me.

Thank you for the fixes. They are hg38 coordinates in the vcf. I had indeed mixed the vcf to have snvs and non snvs for my testing.