calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0
411 stars 126 forks source link

Getting NAs while trying to compute Variant effect predictions using basenji_sad.py #113

Open sukritikumar4 opened 2 years ago

sukritikumar4 commented 2 years ago

Hello there, Thank you for providing access to your analysis and models. Appreciate it deeply.

I hit into a small issue while trying to calculate SNP effects using basenji_sad.py. Here is the process I followed: -Initially I had a set of 100 variants that I used as inputs, however noticed that I was getting a few NAs while trying to run them. So I decided to use the tutorial data (rs339331) to check if I get these NAs. When I have only one variant I get a SAD score, but If I copy the same variant in a new line (so a duplicate variant in the vcf), I get NAs. Please find the screenshots of the output:

The VCF file used:

image

The output:

image

The command used: _! bin/basenji_sad.py -f data/hg19.ml.fa -o output/ --rc --shift "1,0,-1" tutorials/models/params_small.json models/heart/modelbest.h5 /tutorials/data/rs339331.vcf

Additionally, I have noticed that the SAD scores change for a variant, if I run it multiple times. Was wondering if my process is correct and why I am hitting into these issues.

davek44 commented 2 years ago

Hi, these aren't problems that I've seen before. The tutorial models are merely for demonstrations. I'd suggest working with one of our fully trained models, e.g. from here https://github.com/calico/basenji/tree/master/manuscripts/cross2020. If you're still encountering problems, you can email me your VCF, and I'll debug.

sukritikumar4 commented 2 years ago

Ah, I understand! Thank you so much for your response.

I just wanted to confirm if what I am doing is right, because my intention is to ultimately use the RF model and get a classification score for a set of non-coding variants with comparative AUCs reported in the publication. Please find the process I have followed so far: Step 1 Ran the basenji_sad.py command using params_human.json and model_human.h5 files instead of the old models mentioned in the tutorial. Step 2 Received the SAD scores with array dimensions (7207x5313) where the total number of variants I have is 7207 An observation in Step 2: Of the 7207 variants, there are three variants where the SAD score is NA even with the new model. Perhaps I could email these to you for further debugging. Step 3 Use this array as an input for the Human RF classification model to obtain a single classification probability (using predict_proba() for class1). Step 4 Use AUC based on Labels and probability.

davek44 commented 2 years ago

Yes, that workflow looks correct to me. For your remaining NAs, I'd recommend double checking that your VCF and FASTA file given to basenji_sad.py match. Happy to look at it for you, too.

sukritikumar4 commented 2 years ago

Ah that makes sense, I will check and match the FASTA against the VCF file I have for the three variants. Additionally, I wanted to confirm that the VCF should be aligned with the GrCh37 version and not the GrCh38 version. Would that be correct? (if not, I can imagine that would be the solution to all my issues, I suppose.)

Thank you so much for your recommendations and Happy Easter weekend! :)

davek44 commented 2 years ago

You should specify the FASTA for the human genome version you want here: https://github.com/calico/basenji/blob/master/bin/basenji_sad.py#L51