calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0
410 stars 126 forks source link

Issue with final matrix? #188

Open 16nbb1 opened 10 months ago

16nbb1 commented 10 months ago

Hi! I'm trying to reproduce the prediction matrices in the tutorial.

import os
import json
import subprocess
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pysam
from basenji import dataset, dna_io, seqnn
import tensorflow as tf

I then loaded in your model

model_dir = '/Users/nadejdaboev/Desktop/Sushant/Akita/'
params_file = model_dir+'params_og.json'
model_file  = model_dir+'model_best.h5'
with open(params_file) as params_open:
    params = json.load(params_open)
    params_model = params['model']
    params_train = params['train']
seqnn_model = seqnn.SeqNN(params_model)

I unfortantely could not install and import cooltools. This is my workaround for that issue. I copied the set_diag function directly from cooltools source code

def set_diag(arr, x, i=0, copy=False):
    if copy:
        arr = arr.copy()
    start = max(i, -arr.shape[1] * i)
    stop = max(0, (arr.shape[1] - i)) * arr.shape[1]
    step = arr.shape[1] + 1
    arr.flat[start:stop:step] = x
    return arr

def from_upper_triu(vector_repr, matrix_len, num_diags):
    z = np.zeros((matrix_len,matrix_len))
    triu_tup = np.triu_indices(matrix_len,num_diags)
    z[triu_tup] = vector_repr
    for i in range(-num_diags+1,num_diags):
        set_diag(z, np.nan, i)
    return z + z.T

I initially tried to use my own local reference .fasta file for chromosome 15 and tried to segment a region

fasta_open = pysam.Fastafile('/Users/nadejdaboev/Desktop/Sushant/Genome/chr15.fasta')
#fasta_open.references
seq = fasta_open.fetch( 'NC_000015.10', 63281152, 64329728 ).upper()

This didn't end up working at the prediction stage, so I overwrote this sequence using the fasta provided

fasta_open = pysam.Fastafile('/Users/nadejdaboev/Desktop/Sushant/Genome/hg38.ml.fa')
seq = fasta_open.fetch( 'chr15', 63281152, 64329728 ).upper()
seq_1hot = dna_io.dna_1hot(seq)
test_pred_from_seq = seqnn_model.model.predict(np.expand_dims(seq_1hot,0))

When I print out a bit of the seq I get: CAAAAACAAAAACTCCCTTCTGACCGCTGCCTTACTCAAG.... the resulting seq_1hot aligns with this sequence.

I then used the following, using the values from the json, to set up the reshaping of the array

target_length = 99681
target_length1 = 1048576 // 2048
target_crop = 65536 // 2048
target_length1_cropped = target_length1 - 2*target_crop
hic_diags =  2

This is consistently outputting the following, as I was expecting: flattened representation length: 99681 symmetrix matrix size: (448,448)

I then tried to visualize

plt.figure(figsize=(8,4))
target_index = 0
vmin=-2
vmax=2
mat = from_upper_triu(test_pred_from_seq[:,:, 0], target_length1_cropped, hic_diags)
plt.subplot(121) 
im = plt.matshow(mat, fignum=False, cmap= 'RdBu_r', vmax=vmax, vmin=vmin)
plt.colorbar(im, fraction=.04, pad = 0.05, ticks=[-2,-1, 0, 1,2]);
plt.title('chr15: 63281152-64329728')

I've attached the resulting plot I get.

Screen Shot 2024-01-11 at 5 50 04 PM

I initially thought it was from the reshaping/cooltools issue, so I checked mat

mat

The output is as follows and is definitely not what I was expecting:

array([[ nan, nan, -0.22503351, ..., -50.06995773, -50.18245697, -50.29498291], [ nan, nan, nan, ..., -49.95742035, -50.06995773, -50.18245697], [ -0.22503351, nan, nan, ..., -49.84490967, -49.95742035, -50.06995773], ..., [-50.06995773, -49.95742035, -49.84490967, ..., nan, nan, -0.22503351], [-50.18245697, -50.06995773, -49.95742035, ..., nan, nan, nan], [-50.29498291, -50.18245697, -50.06995773, ..., -0.22503351, nan, nan]])

Then I looked at test_pred_from_seq[:,:, 0] itself

test_pred_from_seq[:,:, 0]

array([[-0.2250335 , -0.33755007, -0.450067 , ..., -0.2250335 , -0.33755007, -0.2250335 ]], dtype=float32)

I don't believe this is correct either. Please let me know if I've made any fatal errors. I can't seem to track my mistake.