LisaVdB / TADA

Transcriptional activation domain prediction using deep learning

How can I prepare my data file in order to use the prediction function #1

Closed Richie-rider closed 2 weeks ago

Richie-rider commented 1 month ago
  1. Do I need to train the model?
  2. Do I need to split the TF sequences into 40-amino-acid tiles, or can I directly use a FASTA file containing all the TF sequences?
LisaVdB commented 1 month ago

Hi,

  1. The model is already trained and can be used as such (you can find it in data/model-results-notest/checkpoints). It might be beneficial for your research to continue training with additional species-specific data (not provided). If needed, you can also train from scratch with your own data (scripts are provided).
  2. Yes, the TF sequences need to be split into tiles (short sequences). We used the script src/Tiling.py for this. If your tiles are not exactly 40 AA, the preprocessing script will extend them to 40 AA with methionine (M) or split the sequences; see the sketch after this list.
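
For illustration, here is a minimal tiling sketch in the spirit of src/Tiling.py (not the actual script; the non-overlapping stride and padding short tiles at the end with methionine are assumptions based on the description above):

def tile_sequence(seq, length=40, stride=40):
    """Split a protein sequence into fixed-length tiles."""
    tiles = []
    for start in range(0, len(seq), stride):
        tile = seq[start:start + length]
        # Assumed behavior: pad a short final tile to full length with methionine
        tile = tile + 'M' * (length - len(tile))
        tiles.append(tile)
    return tiles

tiles = tile_sequence("MKTAYIAKQR" * 9)  # 90-AA toy sequence
print(len(tiles), [len(t) for t in tiles])  # -> 3 [40, 40, 40]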

Hope this helps. Cheers,

WenZhuang1 commented 1 month ago

Hello! Can you provide the command line code? Thank you very much.

LisaVdB commented 1 month ago

Hi, the scripts were run in Python IDEs, so dedicated command line code is not available.
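
You can, however, invoke a script directly from a shell; a minimal sketch, assuming predictions.py is run from the src/ directory so that the relative ../data/ paths resolve:

cd src
python predictions.py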

pomchoi commented 1 month ago

Thank you for the great work! I want to use TADA for predicting ADs in Arabidopsis thaliana proteins that possibly function as co-activators. However, I don't know how to prepare the activation score needed for the predictions.py script. I would be happy if you could tell me.

Best regards

LisaVdB commented 1 month ago

Hi, thanks for your interest! You do not need the activation score to make predictions. We included the activation score as a column in the predictions.py script because we also evaluated performance on our independent dataset. You can adjust the script to your data; here is the script using two input columns, labels and sequences.

from Preprocessing import scale_features_predict
from Preprocessing import create_features
from Preprocessing import split_seq
from Model import create_model
import csv
import os
import pandas as pd
import numpy as np
from pickle import dump, load
np.random.seed(1258)  # for reproducibility

# Create the output directory if it does not exist
save_file_path = '../data/predictions/'
if not os.path.exists(save_file_path):
    os.mkdir(save_file_path)

# Read the input CSV: column 0 = labels, column 1 = sequences
with open(save_file_path + 'Evolution_dataset.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    data = []
    for row in csv_reader:
        data.append([row[0], row[1]])
data.pop(0)  # drop the header row

labels = [i[0] for i in data]
sequences = [i[1] for i in data]

'''
Calculate features
'''

# Defines the sequence window size and steps (stride length). Change values if needed.
SEQUENCE_WINDOW = 5
STEPS = 1
LENGTH = 40  # tile length in AA

features = create_features(sequences, SEQUENCE_WINDOW, STEPS)
features_scaled = scale_features_predict(features)

# Save the features
dump(features_scaled, open(save_file_path + 'features_scaled.pkl', 'wb'))

# When features are already generated - uncomment to load instead of recomputing
#features_scaled = load(open(save_file_path + 'features_scaled.pkl', 'rb'))

'''
Load model
'''

model = create_model(SHAPE=(36, 42))  # 36 windows per 40-AA tile (40 - 5 + 1), 42 features per window
print('\x1b[2K\tModel created')

model_weights_path = '../data/model-results-notest/checkpoints/'
model.load_weights(model_weights_path + 'tada.14-0.02.hdf5')
print('\x1b[2K\tWeights loaded')

#Make classification predictions
predictions = model.predict(features_scaled)

'''
Save data
'''

data = list(zip(labels, sequences, list(predictions[:,0])))
data = pd.DataFrame(data)
data.columns = ["labels", "sequences", "predictions"]
data.to_csv(save_file_path + "Predictions.csv")
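
For reference, the input CSV is expected to have a header row followed by one row per tile, e.g. (hypothetical labels and sequences; the script reads columns by position, so the header names are arbitrary):

labels,sequences
tile_001,MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV
tile_002,MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAM

The script then writes Predictions.csv with one prediction score per tile.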
pomchoi commented 1 month ago

Thank you very much!!