apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

How are the weights matrices calculated in the aggregated classification module? #64

Closed · a-piece-of-teemo closed this issue 5 months ago

a-piece-of-teemo commented 6 months ago

Hello! In the aggregated classification module, how are the weights of the w1 and w2 matrices calculated, and how are the weights and biases of the dense layer determined? What is the basis for the calculation method, and can new weight matrices be computed? I am eagerly looking forward to your help.

apcamargo commented 6 months ago

You can find the aggregation function and the weights here. The weights were determined via backpropagation, by training a neural network classifier that takes in the total marker frequency and the scores generated by both branches. You can use the code below to train a new model:

import gctf  # gradient centralization (centralizes gradients before each optimizer step)
import tensorflow as tf
from genomad.neural_network.igloo import BranchAttention
from tensorflow.keras import Model
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

def create_attention_model():
    # Inputs: w is the total marker frequency; b1 and b2 are the three-class
    # scores produced by the two classification branches
    w = Input(name="w", shape=(1,))
    b1 = Input(name="b1", shape=(3,))
    b2 = Input(name="b2", shape=(3,))
    attention = BranchAttention()([w, b1, b2])
    output_layer = Dense(3, activation="softmax")(attention)
    model = Model([w, b1, b2], output_layer)
    opt = Adam(
        learning_rate=7.5e-3,
        decay=1e-4,
    )
    opt.get_gradients = gctf.centralized_gradients_for_optimizer(opt)
    model.compile(
        optimizer=opt,
        loss="sparse_categorical_crossentropy",
        weighted_metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
    )
    return model

# Stop training when the validation loss plateaus and restore the best weights
es_callback = EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,
)

# A tf.distribute strategy must be in scope; the default (no-op) strategy
# works for single-device training
strategy = tf.distribute.get_strategy()
with strategy.scope():
    attention_model = create_attention_model()

# To resume from a previously saved model instead (K_i indexes the data split):
# with strategy.scope():
#     attention_model = tf.keras.models.load_model(f"final_models/attention_k{K_i}")

attention_model.fit(
    x=[w_train, b1_train, b2_train],
    y=y_train,
    validation_data=([w_val, b1_val, b2_val], y_val),
    epochs=150,
    callbacks=[es_callback],
)
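
For reference, the training arrays are not defined in the snippet above. A minimal sketch of their expected shapes, using illustrative random data (the three classes are chromosome, plasmid, and virus):

import numpy as np

n = 1000  # illustrative number of training sequences
w_train = np.random.rand(n, 1).astype(np.float32)   # total marker frequency per sequence
b1_train = np.random.rand(n, 3).astype(np.float32)  # branch 1 scores (chromosome, plasmid, virus)
b2_train = np.random.rand(n, 3).astype(np.float32)  # branch 2 scores
y_train = np.random.randint(0, 3, size=n)           # integer class labels, as required by
                                                    # sparse_categorical_crossentropy
# Illustrative validation arrays with the same structure
w_val, b1_val, b2_val, y_val = w_train[:100], b1_train[:100], b2_train[:100], y_train[:100]
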
a-piece-of-teemo commented 5 months ago

Hello, I am pleased to receive your reply. Could you share the dataset used to train these weights? Also, could you explain how you obtained this training data?

apcamargo commented 5 months ago

You can find the training data here: https://zenodo.org/records/8049246

The sources of this data are described in the manuscript: https://www.nature.com/articles/s41587-023-01953-y

a-piece-of-teemo commented 5 months ago

Thank you for your reply.

However, what I am asking about is the datasets used in this weight-training code: w_train, b1_train, b2_train, and y_train, as well as the validation datasets w_val, b1_val, b2_val, and y_val. What do these arrays represent, and how are they obtained? Would it be possible for you to share them? If I have misunderstood the code, please correct me.

I am looking forward to your response.

apcamargo commented 5 months ago

The data can be found in the Zenodo link I sent in the previous comment. The sequences are in benchmark_data/train_test_sequences.fna.gz, and you will need to convert them to one-hot-encoded matrices using the code in https://github.com/apcamargo/genomad/blob/main/genomad/sequence.py#L169. In benchmark_data/sequence_weights.tsv you will find the train/test assignments for each of the five data splits I used for the benchmark. The validation sets used during training consisted of a random sample of 10% of the test data.
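
As a rough illustration of that preprocessing, here is a minimal sketch; the authoritative encoder is the one linked above, and the base ordering, the handling of ambiguous bases, and the sampling details here are assumptions:

import numpy as np

NT_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed base ordering

def one_hot_encode(sequence, length):
    # Encode a DNA sequence as a (length, 4) matrix; positions past the end
    # of the sequence and ambiguous bases (e.g. N) remain all-zero rows
    matrix = np.zeros((length, 4), dtype=np.float32)
    for i, base in enumerate(sequence[:length].upper()):
        if base in NT_INDEX:
            matrix[i, NT_INDEX[base]] = 1.0
    return matrix

# Validation set: a random 10% sample of the test split
n_test = 500  # illustrative test-set size
rng = np.random.default_rng(0)
val_indices = rng.choice(n_test, size=n_test // 10, replace=False)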

a-piece-of-teemo commented 5 months ago

Thank you for your response.

I have another question: which columns in the sequence_weights.tsv file correspond to w, b1, b2, and y respectively?

[screenshot of the sequence_weights.tsv columns]

apcamargo commented 5 months ago

Those are not the network weights. Those are the per-sequence weights used for training the model (refer to the paper to understand why sequences were differentially weighted).

The model weights can be found at https://github.com/apcamargo/genomad/blob/2bdbdd53d7171145338e7e3b53eefca0204f7120/genomad/modules/aggregated_classification.py#L8-L27
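
For orientation only, here is a purely illustrative NumPy sketch of an attention-style aggregation that is consistent with the inputs discussed in this thread; the actual arithmetic is the one in the linked file and may differ:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def aggregate(w, b1, b2, w1, w2, weights, bias):
    # w: (n, 1) marker frequency; b1, b2: (n, 3) branch scores.
    # Gate each branch by an attention coefficient derived from w, then map
    # the blended scores through a dense softmax layer (weights, bias).
    attention = softmax(np.concatenate([w @ w1, w @ w2], axis=-1))  # (n, 2)
    blended = attention[:, :1] * b1 + attention[:, 1:] * b2         # (n, 3)
    return softmax(blended @ weights + bias)                        # (n, 3)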

zzzfire commented 4 months ago

> You can find the aggregation function and the weights here. The weights were determined via backpropagation, by training a neural network classifier that takes in the total marker frequency and the scores generated by both branches. You can use the code below to train a new model: [training script quoted in full from the earlier reply]

Hello author, I have made some modifications to the model, and it seems that the parameters of the weight matrix used for score aggregation (w1, w2, weights, and bias) may no longer be appropriate. I would like to know whether it is possible to train new values for w1, w2, weights, and bias using the code you provided. Also, if I use your code, does that imply that its inputs should be npz files containing the prediction scores from two models given the same input sequences, and that the labels should be the classes of the sequences in those npz files?
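
In code, I imagine assembling the inputs roughly like this (the file names and array keys are placeholders for my own models' outputs):

import numpy as np

# Placeholder file names and keys for the two models' prediction scores
b1_train = np.load("model1_scores.npz")["scores"]  # (n, 3) scores from model 1
b2_train = np.load("model2_scores.npz")["scores"]  # (n, 3) scores from model 2
w_train = np.load("marker_frequencies.npz")["freq"].reshape(-1, 1)  # (n, 1)
y_train = np.load("labels.npz")["labels"]  # (n,) integer labels of the same sequences
assert len(w_train) == len(b1_train) == len(b2_train) == len(y_train)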

The above is my plan for using your code to train new score-aggregation parameters (w1, w2, weights, and bias). If there are any mistakes, please tell me what went wrong and how I should proceed with the training. Thank you for your help, and I am looking forward to your reply.

apcamargo commented 4 months ago

Yes, that sounds about right.
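
If it helps, the newly trained parameters can be pulled out of the Keras model along these lines; the layer position and weight ordering below are assumptions, so verify them with attention_model.summary():

# Assumed layer access; confirm the positions with attention_model.summary()
branch_attention = attention_model.layers[3]     # the BranchAttention layer (position assumed)
new_w1, new_w2 = branch_attention.get_weights()  # assumed ordering of the attention matrices
dense = attention_model.layers[-1]               # final softmax Dense layer
new_weights, new_bias = dense.get_weights()      # kernel and bias, in that order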

zzzfire commented 3 months ago

Hello author,

I am very pleased to receive your reply. I have a few questions:

1. What data do y_train and y_val refer to, and what is their format? They are not defined in the code, so I cannot tell what they should contain.

2. This training script seems to be missing some pieces, and some parts are difficult to understand. Do you have a more complete script for training the weights and biases that you could share?

Thank you for your help, and I look forward to your reply.