J-SNACKKB / FLIP

A collection of tasks to probe the effectiveness of protein sequence representations in modeling aspects of protein design
Academic Free License v3.0

how to benchmark a new PLM model #26

Open · wangjiaqi8710 opened this issue 3 weeks ago

wangjiaqi8710 commented 3 weeks ago

We would like to benchmark our own PLM model on the meltome thermostability data. Could you please suggest a way to do so? Thanks in advance.

SebieF commented 3 weeks ago

Hello and thanks for your question! I'd suggest you take a look at biotrainer, a framework that lets you train standardized models on embeddings from any pLM (and can even compute the embeddings if your model is publicly available on Hugging Face). The benchmarks for the FLIP datasets were trained with it using the following scripts, which you might find useful in addition: https://github.com/J-SNACKKB/autoeval
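
For orientation, a biotrainer config for a regression task such as meltome thermostability might look roughly like the sketch below. This is only an assumption-laden example: the file name and embedder identifier are placeholders, and the regression-specific options (in particular protocol and loss_choice) should be checked against the biotrainer documentation:

sequence_file: meltome_sequences.fasta
protocol: sequence_to_value
model_choice: FNN
optimizer_choice: adam
loss_choice: mean_squared_error
num_epochs: 200
learning_rate: 1e-3
batch_size: 128
device: cuda
embedder_name: your-org/your-plm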

wangjiaqi8710 commented 3 weeks ago

Thank you very much for the timely reply.

JQ

wangjiaqi8710 commented 3 weeks ago

A further question: is it possible to use biotrainer with structure-aware pLMs such as SaProt (https://github.com/westlake-repl/SaProt?tab=readme-ov-file)? In SaProt, a protein is represented with both a sequence and a structure vocabulary, so a sequence looks like "MdEvVpQpLrVyQdYaKv", where the lowercase letters describe the local conformation of the corresponding residues. Thanks in advance for your help.
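
To illustrate the format (an illustrative sketch only, not code from SaProt or biotrainer): each uppercase amino acid is followed by a lowercase structure token describing the same residue, so a structure-aware sequence can be built by interleaving the two strings:

    # Illustrative sketch (not SaProt/biotrainer code): pair each amino acid
    # (uppercase) with the structure token (lowercase) of the same residue.
    def to_structure_aware_sequence(aa_seq: str, struct_tokens: str) -> str:
        assert len(aa_seq) == len(struct_tokens), "one structure token per residue"
        return "".join(a.upper() + s.lower() for a, s in zip(aa_seq, struct_tokens))

    print(to_structure_aware_sequence("MEVQ", "dvpp"))  # -> "MdEvVpQp"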

SebieF commented 3 weeks ago

Thanks for asking, I checked it today and only had to implement a small change (https://github.com/sacdallago/biotrainer/pull/105) to make the model work. While tokenizer.tokenize(seq) works as expected within the biotrainer embedding calculation, I am not sure whether the tokens used for embedding are correct, because we use the following function for tokenization, which does not match the one provided in the SaProt example:

    def _tokenize(self, batch: List[str]) -> Tuple[torch.tensor, torch.tensor]:
        # Encode the whole batch at once, adding special tokens and padding
        # every sequence to the length of the longest one.
        ids = self._tokenizer.batch_encode_plus(batch, add_special_tokens=True,
                                                is_split_into_words=False,
                                                padding="longest")

        # Move the token ids and the attention mask to the model's device.
        tokenized_sequences = torch.tensor(ids["input_ids"]).to(self._model.device)
        attention_mask = torch.tensor(ids["attention_mask"]).to(self._model.device)
        return tokenized_sequences, attention_mask
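
As a quick sanity check, one could compare the ids from batch_encode_plus with those from a direct tokenizer call (the pattern I assume the SaProt example uses). This is only a sketch, not biotrainer code, and it assumes AutoTokenizer can resolve the SaProt tokenizer from the repo id used in the config below:

    from transformers import AutoTokenizer

    # Assumption: AutoTokenizer resolves the SaProt tokenizer for this repo id.
    tokenizer = AutoTokenizer.from_pretrained("Takagi-san/SaProt_650M_AF2")
    seq = "MdEvVpQpLrVyQdYaKvKa"

    # Token ids as produced inside _tokenize above (single-sequence batch) ...
    ids_batch = tokenizer.batch_encode_plus([seq], add_special_tokens=True,
                                            is_split_into_words=False,
                                            padding="longest")["input_ids"][0]
    # ... versus a direct call on the single sequence.
    ids_direct = tokenizer(seq, add_special_tokens=True)["input_ids"]

    print(ids_batch == ids_direct)  # True if both paths produce identical ids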

So, I would very much appreciate it if you tried the SaProt model for your use case. You can either use the develop branch if you want to try it immediately (https://github.com/sacdallago/biotrainer/tree/develop) or wait for a new release of biotrainer (hopefully this week).

This is the config I used for testing:

sequence_file: sequences.fasta
protocol: sequence_to_class
model_choice: FNN
optimizer_choice: adam
loss_choice: cross_entropy_loss
num_epochs: 200
use_class_weights: True
learning_rate: 1e-3
batch_size: 128
save_split_ids: False
use_half_precision: True
device: cuda
disable_pytorch_compile: False
embedder_name: Takagi-san/SaProt_650M_AF2

This is the example sequence fasta file:

>Seq1 TARGET=Glob SET=train
MdEvVpQpLrVyQdYaKvKa
>Seq2 TARGET=GlobSP SET=val
MdEvVpQpLrVyQdYaKvYa
>Seq3 TARGET=TM SET=test
MdEvVpQpLrVyQdYaKvMa
>Seq4 TARGET=TMSP SET=test
MdEvVpQpLrVyQdYaKvEv
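
For clarity, the header attributes carry the per-sequence class label (TARGET=) and the split assignment (SET=); the small sketch below (illustrative only, not biotrainer's own parser) shows how they can be read:

    from typing import Dict, Tuple

    # Illustrative only, not biotrainer's parser: extract the sequence id and
    # the key=value attributes (TARGET, SET) from a FASTA header line.
    def parse_header(header: str) -> Tuple[str, Dict[str, str]]:
        parts = header.lstrip(">").split()
        return parts[0], dict(p.split("=", 1) for p in parts[1:])

    print(parse_header(">Seq1 TARGET=Glob SET=train"))
    # -> ('Seq1', {'TARGET': 'Glob', 'SET': 'train'})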

If it works for you and the embeddings are calculated as expected, we will consider adding the model and its special sequence vocabulary as an example to biotrainer, so any feedback is heartily welcome :)

wangjiaqi8710 commented 3 weeks ago

The embeddings work as expected. Thank you.