Open wangjiaqi8710 opened 3 weeks ago
Hello, and thanks for your question! I'd suggest taking a look at biotrainer, a framework for training standardized models on embeddings from any pLM (it can also compute the embeddings if your model is publicly available on Hugging Face). The benchmarks for the FLIP dataset were trained with it using the following scripts, which you might also find useful: https://github.com/J-SNACKKB/autoeval
Thank you very much for the timely reply.
JQ
A further question: is it possible to use biotrainer for structure-aware pLMs, such as SaProt (https://github.com/westlake-repl/SaProt?tab=readme-ov-file)? In the SaProt model, the protein is represented with both a sequence and a structural vocabulary, so the sequence appears as "MdEvVpQpLrVyQdYaKv", with each lowercase letter representing the conformation of the preceding residue. Thanks in advance for your help.
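To illustrate the interleaved format described above, here is a minimal sketch that splits such a sequence into per-residue (amino acid, structure letter) pairs. It is not SaProt code: the helper name `split_sa_sequence` is hypothetical, and it assumes every residue is written as exactly one uppercase amino-acid letter followed by one lowercase structure letter.

```python
from typing import List, Tuple


def split_sa_sequence(seq: str) -> List[Tuple[str, str]]:
    """Split an interleaved sequence into (amino_acid, structure_letter) pairs.

    Assumption: strictly alternating uppercase (residue) and lowercase
    (structure) letters, as in the example from this thread.
    """
    if len(seq) % 2 != 0:
        raise ValueError("expected an even number of characters")
    pairs = []
    for i in range(0, len(seq), 2):
        aa, struct = seq[i], seq[i + 1]
        if not aa.isupper() or not struct.islower():
            raise ValueError(f"unexpected pair {aa + struct!r} at position {i}")
        pairs.append((aa, struct))
    return pairs


pairs = split_sa_sequence("MdEvVpQpLrVyQdYaKv")
amino_acids = "".join(aa for aa, _ in pairs)
structure = "".join(s for _, s in pairs)
print(amino_acids)  # MEVQLVQYK
print(structure)    # dvpprydav
```

The 18-character string therefore describes 9 residues, which is worth keeping in mind when checking embedding dimensions.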
Thanks for asking. I checked it today and only had to implement a small change (https://github.com/sacdallago/biotrainer/pull/105) to make the model work. While tokenizer.tokenize(seq) works as expected within the biotrainer embeddings calculation, I am not sure whether the tokens used for embedding are correct, because we use the following function for tokenization, which does not match the one provided by the SaProt example:
    def _tokenize(self, batch: List[str]) -> Tuple[torch.tensor, torch.tensor]:
        ids = self._tokenizer.batch_encode_plus(batch, add_special_tokens=True,
                                                is_split_into_words=False,
                                                padding="longest")
        tokenized_sequences = torch.tensor(ids["input_ids"]).to(self._model.device)
        attention_mask = torch.tensor(ids["attention_mask"]).to(self._model.device)
        return tokenized_sequences, attention_mask
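One way to check whether the two tokenization paths agree is to compare the token strings they produce for the same sequence. The sketch below is library-independent and makes several assumptions: the helper names are hypothetical, the special-token names `<cls>`, `<eos>` and `<pad>` are assumed, and it assumes the SaProt vocabulary uses one two-character token per residue (for example "Md"). The token list would come from something like `tokenizer.convert_ids_to_tokens(...)` applied to one row of `input_ids`.

```python
from typing import List

# Assumed special-token names; adjust to whatever the tokenizer reports.
SPECIAL_TOKENS = {"<cls>", "<eos>", "<pad>"}


def expected_sa_tokens(seq: str) -> List[str]:
    """Expected residue tokens: one two-character token per residue."""
    return [seq[i:i + 2] for i in range(0, len(seq), 2)]


def tokens_match_sa_pairs(tokens: List[str], seq: str) -> bool:
    """Return True if the non-special tokens equal the expected pairs."""
    residue_tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return residue_tokens == expected_sa_tokens(seq)


# Hand-written token lists standing in for real tokenizer output.
good_tokens = ["<cls>", "Md", "Ev", "Vp", "<eos>", "<pad>"]
bad_tokens = ["<cls>", "M", "d", "E", "v", "V", "p", "<eos>"]

ok = tokens_match_sa_pairs(good_tokens, "MdEvVp")
bad = tokens_match_sa_pairs(bad_tokens, "MdEvVp")
print(ok, bad)
```

If the batch path splits the sequence into single characters, the check fails, which would indicate that the embeddings are not computed on the intended structure-aware tokens.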
So I would very much appreciate it if you tried the SaProt model for your use case. You can either use the develop branch if you want to try it immediately (https://github.com/sacdallago/biotrainer/tree/develop) or wait for a new release of biotrainer (hopefully this week).
This is the config I used for testing:
sequence_file: sequences.fasta
protocol: sequence_to_class
model_choice: FNN
optimizer_choice: adam
loss_choice: cross_entropy_loss
num_epochs: 200
use_class_weights: True
learning_rate: 1e-3
batch_size: 128
save_split_ids: False
use_half_precision: True
device: cuda
disable_pytorch_compile: False
embedder_name: Takagi-san/SaProt_650M_AF2
This is the example sequence fasta file:
>Seq1 TARGET=Glob SET=train
MdEvVpQpLrVyQdYaKvKa
>Seq2 TARGET=GlobSP SET=val
MdEvVpQpLrVyQdYaKvYa
>Seq3 TARGET=TM SET=test
MdEvVpQpLrVyQdYaKvMa
>Seq4 TARGET=TMSP SET=test
MdEvVpQpLrVyQdYaKvEv
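The headers above carry the class label and the split assignment as `KEY=VALUE` fields. As a sanity check before training, they can be read back with a few lines of Python. This is a minimal sketch: `parse_fasta_annotations` is a hypothetical helper, not part of biotrainer.

```python
from typing import Dict, List


def parse_fasta_annotations(text: str) -> List[Dict[str, str]]:
    """Parse FASTA records whose headers look like '>ID KEY=VALUE KEY=VALUE'."""
    records: List[Dict[str, str]] = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # First field is the sequence ID, the rest are KEY=VALUE pairs.
            seq_id, *fields = line[1:].split()
            record = {"id": seq_id, "sequence": ""}
            for field in fields:
                key, _, value = field.partition("=")
                record[key] = value
            records.append(record)
        elif records:
            # Sequence lines are appended to the most recent record.
            records[-1]["sequence"] += line
    return records


example = """>Seq1 TARGET=Glob SET=train
MdEvVpQpLrVyQdYaKvKa
>Seq2 TARGET=GlobSP SET=val
MdEvVpQpLrVyQdYaKvYa
"""

records = parse_fasta_annotations(example)
for r in records:
    print(r["id"], r["TARGET"], r["SET"], len(r["sequence"]) // 2, "residues")
```

Each 20-character sequence in the example corresponds to 10 residues in the structure-aware alphabet.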
If it works for you and the embeddings are calculated as expected, we will consider adding the model and its special sequence vocabulary as an example to biotrainer, so any feedback is heartily welcome :)
The embeddings work as expected. Thank you.
We would like to benchmark our own pLM on the Meltome thermostability data. Could you please suggest a way to do so? Thanks in advance.