MachineLearningLifeScience / BEND

Benchmarking DNA Language Models on Biologically Meaningful Tasks
BSD 3-Clause "New" or "Revised" License

Possible supporting gpn series model? #1

Open HelloWorldLTY opened 1 month ago

HelloWorldLTY commented 1 month ago

Hi, thanks for your great work. It seems that the GPN-MSA model is not supported (GPN-MSA is the newer version of GPN, and it supports the human genome). Would you please consider including it? Thanks.

Traceback (most recent call last):
  File "/gpfs/radev/project/ying_rex/tl688/BEND/run_testgpn.py", line 10, in <module>
    embedder = bend.embedders.GPNEmbedder('songlab/gpn-msa-sapiens')
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/BEND/bend/utils/embedders.py", line 63, in __init__
    self.load_model(*args, **kwargs)
  File "/gpfs/radev/project/ying_rex/tl688/BEND/bend/utils/embedders.py", line 128, in load_model
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 913, in from_pretrained
    tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
                                               ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 732, in __getitem__
    model_type = self._reverse_config_mapping[key.__name__]
KeyError: 'GPNRoFormerConfig'

This is the error message.
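For readers hitting the same traceback: it ends in a bare KeyError raised while AutoTokenizer looks up a tokenizer for the config class GPNRoFormerConfig. A minimal sketch of how a caller could turn that into a more readable error is shown below. The helper `load_tokenizer_or_explain`, the `UnsupportedModelError` class, and `fake_loader` are hypothetical names used only for illustration; none of them are part of BEND or transformers.

```python
class UnsupportedModelError(ValueError):
    """Hypothetical error type: the checkpoint cannot be loaded by this embedder."""


def load_tokenizer_or_explain(loader, model_name):
    """Call loader(model_name) and convert a bare KeyError into a readable error.

    `loader` is any callable that takes a model name, for example
    transformers.AutoTokenizer.from_pretrained (not imported here).
    """
    try:
        return loader(model_name)
    except KeyError as err:
        # err.args[0] is the missing key, e.g. 'GPNRoFormerConfig'.
        raise UnsupportedModelError(
            f"No tokenizer is registered for config {err.args[0]!r}; "
            f"{model_name!r} is probably not supported by this embedder."
        ) from err


def fake_loader(model_name):
    # Stand-in (assumption) that fails the same way as the traceback above.
    raise KeyError("GPNRoFormerConfig")


try:
    load_tokenizer_or_explain(fake_loader, "songlab/gpn-msa-sapiens")
except UnsupportedModelError as exc:
    message = str(exc)
    print(message)
```

This only changes how the failure is reported; it does not make the model loadable.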

miguelgondu commented 1 month ago

Tagging the relevant people: @frederikkemarin @fteufel

fteufel commented 1 month ago

Hi,

yes, GPN-MSA is not supported. As indicated in the readme (https://github.com/MachineLearningLifeScience/BEND?tab=readme-ov-file#embedders-overview), the embedder is meant to be used with the A. thaliana/Brassicales models.

Last I checked, it was unclear whether GPN-MSA is useful as an embedding model (https://x.com/gsbenegas/status/1727746984055083075). Supporting it in our embedders would also be a bit more complicated, as it operates on MSAs rather than single sequences.
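To illustrate why MSA input is harder to fit into a single-sequence embedder interface, here is a toy sketch in plain Python. The vocabulary and both functions are assumptions made up for this example; they do not reflect the actual tokenization or input format of GPN-MSA or BEND. The point is only the shape difference: one sequence gives one token per position, while an alignment gives one column of tokens per position.

```python
# Toy vocabulary (assumption); real models ship their own tokenizers.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "-": 4}


def encode_sequence(seq):
    """Single sequence of length L -> list of L token ids."""
    return [VOCAB[base] for base in seq.upper()]


def encode_msa(rows):
    """Alignment of N equal-length rows -> L columns with N token ids each."""
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    # zip(*rows) transposes the alignment so each item is one aligned column.
    columns = zip(*(row.upper() for row in rows))
    return [[VOCAB[base] for base in column] for column in columns]


single = encode_sequence("ACGT")
msa = encode_msa(["ACGT", "AC-T", "ATGT"])

print(single)  # one id per position
print(msa)     # one list of ids (one per aligned row) per position
```

An embedder written for the first shape needs an extra source of aligned rows for every input region before it can produce the second, which is the added complexity mentioned above.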

fteufel commented 1 month ago

@frederikkemarin can you replace the repo here with a fork from the main repo in your account, so that it points back correctly and can be synced?

HelloWorldLTY commented 1 month ago

Hi, thanks a lot, that makes sense to me. I noticed that one paper custom-trained a GPN for the human genome; I tried that model with BEND and it worked for me. Thanks a lot.

https://www.biorxiv.org/content/10.1101/2024.02.29.582810v1.full.pdf