OpenBioML / protein-lm-scaling

Zero shot evaluation #27

Closed Muedi closed 10 months ago

Muedi commented 10 months ago

Hi,

as discussed with @pascalnotin, I added esm to the modeling/models folder. I then added the prediction script from the esm GitHub repo to my eval script and adapted the argparse handling, among other things.

However, I have not been able to test this and I am a bit stuck, so I wanted to get some input here:

When I tried to run the script, the imports themselves already threw an error, warning me of possible circular imports:

AttributeError                            Traceback (most recent call last)
Cell In[26], line 1
----> 1 import protein_lm.modeling.models.esm as esm

File c:\Users\maxsp\Work\protein-lm-scaling\protein_lm\modeling\models\esm\__init__.py:10
      8 from .data import Alphabet, BatchConverter, FastaBatchedDataset  # noqa
      9 from .model.esm1 import ProteinBertModel  # noqa
---> 10 from .model.esm2 import ESM2  # noqa
     11 from .model.msa_transformer import MSATransformer  #noqa
     12 from . import pretrained  # noqa

File c:\Users\maxsp\Work\protein-lm-scaling\protein_lm\modeling\models\esm\model\esm2.py:14
     10 import protein_lm.modeling.models.esm as esm
     11 from ..modules import ContactPredictionHead, ESM1bLayerNorm, RobertaLMHead, TransformerLayer
---> 14 class ESM2(nn.Module):
     15     def __init__(
     16         self,
     17         num_layers: int = 33,
   (...)
     21         token_dropout: bool = True,
     22     ):
     23         super().__init__()

File c:\Users\maxsp\Work\protein-lm-scaling\protein_lm\modeling\models\esm\model\esm2.py:20, in ESM2()
...
     22     ):
     23         super().__init__()
     24         self.num_layers = num_layers

AttributeError: partially initialized module 'protein_lm.modeling.models.esm' has no attribute 'data' (most likely due to a circular import)

The commit ce4702b46769b458bb309c1feaaa8ce1b90449ad contains, alongside my changes to the eval script itself, multiple changes to the esm codebase that try to mitigate the circular-import problem and adjust the imports to match the folder structure.
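For readers who have not seen this class of error before, here is a minimal sketch that reproduces it outside the repository. It assumes nothing about the actual esm code: the throwaway packages `demo_circular_a` and `demo_circular_b`, their `data`/`model` modules, and the helper names are all hypothetical. The broken variant lets `model.py` import its own parent package and read an attribute from it while the parent's `__init__.py` is still running; the fixed variant imports the sibling module directly with a relative import.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def write_package(root: Path, name: str, model_source: str) -> None:
    """Create a tiny package whose __init__ imports `model` before `data`."""
    pkg = root / name
    pkg.mkdir()
    (pkg / "__init__.py").write_text(
        "from .model import Model  # runs model.py first\n"
        "from . import data\n"
    )
    (pkg / "data.py").write_text("ALPHABET = 'ESM-1b'\n")
    (pkg / "model.py").write_text(model_source)


def try_import(name: str, model_source: str) -> str:
    """Import the package from a temp dir; return 'ok' or the error message."""
    with tempfile.TemporaryDirectory() as tmp:
        write_package(Path(tmp), name, model_source)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module(name)
            return "ok"
        except AttributeError as err:
            return str(err)
        finally:
            # Leave the interpreter as we found it.
            sys.path.remove(tmp)
            stale = [m for m in sys.modules if m == name or m.startswith(name + ".")]
            for mod in stale:
                del sys.modules[mod]


# Hypothetical model.py that reaches back into its half-built parent package.
BROKEN = (
    "import {name}\n"
    "class Model:\n"
    "    alphabet = {name}.data.ALPHABET  # evaluated at class-definition time\n"
)

# Hypothetical model.py that imports the sibling module directly instead.
FIXED = (
    "from .data import ALPHABET\n"
    "class Model:\n"
    "    alphabet = ALPHABET\n"
)

broken_result = try_import("demo_circular_a", BROKEN.format(name="demo_circular_a"))
fixed_result = try_import("demo_circular_b", FIXED)
print(broken_result)
print(fixed_result)
```

The exact wording of the message varies slightly between Python versions, so the sketch only reports whatever the interpreter raises.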

All imports seem to work now, but I still get an AttributeError when I try to call a function from pretrained.py:

AttributeError                            Traceback (most recent call last)
c:\Users\maxsp\Work\protein-lm-scaling\protein_lm\evaluation\scripts.py\Protein-gym.py in line 139
    311 # inference for each model
    312 # set checkpoint to be mnodel location for now
    313 model_location = checkpoint.split("/")[-1]
--> 315 model, alphabet = pretrained.load_model_and_alphabet(model_location)
    316 model.eval()
    317 if torch.cuda.is_available() and not nogpu:

File c:\Users\maxsp\Work\protein-lm-scaling\protein_lm\modeling\models\esm\pretrained.py:28, in load_model_and_alphabet(model_name)
     26     return load_model_and_alphabet_local(model_name)
     27 else:
---> 28     return load_model_and_alphabet_hub(model_name)

File c:\Users\maxsp\Work\protein-lm-scaling\protein_lm\modeling\models\esm\pretrained.py:64, in load_model_and_alphabet_hub(model_name)
     62 def load_model_and_alphabet_hub(model_name):
     63     model_data, regression_data = _download_model_and_regression_data(model_name)
---> 64     return load_model_and_alphabet_core(model_name, model_data, regression_data)

File c:\Users\maxsp\Work\protein-lm-scaling\protein_lm\modeling\models\esm\pretrained.py:191, in load_model_and_alphabet_core(model_name, model_data, regression_data)
    188     model_data["model"].update(regression_data["model"])
    190 if model_name.startswith("esm2"):
--> 191     model, alphabet, model_state = _load_model_and_alphabet_core_v2(model_data)
    192 else:
...
    181     token_dropout=cfg.token_dropout,
    182 )
    183 return model, alphabet, state_dict

AttributeError: module 'protein_lm.modeling.models.esm' has no attribute 'data'

Did you run into similar problems, @pascalnotin?

I'd also like to hear @jamaliki's input, and that of everyone else reading this, of course :)

jamaliki commented 10 months ago

@Muedi If I had to bet, it would be because you named it esm. Try any other name, like fair_esm or something. This is because the Facebook ESM pip package is also called esm

Also, I now see that we are adding all the files from the esm package into this repo. Is there a reason for that? Can't we just have esm as a dependency?

Muedi commented 10 months ago

The plan was to cut it down to a minimal version once it runs.

However, I also asked myself if we could just use the pip package.

The issue was indeed the name! Which is weird, because as far as I know I do not have the esm package installed (at least not in the conda environment I use).
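One way to check which module a bare `import esm` would pick up, and whether the pip distribution is present in the active environment, is sketched below. The helper names are hypothetical, and the distribution name `fair-esm` is an assumption based on how the Facebook ESM package is published on PyPI (its import name is `esm`). Both helpers return `None` when nothing is found instead of raising.

```python
import importlib.metadata
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level `import name` would load, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return importlib.metadata.version(distribution)
    except importlib.metadata.PackageNotFoundError:
        return None


# Assumption: the PyPI distribution is named "fair-esm" and imports as "esm".
print("import esm resolves to:", module_origin("esm"))
print("fair-esm version:", installed_version("fair-esm"))
```

If the first line prints a path inside the repository or the working directory, a local folder is shadowing whatever is installed under the same name.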

I missed a few args, so I need to change these. I am leaving this open for the discussion about esm pip vs minimal version in folder structure.

talkhanz commented 10 months ago

IMO, we should start off with the pip version, and if we have additional requirements, we can consider the minimal version that @Muedi is referring to.

Muedi commented 10 months ago

It works with the pip package, no problem. I'll try to find a fix for the multi-mutant problem and make another pull request :)
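For reference, loading a model through the pip package could look roughly like the sketch below. It mirrors the `checkpoint.split("/")[-1]` step and the `pretrained.load_model_and_alphabet` call from the traceback above; the helper names and the example checkpoint string are assumptions, not the eval script's actual code. The import is deferred so the name helper can be used without the package installed, and the first call for a given model downloads its weights.

```python
def model_name_from_checkpoint(checkpoint: str) -> str:
    """Take the last path component of a checkpoint string as the model name."""
    return checkpoint.rstrip("/").split("/")[-1]


def load_esm(checkpoint: str):
    """Load a pretrained ESM model and its alphabet via the pip package."""
    import esm  # assumption: provided by `pip install fair-esm`; also needs torch

    model_name = model_name_from_checkpoint(checkpoint)
    model, alphabet = esm.pretrained.load_model_and_alphabet(model_name)
    model.eval()  # disable dropout for inference
    return model, alphabet


# Hypothetical checkpoint string; only the last component is passed to esm.
example_name = model_name_from_checkpoint("facebook/esm2_t6_8M_UR50D")
print(example_name)

# Usage (requires fair-esm, torch, and network access on first run):
# model, alphabet = load_esm("facebook/esm2_t6_8M_UR50D")
```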