georgeamccarthy / protein_search

The neural search engine for proteins.
GNU Affero General Public License v3.0

[WIP] Load or download the model. #55

Open georgeamccarthy opened 3 years ago

georgeamccarthy commented 3 years ago

PR type

Purpose

Why?

Extra info

New protein_search/models directory to store models in.

models/
└── prot_bert
    ├── model
    │   ├── config.json
    │   └── pytorch_model.bin
    └── tokenizer
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        └── vocab.txt

I downloaded the models from Hugging Face and then moved them into these directories.
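A minimal sketch of the "load or download" idea from the PR title, assuming the directory layout above. The names `MODELS_DIR`, `local_copy_exists` and `load_or_download` are hypothetical and not taken from the repository. Note that recent versions of transformers may write `model.safetensors` instead of `pytorch_model.bin`, so the check below only looks at `config.json` and `vocab.txt`.

```python
from pathlib import Path

# Hypothetical location; the PR only says the models live in a models/ directory.
MODELS_DIR = Path("models") / "prot_bert"
HUB_NAME = "Rostlab/prot_bert"


def local_copy_exists(root: Path = MODELS_DIR) -> bool:
    """Return True if both the model config and the tokenizer vocab are on disk."""
    model_cfg = root / "model" / "config.json"
    vocab = root / "tokenizer" / "vocab.txt"
    return model_cfg.is_file() and vocab.is_file()


def load_or_download(root: Path = MODELS_DIR):
    """Load ProtBert from disk if present, otherwise download and save it."""
    # Imported lazily so the path helper works without transformers installed.
    from transformers import BertModel, BertTokenizer

    if local_copy_exists(root):
        tokenizer = BertTokenizer.from_pretrained(
            str(root / "tokenizer"), do_lower_case=False
        )
        model = BertModel.from_pretrained(str(root / "model"))
    else:
        tokenizer = BertTokenizer.from_pretrained(HUB_NAME, do_lower_case=False)
        model = BertModel.from_pretrained(HUB_NAME)
        # save_pretrained writes the files into the layout shown above.
        tokenizer.save_pretrained(str(root / "tokenizer"))
        model.save_pretrained(str(root / "model"))
    return tokenizer, model
```

With this, the first call downloads and saves the files, and later calls read them from `models/prot_bert` without touching the network.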

Feedback required over

Mentions

References

Legal

georgeamccarthy commented 3 years ago

Not sure if I'm going to merge this, but I need it on GCP without the Dockerization merged. Could probably use a simpler model file structure, like the one at https://huggingface.co/Rostlab/prot_bert/tree/main

georgeamccarthy commented 3 years ago

There may be a simpler way to get around the issue. If I try to download the model with a simple script

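from transformers import BertModel, BertTokenizer

# Hugging Face Hub identifier for ProtBert.
model_path = "Rostlab/prot_bert"

print("Loading tokenizer.")
# do_lower_case=False keeps the input case unchanged.
tokenizer = BertTokenizer.from_pretrained(model_path, do_lower_case=False)
print("Loading model.")
model = BertModel.from_pretrained(model_path)

# Inside a class these would be assigned to self.tokenizer and self.model;
# a standalone script has no self, so the plain variables are used here.

print("Done.")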

then the system runs out of RAM (~1 GB) and the process is terminated with the message Killed.

To monitor RAM usage: ps -m -o %cpu,%mem,command
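As a complement to `ps`, peak memory can also be reported from inside the script. This is only a sketch using the standard-library `resource` module (Unix only); the helper name `peak_rss_mb` is made up for illustration.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


print(f"Peak RSS before loading: {peak_rss_mb():.0f} MB")
# ... load the tokenizer and model here ...
print(f"Peak RSS after loading: {peak_rss_mb():.0f} MB")
```

If the process is killed before the second print, the last value printed still gives a lower bound on how much memory the load needed.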

Instead of downloading the repo, I might just be able to configure the download to use a disk cache.
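One way this could look, as an untested sketch: `from_pretrained` accepts a `cache_dir` argument that controls where downloaded files are stored. The directory `models/cache` and the helper names are assumptions for illustration. Whether this avoids the out-of-memory kill is not established here, since the weights still have to be loaded into RAM after the download.

```python
from pathlib import Path


def cache_kwargs(cache_dir: str) -> dict:
    """Create the cache directory and return keyword arguments for from_pretrained."""
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    return {"cache_dir": cache_dir}


def load_with_disk_cache(cache_dir: str = "models/cache"):
    """Download ProtBert into cache_dir (hypothetical path) and load it."""
    # Imported lazily so cache_kwargs works without transformers installed.
    from transformers import BertModel, BertTokenizer

    kwargs = cache_kwargs(cache_dir)
    tokenizer = BertTokenizer.from_pretrained(
        "Rostlab/prot_bert", do_lower_case=False, **kwargs
    )
    model = BertModel.from_pretrained("Rostlab/prot_bert", **kwargs)
    return tokenizer, model
```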