georgeamccarthy opened 3 years ago
Not sure if we're going to merge this, but I need it on GCP without the Dockerization merged. Could probably use a simpler model files structure: https://huggingface.co/Rostlab/prot_bert/tree/main
There may be a simpler way to get around the issue. If I try to download the model with a simple script:
```python
from transformers import BertModel, BertTokenizer

model_path = "Rostlab/prot_bert"

print("Loading tokenizer.")
tokenizer = BertTokenizer.from_pretrained(model_path, do_lower_case=False)

print("Loading model.")
model = BertModel.from_pretrained(model_path)

# These assignments only make sense inside a class; `self` is undefined
# in a standalone script.
# self.tokenizer = tokenizer
# self.model = model

print("Done.")
```
then the system runs out of RAM (~1 GB) and the process is killed with `Killed`.
To monitor RAM usage: `ps -m -o %cpu,%mem,command`
Instead of downloading the repo, I might just be able to configure the download to use a disk cache.
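A minimal sketch of that idea, assuming the standard `cache_dir` argument of `from_pretrained` (the cache path here is a placeholder; by default transformers caches under `~/.cache/huggingface`):

```python
from transformers import BertModel, BertTokenizer

model_path = "Rostlab/prot_bert"
cache_dir = "/mnt/disk/hf_cache"  # placeholder path on an attached disk

tokenizer = BertTokenizer.from_pretrained(
    model_path, do_lower_case=False, cache_dir=cache_dir
)
model = BertModel.from_pretrained(model_path, cache_dir=cache_dir)
```

Note this only controls where the downloaded files land on disk; the weights still have to fit in RAM when `from_pretrained` loads them.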
PR type
Purpose
Why?
Extra info
New `protein_search/models` directory to store models in. Models were downloaded from Hugging Face and then I moved them into these dirs.
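For reference, a minimal sketch of loading from such a local directory; the `prot_bert` subdirectory name is an assumption about the layout under `protein_search/models`:

```python
from transformers import BertModel, BertTokenizer

# Hypothetical local path; from_pretrained accepts a directory containing
# the config, vocab, and weight files instead of a Hub model id.
local_path = "protein_search/models/prot_bert"

tokenizer = BertTokenizer.from_pretrained(local_path, do_lower_case=False)
model = BertModel.from_pretrained(local_path)
```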
Feedback required over
Mentions
References
Legal