Hello,
Yes, you have to give your protein sequences as input to one of our models, and it will produce their features.
Publishing embeddings for all protein sequences in the world is not feasible. For example, the BFD dataset contains billions of protein sequences; generating embeddings for every single protein in it would take a very long time, and storing them would require terabytes of storage.
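For concreteness, here is a minimal sketch of generating embeddings on the fly for a single sequence. It is not taken from the repository; it assumes the Hugging Face transformers API and the published Rostlab/prot_bert_bfd checkpoint, with the same preprocessing used in the ProtTrans examples (space-separated residues, rare amino acids mapped to X):

```python
# Minimal sketch: extract per-residue embeddings with ProtBert-BFD
# (assumes the transformers and torch packages are installed).
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert_bfd")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# ProtBert expects space-separated residues; rare amino acids are mapped to X.
sequence = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():  # feature extraction only, no weight updates
    outputs = model(**inputs)

# Per-residue embeddings (drop the [CLS]/[SEP] special tokens), shape: (seq_len, 1024)
residue_embeddings = outputs.last_hidden_state[0, 1:-1]
# A fixed-size, per-protein embedding can be obtained by mean pooling.
protein_embedding = residue_embeddings.mean(dim=0)
```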
Using the embedding output for prediction should be relatively easy if you have a basic machine learning background. If you find it hard, I would recommend checking the following examples:
https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/ProtBert-BFD-FineTuning-PyTorchLightning-Localization.ipynb
https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/ProtBert-BFD-FineTuning-PyTorchLightning-MS.ipynb
In the above examples, you can fine-tune the model end-to-end on your own dataset.
Otherwise, you can simply use one of our models to generate the representation as a NumPy array, store it on disk, and then load it in your own script. Here is an example of doing that: https://stackoverflow.com/questions/62710872/how-to-store-word-vector-embeddings
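As a rough sketch of that workflow (the file names, the random placeholder data, and the scikit-learn classifier are only illustrative), the embeddings can be saved once and then reused by any downstream predictor:

```python
# Sketch: store precomputed embeddings as NumPy arrays and reuse them later.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: in practice these would be the mean-pooled ProtBert-BFD
# vectors (1024-dim) and the labels of your own dataset.
embeddings = np.random.rand(100, 1024).astype(np.float32)
labels = np.random.randint(0, 2, size=100)

# Store once ...
np.save("embeddings.npy", embeddings)
np.save("labels.npy", labels)

# ... and load them in a separate script for downstream prediction.
X = np.load("embeddings.npy")
y = np.load("labels.npy")
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))  # use a held-out split in practice
```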
Does this mean that these feature extraction Colab notebooks basically produce dataset-specific embeddings, or do they fine-tune the BFD general-purpose embedding to a specific dataset?
The feature extraction examples, like https://github.com/agemagician/ProtTrans/blob/master/Embedding/PyTorch/Advanced/ProtBert-BFD.ipynb, show you how to extract the features of any protein in the world without fine-tuning the model, i.e. without changing the model weights.
The fine-tuning examples, like https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/ProtBert-BFD-FineTuning-PyTorchLightning-Localization.ipynb, show you how to fine-tune the model (change its weights) for a specific use case/dataset.
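To make the distinction concrete, here is a small illustrative sketch (the FINE_TUNE flag and the task head are hypothetical, not taken from the notebooks): with frozen weights you only train a head on top of the extracted features, while fine-tuning also updates the pretrained backbone:

```python
import torch
from transformers import BertModel

# Hypothetical switch: False = feature extraction (frozen weights),
# True = fine-tuning (backbone weights are updated for your dataset).
FINE_TUNE = False

backbone = BertModel.from_pretrained("Rostlab/prot_bert_bfd")
for param in backbone.parameters():
    param.requires_grad = FINE_TUNE  # frozen unless we fine-tune

# A hypothetical task head on top of the 1024-dim hidden states.
head = torch.nn.Linear(backbone.config.hidden_size, 10)

# Only parameters with requires_grad=True are optimized, so in the
# feature-extraction setting the pretrained weights never change.
trainable = [p for p in list(backbone.parameters()) + list(head.parameters())
             if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```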
Hi, thank you for your amazing work. I have a question about using the ProtBert-BFD embeddings. As far as I understand, the embeddings themselves are not publicly available, right? The only way to use them is through inference, as in the examples provided in your code. This makes it really hard to pass them as an embedding matrix to an embedding layer, or as input to any encoder, because of their dimensions. Can you help me figure out how to achieve this? What I was thinking is to run inference on an example that contains all the amino acids and manually append their embeddings to a dictionary so I can use them. Can it be that simple? I also figure this doesn't retain their contextuality, so can you think of any other way they can be used while retaining their contextuality?
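For reference, the dictionary idea described above could look roughly like the sketch below (the sequence and variable names are made up). As suspected, each vector is tied to the single context it was computed in, so the contextual information is lost; contextual embeddings still require running every full sequence through the model itself:

```python
# Sketch: run one sequence containing every amino acid through ProtBert-BFD
# and collect one vector per residue type into a static lookup table.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert_bfd")
model.eval()

amino_acids = "ACDEFGHIKLMNPQRSTVWY"  # one occurrence of each standard residue
inputs = tokenizer(" ".join(amino_acids), return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0, 1:-1]  # drop [CLS]/[SEP]

# Static, context-stripped lookup: amino acid -> 1024-dim vector.
aa_embeddings = {aa: hidden[i] for i, aa in enumerate(amino_acids)}

# This table can initialize an embedding layer, but it is no longer contextual.
embedding_matrix = torch.stack([aa_embeddings[aa] for aa in amino_acids])
embedding_layer = torch.nn.Embedding.from_pretrained(embedding_matrix)
```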