agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.

Using ProtBert-BFD-TF for feature extraction #56

Closed (wafaaashraf closed this issue 3 years ago)

wafaaashraf commented 3 years ago

In the presented example for using ProtBert-BFD-TF for feature extraction, the feature matrix's dimensions are [number of example sequences, total number of amino acids across all example sequences, 1024 * total number of amino acids across all example sequences], as per the attached screenshot taken directly from the presented notebook. It should instead be [number of example sequences, max_sequence_length, 1024]. Please elaborate on whether this is a bug (screenshot attached).

agemagician commented 3 years ago

The example results are correct. There are two examples, so len(features) = 2. The first example has 7 amino acids and the second has 5, so len(features[0]) = 7 and len(features[1]) = 5. Finally, our ProtBert model generates 1024 features for each input amino acid.

The output of the model will be in the following form: [batch size, number of amino acids per sample, number of features (1024)]

This example doesn't provide the output probability of each amino acid. However, it does provide a 1024-dimensional feature representation for each input amino acid.
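For reference, here is a minimal sketch of what that TF feature-extraction step looks like (it follows the notebook's approach but is not its exact code; it assumes the Rostlab/prot_bert_bfd checkpoint from the Hugging Face hub and two toy sequences of 7 and 5 residues):

```python
import re
from transformers import BertTokenizer, TFBertModel

# Load the ProtBert-BFD tokenizer and TF model
# (from_pt=True converts the PyTorch weights; assumes torch is installed).
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)
model = TFBertModel.from_pretrained("Rostlab/prot_bert_bfd", from_pt=True)

# Two toy sequences: 7 and 5 amino acids, space separated, rare residues mapped to X.
sequences = ["A E T C Z A O", "S K T Z P"]
sequences = [re.sub(r"[UZOB]", "X", s) for s in sequences]

# Tokenize as one batch; padding makes both rows the same length.
ids = tokenizer(sequences, add_special_tokens=True, padding=True, return_tensors="tf")

# Last hidden state: shape (batch size, padded length, 1024).
embedding = model(ids["input_ids"], attention_mask=ids["attention_mask"])[0].numpy()

# Strip [CLS]/[SEP] and padding so each entry keeps only its own residues.
features = []
for i in range(len(sequences)):
    seq_len = int(ids["attention_mask"][i].numpy().sum())
    features.append(embedding[i, 1:seq_len - 1])

print(len(features))      # 2 sequences
print(features[0].shape)  # (7, 1024)
print(features[1].shape)  # (5, 1024)
```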

wafaaashraf commented 3 years ago

First, I believe it should not be [batch size, number of amino acids per sample, number of features (1024)]. Instead, it should be [batch size, number of amino acids per sequence, number of features (1024)].

Second, the generated features are actually not even [batch size, number of amino acids per sample, number of features (1024)], but rather [batch size, number of amino acids per sample, number of features (1024) * number of amino acids per sample], as you can see from the screenshot.

I believe the only explanation for this is that ProtTrans' embeddings are contextualized across the entire input dataset (in your colab, sequence_examples) instead of being contextualized within each sequence. Is this true? If yes, is there a way to restrict the contextualization to each sequence only? I believe passing sequences individually to the tokenizer should achieve this.
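A quick way to test that hypothesis (a sketch that reuses the tokenizer, model, sequences, and features objects from the snippet above, which are assumptions of this sketch rather than the notebook's variables) is to embed one sequence on its own and compare it with its rows from the batched run; if they match, the context is per sequence rather than per dataset:

```python
import numpy as np

# Embed the second toy sequence on its own (reuses `tokenizer`, `model`,
# `sequences`, and `features` from the earlier sketch).
single = tokenizer([sequences[1]], add_special_tokens=True, return_tensors="tf")
single_emb = model(single["input_ids"], attention_mask=single["attention_mask"])[0].numpy()
single_feat = single_emb[0, 1:-1]  # drop [CLS] and [SEP]

# If embeddings are contextualized per sequence, the two runs should agree
# up to numerical noise.
print(np.allclose(single_feat, features[1], atol=1e-4))
```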

agemagician commented 3 years ago

What I meant here: sequence = sample = example.

The calculations you have made are not correct.

Assuming you have 2 examples and each one has 6 amino acids, the output shape will be: [2, 6, 1024]. This means there are 2 examples/sequences/samples, each one has 6 amino acids, and each amino acid has 1024 features.

In your for loop, you sum all sequences, all amino acids, and all features across the whole dataset, which is not correct. Your current calculation is: [2, 6*2, 2*6*1024] = [2, 12, 12288].

If you want to calculate it correctly, you can simply use NumPy's shape attribute, or reset the counters to zero before you start each nested for loop.
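For instance, a minimal sketch of both options (it reuses the features list from the earlier snippet; the loop variables are illustrative):

```python
import numpy as np

# Option 1: read the per-sequence shape directly.
for i, feat in enumerate(features):
    print(i, np.asarray(feat).shape)  # (7, 1024) then (5, 1024)

# Option 2: if counting by hand, reset the counter for every sequence
# instead of letting it grow across the whole dataset.
for i, feat in enumerate(features):
    n_residues = 0
    for residue_embedding in feat:
        n_residues += 1
    print(i, n_residues, len(residue_embedding))  # 7 or 5 residues, 1024 features each
```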