Computing Embeddings from Transformer?

Merck / BioPhi

BioPhi is an open-source antibody design platform. It features methods for automated antibody humanization (Sapiens), humanness evaluation (OASis) and an interface for computer-assisted antibody sequence design.

https://biophi.dichlab.org/

MIT License


Computing Embeddings from Transformer? #20

Closed: nickbhat closed this issue 2 years ago

nickbhat commented 2 years ago

Hello,

Is there any ability to create an API for producing embeddings from input sequences? Users could implement this themselves if #19 turns out to be true. If not, perhaps an embedding API could be exposed without having to make weights publicly available?

Thanks!

prihoda commented 2 years ago

Hi @nickbhat, there's a function intended for this: https://github.com/Merck/BioPhi/blob/f764b6188211eb2d4e6b4f091729f5ab89c7e406/biophi/humanization/methods/sapiens/predict.py#L36

However, it looks like the return_all_hiddens arg is actually ignored :) Do you want to create a pull request by any chance? I'm always happy to get more contributors.

It would be a simple fix here: https://github.com/Merck/BioPhi/blob/f764b6188211eb2d4e6b4f091729f5ab89c7e406/biophi/humanization/methods/sapiens/roberta.py#L76

Btw, this way we can only get the embeddings. If you also want the attention weights, that would require some changes to the fairseq code.

nickbhat commented 2 years ago

I'll take a look at this and submit a PR! Always happy to contribute :) I'll get around to it in a couple of weeks, if that's okay.

prihoda commented 2 years ago

Fixed by https://github.com/Merck/BioPhi/pull/21/

Example usage:

from biophi.humanization.methods.sapiens.predict import sapiens_predict_seq

# seq must be defined beforehand as an amino acid string
pred, extra = sapiens_predict_seq(
    seq=seq,  # variable region sequence only
    chain_type='H',  # 'H' (heavy) or 'L' (light)
    return_all_hiddens=True  # also return the hidden states in `extra`
)

# Hidden states, one entry per layer
embeddings_per_layer = extra['inner_states']
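If you want a single fixed-length vector per sequence, one common option is to mean-pool one layer over the sequence positions. The sketch below is not part of BioPhi. It assumes each entry of embeddings_per_layer is a 2-D array-like of shape (positions, embedding dimension), which this thread does not confirm; if there is an extra batch dimension you would need to index it away first. The helper names mean_pool and pooled_embedding are hypothetical, and the toy data stands in for extra['inner_states'] so the snippet runs without BioPhi installed.

```python
from statistics import fmean
from typing import List, Sequence


def mean_pool(layer: Sequence[Sequence[float]]) -> List[float]:
    """Average a (positions x dim) matrix over positions, giving one vector of length dim."""
    if len(layer) == 0:
        raise ValueError("layer has no positions")
    # zip(*layer) iterates over embedding dimensions (columns)
    return [fmean(column) for column in zip(*layer)]


def pooled_embedding(embeddings_per_layer, layer_index: int = -1) -> List[float]:
    """Mean-pool one layer (the last one by default) into a single vector.

    ASSUMPTION: each layer is 2-D, shaped (positions, embedding dimension).
    """
    layer = embeddings_per_layer[layer_index]
    # NumPy arrays and PyTorch tensors both provide .tolist()
    if hasattr(layer, "tolist"):
        layer = layer.tolist()
    return mean_pool(layer)


# Toy stand-in for extra['inner_states']: 2 layers, 3 positions, 2 dimensions
toy_inner_states = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
]

print(pooled_embedding(toy_inner_states))  # [3.0, 4.0]
```

With real output you would call pooled_embedding(embeddings_per_layer) after checking the actual shape of each entry. Which layer works best for a downstream task is an empirical question, so layer_index is left as a parameter.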