dmis-lab/biobert

Bioinformatics 2020: BioBERT: a pre-trained biomedical language representation model for biomedical text mining
http://doi.org/10.1093/bioinformatics/btz682

Cosine similarity is high between sentence embeddings for an 'asthma' abstract (medical domain) and an abstract from a non-medical domain. Why? #125

Open · Sumedh1505 opened this issue 4 years ago

Sumedh1505 commented 4 years ago

Hi Team,

I created an embedding for an abstract about 'asthma' and compared it with the embedding of an abstract from a non-medical domain. The cosine similarity was > 0.9.

This is the code snippet I used to create the embeddings:

```python
import unicodedata

import torch
from keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm

def get_embeddings(df, text_col_name, tokenizer, model):
    # Strip non-ASCII characters and make sure every abstract ends with a period.
    df[text_col_name] = df[text_col_name].apply(lambda x: str(unicodedata.normalize('NFKD', x).encode('ascii', 'ignore'))[2:-1])
    df[text_col_name] = df[text_col_name].apply(lambda x: x if x.endswith('.') else x + '.')
    vectors = []
    for each_row in tqdm(df[text_col_name]):
        # Tokenize, then pad/truncate every abstract to a fixed length of 275.
        indexed_tokens = tokenizer.encode(each_row, add_special_tokens=True, max_length=275)
        indexed_tokens = pad_sequences([indexed_tokens], maxlen=275, truncating='post', padding='post')
        # 1 for real tokens, 0 for padding: this is an attention mask, not segment ids.
        attention_mask = [int(i > 0) for i in indexed_tokens[0]]
        tokens_tensor = torch.tensor([indexed_tokens[0].tolist()])
        mask_tensor = torch.tensor([attention_mask])
        model.eval()
        with torch.no_grad():
            outputs = model(tokens_tensor, attention_mask=mask_tensor)
            # outputs[2] holds all hidden states; the model must be loaded
            # with output_hidden_states=True for this index to exist.
            hidden_states = outputs[2]
        # Second-to-last layer, mean-pooled over all 275 positions (padding included).
        token_vecs = hidden_states[-2][0]
        vectors.append(torch.mean(token_vecs, dim=0))
    return vectors
```

Is there something wrong I'm doing unknowingly?
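
For completeness, here is a minimal sketch of the loading and comparison step; the exact code was not part of my snippet above, so the checkpoint name (`dmis-lab/biobert-base-cased-v1.1`), the placeholder abstracts, and the use of `torch.nn.functional.cosine_similarity` are illustrative, assuming the Hugging Face transformers API:

```python
import pandas as pd
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

# Assumed BioBERT checkpoint from the Hugging Face hub; output_hidden_states=True
# is required so that outputs[2] exists inside get_embeddings above.
tokenizer = BertTokenizer.from_pretrained('dmis-lab/biobert-base-cased-v1.1')
model = BertModel.from_pretrained('dmis-lab/biobert-base-cased-v1.1',
                                  output_hidden_states=True)

# One medical and one non-medical abstract (placeholder text).
df = pd.DataFrame({'abstract': ['Asthma is a chronic inflammatory disease of the airways.',
                                'The stock market closed higher on strong earnings reports.']})

vectors = get_embeddings(df, 'abstract', tokenizer, model)
similarity = F.cosine_similarity(vectors[0].unsqueeze(0), vectors[1].unsqueeze(0))
print(similarity.item())  # this is where I see values > 0.9
```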