Eulring / VANER

Biomedical Named Entity Recognition by LLM

DBR integration #1

Open ferrazzipietro opened 5 months ago

ferrazzipietro commented 5 months ago

Hi, I've gone through your preprint at https://arxiv.org/abs/2404.17835, nice work! Looking at your code here, it is not clear to me what labels you assign to the prepended instructions at training time. All I see is the token-based classification fine-tuning pipeline as provided by https://github.com/4AI/LS-LLaMA, where you prepend the DBR tokens with the corresponding labels. Could you provide any details? Thanks for sharing the repo, very interesting.

Eulring commented 2 months ago

Thank you for your attention to our work! I apologize for the late reply. I have been quite busy and forgot to check GitHub.

Regarding how the entities used by DBR are collected and constructed, we will provide the code for this as soon as possible.

Here is a brief description of the process:

Step 1: Prepare the knowledge base. We use UMLS and the entities in the training split of the dataset.

Step 2: For each training sample used in instruction fine-tuning, we first run the phrase-splitting tool AutoPhrase (https://github.com/shangjingbo1226/AutoPhrase) to slice the text into a series of phrases, and we then filter out some of these phrases using a common-words vocabulary.

Step 3: Use these phrases as queries to search the knowledge base from Step 1 for related entities, using semantic similarity (https://github.com/cambridgeltl/sapbert) for the search. The retrieved entities whose categories match those in the dataset serve as positive samples, and negative samples are also selected at a certain ratio.
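
Until the actual code is released, the three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the repository's implementation: the toy `KNOWLEDGE_BASE` stands in for UMLS plus the training-split entities, `candidate_phrases` is a naive n-gram stand-in for AutoPhrase, `embed` uses character trigrams in place of SapBERT embeddings, and the similarity threshold, the negative-sampling rule (random draws from the remaining knowledge-base entities), and all function names are hypothetical.

```python
import math
import random
from collections import Counter

# Step 1 (toy stand-in): entity string -> category.
# The thread describes UMLS plus training-split entities; these rows are made up.
KNOWLEDGE_BASE = {
    "aspirin": "Chemical",
    "ibuprofen": "Chemical",
    "headache": "Disease",
    "migraine": "Disease",
    "brca1": "Gene",
    "hospital": "Facility",
}

# Hypothetical common-words vocabulary used to filter phrases.
COMMON_WORDS = {"the", "a", "of", "for", "with", "was", "patient", "took"}


def candidate_phrases(sentence, max_len=2):
    """Step 2 stand-in for AutoPhrase: word n-grams, minus all-common-word ones."""
    words = [w.strip(".,;:").lower() for w in sentence.split()]
    words = [w for w in words if w]
    phrases = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if all(w in COMMON_WORDS for w in gram):
                continue
            phrases.append(" ".join(gram))
    return phrases


def embed(text, n=3):
    """Stand-in for a SapBERT embedding: a bag of character trigrams."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(phrases, kb, threshold=0.5):
    """Step 3: KB entities whose similarity to any query phrase meets the threshold."""
    kb_vecs = {entity: embed(entity) for entity in kb}
    hits = set()
    for phrase in phrases:
        phrase_vec = embed(phrase)
        for entity, entity_vec in kb_vecs.items():
            if cosine(phrase_vec, entity_vec) >= threshold:
                hits.add(entity)
    return hits


def build_samples(sentence, kb, dataset_categories, neg_ratio=1.0, seed=0):
    """Positives: retrieved entities in a dataset category.
    Negatives (assumption): random draws from the remaining KB entities."""
    hits = retrieve(candidate_phrases(sentence), kb)
    positives = sorted(e for e in hits if kb[e] in dataset_categories)
    pool = sorted(e for e in kb if e not in positives)
    k = min(len(pool), round(len(positives) * neg_ratio))
    negatives = random.Random(seed).sample(pool, k)
    return {"positives": positives, "negatives": negatives}


samples = build_samples(
    "The patient took aspirin for a migraine.",
    KNOWLEDGE_BASE,
    dataset_categories={"Chemical", "Disease"},
)
print(samples["positives"], samples["negatives"])
```

In a real setup, `embed` and `cosine` would presumably be replaced by SapBERT embeddings with a nearest-neighbour search over the full knowledge base, and `neg_ratio` corresponds to the "certain ratio" mentioned in Step 3, whose actual value is not given in this thread.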