amosproj / amos2024ss08-cloud-native-llm

MIT License

Research and Select a Suitable NER Tool or Library #44

Closed grayJiaaoLi closed 3 weeks ago

grayJiaaoLi commented 1 month ago

User story

  1. As a data engineer
  2. I want to enhance the quality of our training data
  3. So that the model can better understand CNCF-specific knowledge

Acceptance criteria

DoD general criteria

dnsch commented 1 month ago

Report on NER Tool/Library Research Findings

Our industry partner requires Named Entity Recognition (NER) on the gathered data, so we need to find suitable existing tools or approaches to achieve that.

Findings

  1. This article (https://medium.com/@vkrntkmrsngh/custom-named-entity-recognition-a-solution-for-unstructured-product-data-aa2372eece04) suggests using an existing LLM (preferably one that is trained to follow instructions) with suitable prompts. Models that should be analyzed/tested for this purpose include:
  2. Another option would be to use existing solutions, such as the Python packages Natural Language Toolkit (NLTK) (https://github.com/nltk/nltk) [Apache-2.0 license] or spaCy (https://github.com/explosion/spaCy) [MIT license], as described in these articles (https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da, https://fouadroumieh.medium.com/nlp-entity-extraction-ner-using-python-nltk-68649e65e54b).
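
To make the prompt-based approach from the first finding more concrete, here is a minimal sketch of how it could look. It is not the method from the linked article: the label set, the prompt wording, and the helpers `build_ner_prompt`, `parse_entities`, and `fake_llm` are all hypothetical, and `fake_llm` only returns a canned reply in place of a real instruction-tuned model.

```python
import json

# Hypothetical label set; the issue does not say which entity types are needed.
LABELS = ["PROJECT", "ORGANIZATION", "TOOL"]


def build_ner_prompt(text, labels=LABELS):
    """Build an instruction prompt that asks the model for entities as JSON."""
    return (
        "Extract named entities from the text below.\n"
        f"Allowed labels: {', '.join(labels)}.\n"
        'Answer only with a JSON list of objects like {"text": "...", "label": "..."}.\n\n'
        f"Text: {text}"
    )


def parse_entities(raw, labels=LABELS):
    """Parse the model reply; drop malformed items and unknown labels."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        if (
            isinstance(item, dict)
            and isinstance(item.get("text"), str)
            and item.get("label") in labels
        ):
            entities.append((item["text"], item["label"]))
    return entities


def fake_llm(prompt):
    """Stand-in for a real model call (assumption); returns a canned reply."""
    return (
        '[{"text": "Kubernetes", "label": "PROJECT"}, '
        '{"text": "CNCF", "label": "ORGANIZATION"}, '
        '{"text": "x", "label": "UNKNOWN"}]'
    )


prompt = build_ner_prompt("Kubernetes is a graduated CNCF project.")
entities = parse_entities(fake_llm(prompt))
print(entities)
```

Validating the reply against a fixed label set is one way to guard against a model that invents labels or returns text that is not valid JSON; whether that is strict enough would need to be checked with the real models.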

Conclusion

There are several options that need to be tested to arrive at the best possible solution. As a first step, and if the available resources allow for it, using an LLM for the NER task seems promising and should be investigated further.
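
One possible way to compare the candidate approaches is to score each tool's output against a small hand-labeled sample. The sketch below assumes entities are represented as `(text, label)` pairs and uses exact-match precision, recall, and F1; the function name `score_entities`, the gold sample, and the two tool outputs are made up for illustration and are not results from this project.

```python
def score_entities(predicted, gold):
    """Exact-match precision, recall, and F1 over (text, label) pairs."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical hand-labeled sample and hypothetical tool outputs.
gold = [("Kubernetes", "PROJECT"), ("CNCF", "ORGANIZATION"), ("Helm", "TOOL")]
tool_a = [("Kubernetes", "PROJECT"), ("CNCF", "ORGANIZATION")]
tool_b = [("Kubernetes", "PROJECT"), ("Helm", "PROJECT")]

scores_a = score_entities(tool_a, gold)
scores_b = score_entities(tool_b, gold)
print("tool_a:", scores_a)
print("tool_b:", scores_b)
```

Exact matching is deliberately strict: a correct span with the wrong label counts as a miss. A more lenient comparison (for example, ignoring case or scoring spans and labels separately) may be more appropriate depending on how the entities are used later.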