User story

As a data engineer
I want to enhance the quality of our training data
So that the model can better understand CNCF-specific knowledge

Acceptance criteria

Research and select an appropriate NER extraction solution
- Transformers by HuggingFace
- spaCy
- NLTK
- ...
- (possibly an existing open collection of CNCF entities and their relationships)
Try to find an approach that can automatically identify entities and relationships using our data
Definition of done (DoD)
Added only after week 5
The same for all features
Here goes the project specific part

DoD general criteria

Feature has been fully implemented
Feature has been merged into the mainline
All acceptance criteria were met
Product owner approved features
All tests are passing
Developers agreed to release

Report on NER Tool/Library Research Findings

Our industry partner requires Named Entity Recognition (NER) on the gathered data, so we need to find suitable existing tools and approaches to achieve that.

Findings

This article (https://medium.com/@vkrntkmrsngh/custom-named-entity-recognition-a-solution-for-unstructured-product-data-aa2372eece04) suggests using an existing LLM (preferably one trained to follow instructions) with suitable prompts. Models that should be analyzed and tested for this purpose include:
- GoLLIE (https://github.com/hitz-zentroa/GoLLIE): Model trained for Information Extraction [Apache-2.0 license]
- XLM-RoBERTa (https://huggingface.co/FacebookAI/xlm-roberta-base): Model trained on filtered CommonCrawl data covering 100 languages [MIT license]
- mDebertaV3 (https://huggingface.co/microsoft/mdeberta-v3-base): Improved version of BERT and RoBERTa models [MIT license]
- UDOP (https://huggingface.co/microsoft/udop-large): Model designed for document image classification, document parsing and document visual question answering [MIT license]
- mPLUG-DocOwl 1.5 (https://github.com/X-PLUG/mPLUG-DocOwl): Model designed for document understanding [Apache-2.0 license]
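The prompt-based approach can be sketched roughly as follows. This is a minimal illustration using only the Python standard library. The label set, the prompt wording, the JSON answer format, and the helpers `build_ner_prompt` and `parse_ner_response` are assumptions made for this sketch; they are not taken from the article or from any of the models listed above. No model is called: `fake_response` stands in for whatever an instruction-following model would return.

```python
import json

# Hypothetical label set for CNCF-related text; adjust to the project's needs.
LABELS = ["PROJECT", "ORGANIZATION", "TOOL"]


def build_ner_prompt(text, labels=LABELS):
    """Build an instruction prompt asking a model to return entities as JSON."""
    label_list = ", ".join(labels)
    return (
        "Extract all named entities from the text below.\n"
        f"Allowed labels: {label_list}.\n"
        "Answer only with a JSON list of objects of the form "
        '{"text": "...", "label": "..."}.\n\n'
        f"Text: {text}"
    )


def parse_ner_response(response, source_text, labels=LABELS):
    """Parse the model answer and drop entries that are malformed,
    use an unknown label, or do not occur verbatim in the source text."""
    try:
        items = json.loads(response)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        if not isinstance(item, dict):
            continue
        ent_text, label = item.get("text"), item.get("label")
        if not isinstance(ent_text, str) or not ent_text:
            continue
        if label not in labels:
            continue
        if ent_text in source_text:
            entities.append((ent_text, label))
    return entities


text = "Kubernetes was donated to the CNCF by Google."
prompt = build_ner_prompt(text)

# Stand-in for a real model answer; no model is called in this sketch.
fake_response = json.dumps([
    {"text": "Kubernetes", "label": "PROJECT"},
    {"text": "Google", "label": "ORGANIZATION"},
    {"text": "Helm", "label": "PROJECT"},     # not in the text, filtered out
    {"text": "CNCF", "label": "FOUNDATION"},  # unknown label, filtered out
])
entities = parse_ner_response(fake_response, text)
print(entities)
```

Checking each returned entity against the source text and the allowed labels is one simple way to guard against answers that do not follow the requested format; whether this is sufficient would need to be tested with the actual models.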
Another option would be to use existing solutions, such as the Python packages Natural Language Toolkit (NLTK) (https://github.com/nltk/nltk) [Apache-2.0 license] or spaCy (https://github.com/explosion/spaCy) [MIT license], as described in these articles (https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da, https://fouadroumieh.medium.com/nlp-entity-extraction-ner-using-python-nltk-68649e65e54b).

Conclusion

There are several options that need to be tested to arrive at the best possible solution. As a first step, and if the available resources allow it, using an LLM for the NER task seems promising and should be investigated further.
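One possible way to compare the candidates is to score each tool's output against a small hand-labelled sample. The sketch below computes entity-level precision, recall, and F1 on exact `(text, label)` matches using only the Python standard library. The function `prf1`, the label names, and the candidate outputs are assumptions made up for illustration; they are not measurements of any tool mentioned in this report.

```python
def prf1(gold, predicted):
    """Entity-level precision, recall and F1 on exact (text, label) matches."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hand-labelled reference annotation (illustrative only).
gold = {
    ("Kubernetes", "PROJECT"),
    ("Prometheus", "PROJECT"),
    ("Google", "ORGANIZATION"),
}

# Hypothetical outputs of two candidate tools, invented for this sketch.
candidates = {
    "candidate_a": {
        ("Kubernetes", "PROJECT"),
        ("Google", "ORGANIZATION"),
    },
    "candidate_b": {
        ("Kubernetes", "PROJECT"),
        ("Prometheus", "ORGANIZATION"),  # right span, wrong label
        ("Google", "ORGANIZATION"),
        ("pods", "PROJECT"),             # spurious entity
    },
}

scores = {name: prf1(gold, pred) for name, pred in candidates.items()}
for name, (p, r, f) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact matching is strict: an entity with the right text but the wrong label counts as both a false positive and a false negative. A more lenient scheme (for example partial span overlap) may be more appropriate depending on how the extracted entities are used.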

amosproj / amos2024ss08-cloud-native-llm

Research and Select a Suitable NER Tool or Library #44