NER: missing README.md

aaronkaplan commented 8 months ago

@priamai, @Brandl: I was a bit confused when looking at all the new code. At first glance it looks very cool, but it's confusing. May I ask for a README.md in the NER/ directory and in the corresponding subdirs?

priamai commented 8 months ago

Yes, sure. The main idea is to wrap the entire extraction in a unified interface, so that the heuristics, the vanilla spaCy/Flair models and the trained BERT model are all called and their entities are merged into one output. spacy-llm, for instance, can be included as one of the extractors. Anyway, yes, I will expand the README.
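To make the idea concrete, here is a minimal sketch of what such a unified interface could look like. All names (`Entity`, `BaseExtractor`, `RegexExtractor`, `ExtractionPipeline`) are hypothetical and not taken from the repository, and the merge rule (drop exact duplicates of offsets plus label) is just one assumed strategy. Only a regex heuristic is shown; a spaCy, Flair or BERT extractor would subclass the same base class.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity record shared by all extractors."""
    text: str
    label: str
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    source: str  # name of the extractor that produced it


class BaseExtractor(ABC):
    """Common interface: every extractor turns text into a list of entities."""
    name = "base"

    @abstractmethod
    def extract(self, text: str) -> List[Entity]:
        ...


class RegexExtractor(BaseExtractor):
    """Heuristic extractor with one regular expression per label."""
    name = "regex"

    def __init__(self, patterns: Dict[str, str]):
        self.patterns = {label: re.compile(p) for label, p in patterns.items()}

    def extract(self, text: str) -> List[Entity]:
        return [
            Entity(m.group(), label, m.start(), m.end(), self.name)
            for label, pattern in self.patterns.items()
            for m in pattern.finditer(text)
        ]


class ExtractionPipeline:
    """Calls every extractor and merges the results into one output."""

    def __init__(self, extractors: Iterable[BaseExtractor]):
        self.extractors = list(extractors)

    def extract(self, text: str) -> List[Entity]:
        seen = set()
        merged = []
        for extractor in self.extractors:
            for ent in extractor.extract(text):
                key = (ent.start, ent.end, ent.label)
                if key not in seen:  # assumed rule: keep the first exact duplicate only
                    seen.add(key)
                    merged.append(ent)
        return sorted(merged, key=lambda e: (e.start, e.end))


# Two extractors that partly agree; the duplicate CVE hit is merged away.
pipeline = ExtractionPipeline([
    RegexExtractor({"CVE": r"CVE-\d{4}-\d{4,7}"}),
    RegexExtractor({"CVE": r"CVE-\d{4}-\d+", "IPV4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b"}),
])
entities = pipeline.extract("CVE-2021-44228 was exploited from 10.0.0.5")
```

This merge only removes exact duplicates; partially overlapping entities from different extractors are kept as they are and would need a separate strategy.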

Brandl commented 8 months ago

@priamai I tried something similar: implementing NER in a single class for the web interface. I took inspiration from your code, and the database schema now maps 1:1 to the entity class.

Some added complexities I've come across:

1) Tokenization: For creating a dataset, we need to tokenize the input text; I do this with spaCy. Tokenization is closely related to alignment, so we should probably do both in the extractor base class?
2) Alignment: Matching the extracted tokens to their occurrences in the text. Some models do this themselves. spacy-llm comes with its own alignment logic, but the LLM part is not aware of positioning. My naive approach was converting matches to regex. This works somewhat, but there are definitely nuances in this problem.
3) Entity overlap: spaCy strictly forbids overlap, though there are some cases where it might occur, e.g. (malicious (Microsoft Excel) macro), with Tag 1: malicious Microsoft Excel macro and Tag 2: Microsoft Excel. Another example is US Cyber Command, with US -> Country and US Cyber Command -> Organisation. So while my data structure is in theory a valid spaCy Doc, the overlap prevents me from importing it. We could also try to find a strategy so that we don't allow overlapping entities.
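For reference, the naive regex alignment from point 2 can be sketched roughly as below. This is an illustration, not the code from ner.py: the helper names `align` and `overlaps` are made up, matching is case-sensitive, and whitespace between tokens is treated as flexible. It also shows how the overlap from point 3 appears as soon as both US and US Cyber Command are aligned.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def align(text: str, surface: str) -> List[Span]:
    """Return the offsets of every occurrence of an extracted string in the text."""
    tokens = surface.split()
    if not tokens:
        return []
    # Escape each token and allow any run of whitespace between tokens,
    # since a model may normalise spaces or line breaks in its output.
    body = r"\s+".join(re.escape(tok) for tok in tokens)
    # Lookarounds instead of \b, so strings that start or end with
    # punctuation can still match, while "US" inside "malicious" cannot.
    pattern = r"(?<!\w)" + body + r"(?!\w)"
    return [m.span() for m in re.finditer(pattern, text)]


def overlaps(a: Span, b: Span) -> bool:
    """True if two half-open character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


text = "US Cyber Command warned about a malicious Microsoft Excel macro."
country = align(text, "US")
organisation = align(text, "US Cyber Command")
product = align(text, "Microsoft  Excel")  # extra space in the model output
```

Some of the nuances are visible even here: a string that occurs several times yields several candidate spans, and nothing in the regex says which occurrence the model meant.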

I like your effort, because it's probably the next milestone for the web UI to make extraction more robust. Right now it's basically a ball of duct tape whose only goal is to make it work: https://github.com/aaronkaplan/cti-llm/blob/main/NER/web/apps/ner/ner.py

Maybe we could do an online meeting to talk about the architecture and an interface, so we can glue this together later on.

priamai commented 8 months ago

Yes, those are very annoying problems.

About 3: I saw that the suggestion is to use a span categorizer, as in this example: https://github.com/explosion/projects/tree/v3/experimental/ner_spancat. The other approach would be to use multiple instances with non-overlapping entities, but as you said, this would be a nightmare to manage.
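To illustrate the second approach, here is a rough sketch of how overlapping annotations could be split into layers that are each free of overlap, so that every layer could in principle go into its own instance. The function name `split_into_layers` and the tuple layout are assumptions for this example; the offsets and labels follow the two examples given above, and the greedy assignment is a standard interval-partitioning heuristic, not something taken from spaCy.

```python
from typing import List, Tuple

Annotation = Tuple[int, int, str]  # (start, end, label), end exclusive


def split_into_layers(spans: List[Annotation]) -> List[List[Annotation]]:
    """Greedily assign each span to the first layer where it does not overlap."""
    layers: List[List[Annotation]] = []
    # Sort by start offset; for equal starts, put the longer span first.
    for span in sorted(spans, key=lambda s: (s[0], -s[1])):
        for layer in layers:
            # Layers are filled in start order, so only the last span can collide.
            if layer[-1][1] <= span[0]:
                layer.append(span)
                break
        else:
            layers.append([span])
    return layers


# "US Cyber Command warned about a malicious Microsoft Excel macro."
spans = [
    (0, 2, "Country"),        # US
    (0, 16, "Organisation"),  # US Cyber Command
    (32, 63, "Tag 1"),        # malicious Microsoft Excel macro
    (42, 57, "Tag 2"),        # Microsoft Excel
]
layers = split_into_layers(spans)
```

With this input the outer spans end up in one layer and the nested ones in a second layer. The management problem stays, though: the number of layers grows with the nesting depth, and each layer needs its own model or Doc.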

Yes, let's have a meeting next Monday at the usual time?

aaronkaplan / cti-llm

NER: missing README.md #11