[x] how to manage NLTK data on the GitHub Actions runner machine when tests are run?
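One option is to pre-download the data in a workflow step before pytest runs. A hedged sketch of such a step (the step name and the corpus list are assumptions about what the code actually needs):

```yaml
- name: Download NLTK data
  run: |
    pip install nltk
    # Fetch only the corpora the tests use; adjust the list as needed.
    python -m nltk.downloader punkt averaged_perceptron_tagger
```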
[ ] currently the filebuffer collects the content of a file before sending it for processing. It can detect when the filename has changed, but it keeps consuming the input_queue, so the first chunk of a new file may be getting lost with this approach.
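One way to avoid losing the first chunk is to flush the finished file when the filename changes and then keep the chunk that triggered the change as the start of the new buffer. A minimal sketch of that idea; `FileBuffer` here is a hypothetical stand-in, not the real class's API:

```python
class FileBuffer:
    """Accumulates chunks per file; flushes on filename change."""

    def __init__(self):
        self.current_file = None
        self.chunks = []

    def consume(self, item):
        """item is a (filename, chunk) tuple. Returns the completed
        (filename, content) pair when a new file begins, else None."""
        filename, chunk = item
        completed = None
        if self.current_file is not None and filename != self.current_file:
            # Flush the finished file *before* touching the new chunk...
            completed = (self.current_file, "".join(self.chunks))
            self.chunks = []
        self.current_file = filename
        # ...and keep the triggering chunk, so the new file's first
        # chunk is not dropped.
        self.chunks.append(chunk)
        return completed
```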
[x] right now the contextual embedding extraction class reduces the context dimension from 768 -> 10 using UMAP. The issue is that UMAP needs to be fitted before it can reduce dimensions; right now it takes the context embeddings of all the entities found, fits on them, and then reduces the dimension for each entity. We may have to try different approaches to find the best one. Maybe instead of all entities we let it build context from the embeddings of the entire sentence?
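The sentence-level idea above can be sketched as fit-on-sentences, transform-per-entity. PCA via NumPy SVD stands in for UMAP here just to keep the example dependency-free; the fit/transform flow is the same, and the function names are hypothetical:

```python
import numpy as np

def fit_reducer(sentence_embeddings, n_components=10):
    """Fit on sentence embeddings (n_sentences x 768) instead of the
    per-entity contexts, so sparse entity sets still give a stable fit."""
    mean = sentence_embeddings.mean(axis=0)
    centered = sentence_embeddings - mean
    # Right singular vectors give the projection matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def reduce_entity(context_vec, mean, components):
    """Project a single entity's 768-d context down to n_components."""
    return components @ (context_vec - mean)
```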
[ ] the extract_entities_from_sentence function of the ner_llm class currently uses the combine_entities_wordpiece func to account for the WordPiece tokenizer used by BERT models. Another func, combine_entities_byteencoding, has been defined to account for RoBERTa models, which use a byte-level BPE tokenizer. Need to modify extract_entities_from_sentence to detect the model type, or even better the tokenizer type, and then choose accordingly.
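A minimal sketch of dispatching on the tokenizer type rather than the model name; `pick_combiner` is a hypothetical helper, and the combine_* functions are the ones named in this TODO, passed in as plain callables:

```python
def pick_combiner(tokenizer, combine_wordpiece, combine_byteencoding):
    """Choose the entity-combining function from the tokenizer class.
    WordPiece tokenizers (BERT family) mark continuation pieces with
    '##'; byte-level BPE tokenizers (RoBERTa/GPT-2 family) mark word
    starts with 'Ġ', so they need different merging logic."""
    name = type(tokenizer).__name__.lower()
    if "roberta" in name or "gpt2" in name:
        return combine_byteencoding
    return combine_wordpiece
```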
[ ] Implement Universal_NER
[x] Outputs for now - entity pair + sentence in which the pair is found + sentence before + sentence after + summary of the entire document
[x] What happens if the number of entity pairs is very high? What measures can we take to find the important entity pairs?
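One candidate measure: rank entity pairs by pointwise mutual information over sentence co-occurrence counts, so frequent-but-uninformative pairs drop down the list. A hedged sketch, not the project's actual scoring; the inputs are toy counts:

```python
import math

def rank_pairs_by_pmi(pair_counts, entity_counts, total):
    """pair_counts: {(e1, e2): co-occurrence count}; entity_counts:
    {entity: occurrence count}; total: number of sentences.
    Returns pairs sorted by PMI = log(p(e1, e2) / (p(e1) * p(e2)))."""
    scored = []
    for (e1, e2), n in pair_counts.items():
        p_pair = n / total
        p1 = entity_counts[e1] / total
        p2 = entity_counts[e2] / total
        scored.append(((e1, e2), math.log(p_pair / (p1 * p2))))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)
```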