OHNLP / MedTator

A Serverless Text Annotation Tool for Corpus Development
https://ohnlp.github.io/MedTator/
Apache License 2.0

Annotation with any language #7

Closed glacierck closed 1 year ago

glacierck commented 2 years ago

The regular expression in dtd_parser restricts annotation to a limited character set. I extended it to accept other characters, and my tests worked fine. Why are these restrictions necessary?

hehuan2112 commented 2 years ago

Thank you so much for your feedback. The restrictions on the dtd file are mainly for compatibility with MAE, and with the DTD format itself (https://en.wikipedia.org/wiki/Document_type_definition). Our DTD parser only implements a minimal subset.

The dtd file defines the annotation schema, and its elements are used for creating XML tags. So, to avoid any encoding issues, we usually use ASCII characters in the dtd file. As far as we know, if characters from other languages are used in the dtd for values (e.g., list values, strings, etc.) rather than element names, they should work in the annotation XML files.
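For example, a minimal schema along these lines (the tag and attribute names are made up for illustration, not taken from an actual project) keeps the element names in ASCII while the suggested value list uses Cyrillic:

```dtd
<!-- Hypothetical schema sketch: ASCII element/attribute names, non-ASCII values. -->
<!ENTITY name "LAW_TASK">
<!ELEMENT LAW ( #PCDATA ) >
<!ATTLIST LAW type ( чл | ал | НК ) #IMPLIED >
```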

It would be great if you could share your sample annotation files and dtd schema here; then we can test and fix any issues.

varna9000 commented 2 years ago

Yes, @glacierck please share your fix. I have training texts in Cyrillic and the parser doesn't catch the annotations correctly. I noticed this when I tried to export in BIO format, e.g. the annotated term is чл. 78а ал. 1 от НК, but in the export I got:

чл  B-LAW
.   I-LAW

с   O

EDIT: Actually it might be just the BIO exporter. Other exporters capture the full term correctly. @hehuan2112 can you please advise?

glacierck commented 2 years ago

[screenshot of the modified dtd_parser regex] As shown in the figure, this modification lets me define DTDs with characters from other languages. @varna9000

This should be a global configuration that users can override themselves. @hehuan2112
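Since the screenshot doesn't come through in text form, here is a rough sketch of the general idea only (the pattern names are hypothetical, not the exact code in dtd_parser): widen the ASCII-only character class to Unicode letters.

```javascript
// Hypothetical sketch; the actual pattern in dtd_parser may differ.
// Original-style pattern: ASCII letters, digits, underscore, and hyphen only.
const ascii_name = /^[a-zA-Z_][a-zA-Z0-9_-]*$/;

// Relaxed pattern: any Unicode letter/digit, via \p{L}/\p{N} and the `u` flag.
const unicode_name = /^[\p{L}_][\p{L}\p{N}_-]*$/u;

console.log(ascii_name.test('закон'));    // false
console.log(unicode_name.test('закон'));  // true
```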

hehuan2112 commented 2 years ago

Thank you for your feedback @varna9000! Yes, I think this issue is caused by the default sentencizer (sentence tokenization) algorithm, which splits a document on ".". As the BIO/IOB2 format requires tokens and their contextual sentences, we need to find the sentences around the annotated tokens. If a sentence cannot be identified correctly, the converted results won't be correct.

In your case, I guess чл. and ал. behave like abbreviations such as Dr., Mr., or Mon.. The best fix, I think, would be to update the sentencizer algorithm to handle these cases. But as you know, there can be many corner cases. So what I suggest is a new feature that lets users customize a list of punctuation characters or words that mark sentence ends, or that indicate non-sentence ends.
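As a rough sketch (the exception list and splitting rule are illustrative, not MedTator's actual sentencizer), such a list could be applied like this:

```javascript
// Hypothetical sketch: a sentencizer that skips user-defined abbreviations
// that end with "." but do not end a sentence.
const NON_SENTENCE_ENDS = ['чл.', 'ал.', 'Dr.', 'Mr.', 'Mon.'];

function splitSentences(text) {
  const sentences = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    if ('.!?'.includes(text[i])) {
      // The token ending at this punctuation mark; skip known abbreviations.
      const token = text.slice(start, i + 1).split(/\s+/).pop();
      if (NON_SENTENCE_ENDS.includes(token)) continue;
      sentences.push(text.slice(start, i + 1).trim());
      start = i + 1;
    }
  }
  if (start < text.length) sentences.push(text.slice(start).trim());
  return sentences;
}

console.log(splitSentences('The court applied чл. 78а ал. 1 от НК. The case was closed.'));
// -> [ 'The court applied чл. 78а ал. 1 от НК.', 'The case was closed.' ]
```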

For example, we can add a config panel to input these words, or add them to the annotation schema. In fact, we also plan to upgrade the annotation schema to a JSON-based format, which is easier to modify and update.

Any suggestions?

hehuan2112 commented 2 years ago

@glacierck Thank you so much! I see your point. Yes, the current DTD regex parsing only supports a very limited character set in the schema, including the suggested value lists. That's a major limitation of the current schema format. A better solution, I think, is to upgrade the annotation schema to a JSON format; then we don't need to restrict the character range for element names or values.

Although we use the DTD format at present, MedTator internally works with a JSON object of the schema during annotation and other tasks, which is loaded and converted by the dtd_parser. So we could accept an annotation schema in JSON format directly, which would make it easier to define tag names and values in other languages.
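Just as a hypothetical illustration (not the internal object that dtd_parser produces today), a JSON schema could spell out non-ASCII tag names and value lists directly:

```json
{
  "name": "LAW_TASK",
  "tags": [
    {
      "name": "закон",
      "attrs": [
        { "name": "type", "values": ["чл", "ал", "НК"] }
      ]
    }
  ]
}
```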

Any comments?

varna9000 commented 2 years ago

@hehuan2112 Yes, a config with sentence-end exceptions would be great. Many languages have different exceptions. For example, SpaCy has an explicit tokenizer_exceptions.py file for every language which could be used, or, even better, just allow the user to supply a JSON or plain-text file (newline-delimited) with the sentence-end exceptions.

glacierck commented 2 years ago

Yes, using JSON format is a good solution, but would the workload be too heavy? If so, I suggest using YAML format, which stays easy to read and edit offline. @hehuan2112

hehuan2112 commented 2 years ago

@varna9000 Thank you for the example. I will check SpaCy's implementation and see how we can improve our algorithm.

hehuan2112 commented 2 years ago

@glacierck I agree, YAML is also a great solution and it is easy to edit and share. I plan to add it to the feature roadmap.

hehuan2112 commented 1 year ago

Sorry for my late reply. Last year we added YAML format support in the 1.3.0 release, and all our sample datasets now provide schemas in both DTD and YAML formats.
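The YAML schemas shipped with the sample datasets are the reference for the exact field names; purely as an illustration of the general shape, the JSON sketch above might translate to something like:

```yaml
# Illustrative only; check the sample datasets for MedTator's actual YAML fields.
name: LAW_TASK
tags:
  - name: закон
    attrs:
      - name: type
        values: [чл, ал, НК]
```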