LHNCBC / metamaplite

A near real-time named-entity recognizer
https://metamap.nlm.nih.gov/MetaMapLite.shtml
Other
55 stars 14 forks source link

Indexing of UMLS subset #1

Open afsluit opened 5 years ago

afsluit commented 5 years ago

Hi, I am following the Generating Tables section in the Design.md to create the index files for a custome subset of the UMLS. The CreateIndexes program runs without any errors (just the warnings about missing meshtcrelaxed.txt and vars.txt).

However, the irutils.MappedMultiKeyIndexLookup lookup command does not return any results when applied on the generated indexes. When applying the same command on the indices distributed with MetaMapLite, results are returned. When browsing the subset for which the indices are generated with Metamorphosys, the searched term is also found.

If there is any debug information I can provide, I'm more than happy to provide it. Thanks for your help!

afsluit commented 5 years ago

I suppose the problem are umlauts and other characters that need more than one byte to represent. When replacing every non-ASCII character while generating the indices (e.g. in MultiKeyIndex::loadTable) with an ASCII character (e.g. 'x'), the lookup works (obviously not for terms containing the replaced characters, but only for ones containing ascii chars).

willjrogers commented 5 years ago

I've created the branch utf (https://github.com/lhncbc/metamaplite/tree/utf) that supports producing and querying indexes using UTF-8. Almost all of the classes in irutils had to be updated to support UTF-8. The UTF-8 implementation supports the old ASCII indexes. Also added unit tests to check if the irutils library is producing indexes correctly.

What's missing?

  1. Currently no support for generating lexical variants for terms containing UTF-8 for the chunker.
  2. Term normalization needs to updated to support UTF-8 diacritics to ASCII (ö -> o) and possibly the reverse and conversion of some Greek characters in chemicals (α -> alpha).
  3. Some input readers and result formatters need to updated to support reading and writing UTF-8.
  4. A mechanism for querying the index on its encoding needs to be added.

Future, merging initial UTF-8 support with master and expanding support to allow use of both UTF-8 and UTF-16.

stevenbedrick commented 1 year ago

Hello! I have just run into this exact problem, and before filing my own issue thought to check the existing ones. I see that the utf branch is pretty far behind the main branch, has there been any more thought to making "mainstream" MetaMapLite support UTF-8?

For what it's worth, the current state of UTF-8 support on the main branch appears to be that the indexing code does work in that a correctly-formed index is produced and written to disk- the necessary places in the code are all charset-aware. The issue is with lookup, specifically MappedMultiKeyIndex.dictionaryBinarySearch(), line 434, which does not do Unicode-aware String comparison. Oddly, there is a commented-out line that would do a correct comparison, but the actual code that is being executed to do the comparison is looking at raw bytes and so is failing on UTF-8 that involves code points higher than U+007F.

stevenbedrick commented 1 year ago

Huh - on closer inspection, it looks like the line in question (which is 435, not 434, apologies) was actually added as part of commit da51d16e8a5a1869c3f021492719bcd149042b2e, which was about adding UTF-8 support to the indexing part of MML. The commented-out line on 432, which (to my eye and according to my tests) does the correct Unicode-aware string comparison, was commented out as part of this revision in favor of the byte-level comparison whose misbehavior is at the root of this issue.

So this makes me think that I must be misunderstanding the situation. @willjrogers , any thoughts?

willjrogers commented 1 year ago

I believe the document loaders in the package gov.nih.nlm.nls.metamap.document handle loading documents in UTF-8. Unfortunately, the result formatters (in gov.nih.nlm.nls.metamap.document) have not been updated to support UTF-8 except when called by the method MetaMapLite:listEntities which does open the PrintWriter using the UTF-8 charset. When looking up terms, the class irutils.MappedMultiKeyIndex does encode Strings to Bytes using “UTF-8” encoding by default.

Best, Willie Rogers

Willie Rogers, Contractor, Lister Hill Center, National Library of Medicine, NIH email: @.**@.> Rogers, Willie (NIH/NLM/LHC) [C]

From: Steven Bedrick @.> Date: Monday, May 1, 2023 at 8:10 PM To: lhncbc/metamaplite @.> Cc: Rogers, Willie (NIH/NLM/LHC) [C] @.>, Mention @.> Subject: [EXTERNAL] Re: [lhncbc/metamaplite] Indexing of UMLS subset (#1)

Huh - on closer inspection, it looks like the line in question (which is 435, not 434, apologies) was actually added as part of commit da51d16https://github.com/lhncbc/metamaplite/commit/da51d16e8a5a1869c3f021492719bcd149042b2e, which was about adding UTF-8 support to the indexing part of MML. The commented-out line on 432, which (to my eye and according to my tests) does the correct Unicode-aware string comparison, was commented out as part of this revision in favor of the byte-level comparison whose misbehavior is at the root of this issue.

So this makes me think that I must be misunderstanding the situation. @willjrogershttps://github.com/willjrogers , any thoughts?

— Reply to this email directly, view it on GitHubhttps://github.com/lhncbc/metamaplite/issues/1#issuecomment-1530635927, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AHOVZBMWNDJ3FFMGE7WKUR3XEBGGZANCNFSM4FQT26VQ. You are receiving this because you were mentioned.Message ID: @.***> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and are confident the content is safe.

stevenbedrick commented 1 year ago

So irutils.MappedMultiKeyIndex does use the UTF-8 byte representation for the lookup key, which works for lookup keys that involve non-ASCII characters, but this leads to the binary search implementation breaking for certain other lookup keys in the same index shard which is causing the problem that I have run into.

It took me quite a while to figure this out, but the short version of the story is that the binary search will fail for keys that both a) share a common UTF-8 byte length as another word in MRCONSO.RRF that involve non-ASCII characters, and b) share a common prefix with that word (that common prefix being the run of text up to the first non-ASCII character in the word).

As you may imagine, this made it rather a tricky bug to track down, as the lookups that were failing did not include any non-ASCII Unicode text, and test strings that did use non-ASCII Unicode characters were working just fine, so that threw me off for while. But at this point I have a minimal MRCONSO.RRF file that will demonstrate this bug, and can confirm that doing the comparison in dictionaryBinarySearch() at the String level rather than the UTF-8 byte level solves the problem. Is there any reason not to do the comparison at the String level?

stevenbedrick commented 1 year ago

Here might be another piece of the puzzle - I'm revisiting the code that controls how the various postings/term dictionary/etc. files are generated, and if I am understanding things correctly, everything depends on the fact that TreeMap allows for sorted iteration of keys. However, again, IIUC, that iteration happens according to the natural sort order as defined by the implementation of Comparable on the keys themselves - in this case, String's compareTo method. This would also explain why doing the comparison for binary search at the byte level (rather than at the string level) is failing - because the files are being created in string-sorted order not UTF-8-byte-sorted order.