LHNCBC / skr_web_python_api

SKR Web API: Python implementation

MetaMap API breaks when special characters (e.g. 'ß') occur in a word #8

Open KimBenjaminTang opened 1 year ago

KimBenjaminTang commented 1 year ago

Hello, I am trying to have MetaMap process some translated German texts, which include words containing the letter 'ß'.

After analyzing why the JSON output breaks, I found that the character 'ß' seems to cause an error when it is part of a word (not when it is a standalone character).

Example request:

from skr_web_api import Submission, METAMAP_INTERACTIVE_URL

# Placeholders: replace with your own account email and API key.
email = "user@example.org"
apikey = "YOUR-API-KEY"

args = "-AI -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase -Z 2022AA"
inst = Submission(email, apikey)
inst.init_mm_interactive('This is a test with Straße', args=args)
response = inst.submit()

When I decode the content of the response via response.content.decode(), I get a broken JSON string (it is never closed and appears to be cut off):

/dmzfiler/II_Group/MetaMap2020/public_mm/bin/SKRrun.20 /dmzfiler/II_Group/MetaMap2020/public_mm/bin/metamap20.BINARY.Linux --lexicon db -Z 2022AA --silent -AI -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase
{"AllDocuments":[
{
   "Document": {
     "CmdLine": {
       "Command": "metamap --lexicon db -Z 2022AA --silent -AI -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase",
       "Options": [
         {
           "OptName": "lexicon",
           "OptValue": "db"
         },
         {
           "OptName": "mm_data_year",
           "OptValue": "2022AA"
         },
         {
           "OptName": "silent"
         },
         {
           "OptName": "strict_model"
         },
         {
           "OptName": "show_cuis"
         },
         {
           "OptName": "restrict_to_sources",
           "OptValue": ["SNOMEDCT_US_2022_03_01"]
         },
         {
           "OptName": "JSONf",
           "OptValue": "2"
         },
         {
           "OptName": "mm_data_version",
           "OptValue": "USAbase"
         },
         {
           "OptName": "infile",
           "OptValue": "user_input"
         },
         {
           "OptName": "outfile",
           "OptValue": "user_output"
         }]
     },
     "AAs": [],
     "Negations": [],
     "Utterances": [
       {
         "PMID": "USER",
         "UttSection": "tx",
         "UttNum": "1",
         "UttText": [
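
A truncated payload like the one above can be detected before downstream code tries to use it. The following is a minimal sketch, not part of skr_web_api: the helper name parse_metamap_json is hypothetical, and it assumes the decoded response body consists of the echoed command line followed by a single JSON document, as in the output shown here.

```python
import json


def parse_metamap_json(raw: str):
    """Return the parsed JSON document, or None if it is missing or truncated.

    Assumption: everything before the first '{' is the echoed command line
    and can be skipped.
    """
    start = raw.find("{")
    if start == -1:
        return None  # no JSON document in the response at all
    try:
        return json.loads(raw[start:])
    except json.JSONDecodeError:
        return None  # the document is cut off or otherwise malformed
```

With the request from this issue, it could be called as parse_metamap_json(response.content.decode()); a None result then marks the input text as one that triggered the problem.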

A partial fix would be to replace the character 'ß' with 'ss', but I am not sure whether the results would be the same as with the online version of MetaMap, where words containing 'ß' are not a problem:

Request:

User Information: fu-sung.kim-benjamin.tang@rwth-aachen.de
Run Time: 12/06/2022 06:12:29

MetaMap Version Used: metamap20
MetaMap Options: -A+ -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase
Knowledge Source Used: 2022AA

Input Text:

This is a test with Straße --

Output:

{
   "Document": {
     "CmdLine": {
       "Command": "metamap --lexicon db -Z 2022AA -A+ -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase /usr/local/apache/htdocs/II/Scheduler/foo/inter_12062022_06:12:29_95743_fu-sung.kim-benjamin.tang@rwth-aachen.de_124752701.tmp /usr/local/apache/htdocs/II/Scheduler/foo/inter_12062022_06:12:29_95743_fu-sung.kim-benjamin.tang@rwth-aachen.de_124752701.out",
       "Options": [
         {
           "OptName": "lexicon",
           "OptValue": "db"
         },
         {
           "OptName": "mm_data_year",
           "OptValue": "2022AA"
         },
         {
           "OptName": "strict_model"
         },
         {
           "OptName": "bracketed_output"
         },
         {
           "OptName": "restrict_to_sources",
           "OptValue": ["SNOMEDCT_US_2022_03_01"]
         },
         {
           "OptName": "JSONf",
           "OptValue": "2"
         },
         {
           "OptName": "mm_data_version",
           "OptValue": "USAbase"
         },
         {
           "OptName": "infile",
           "OptValue": "/usr/local/apache/htdocs/II/Scheduler/foo/inter_12062022_06:12:29_95743_fu-sung.kim-benjamin.tang@rwth-aachen.de_124752701.tmp"
         },
         {
           "OptName": "outfile",
           "OptValue": "/usr/local/apache/htdocs/II/Scheduler/foo/inter_12062022_06:12:29_95743_fu-sung.kim-benjamin.tang@rwth-aachen.de_124752701.out"
         }]
     },
     "AAs": [],
     "Negations": [],
     "Utterances": [
       {
         "PMID": "inter_12062022_06:12:29_95743_fu-sung.kim-benjamin.tang@rwth-aachen.de_124752701.tmp",
         "UttSection": "tx",
         "UttNum": "1",
         "UttText": "This is a test with Straße",
         "UttStartPos": "0",
         "UttLength": "26",
         "Phrases": [
           {
             "PhraseText": "This",
             "SyntaxUnits": [
               {
                 "SyntaxType": "pron",
                 "LexMatch": "this",
                 "InputMatch": "This",
                 "LexCat": "pron",
                 "Tokens": ["this"]
               }],
             "PhraseStartPos": "0",
             "PhraseLength": "4",
             "Candidates": [],
             "Mappings": []
           },
           {
             "PhraseText": "is",
             "SyntaxUnits": [
               {
                 "SyntaxType": "aux",
                 "LexMatch": "is",
                 "InputMatch": "is",
                 "LexCat": "aux",
                 "Tokens": ["is"]
               }],
             "PhraseStartPos": "5",
             "PhraseLength": "2",
             "Candidates": [],
             "Mappings": []
           },
           {
             "PhraseText": "a test with Straße",
             "SyntaxUnits": [
               {
                 "SyntaxType": "det",
                 "LexMatch": "a",
                 "InputMatch": "a",
                 "LexCat": "det",
                 "Tokens": ["a"]
               },
               {
                 "SyntaxType": "head",
                 "LexMatch": "test",
                 "InputMatch": "test",
                 "LexCat": "noun",
                 "Tokens": ["test"]
               },
               {
                 "SyntaxType": "prep",
                 "LexMatch": "with",
                 "InputMatch": "with",
                 "LexCat": "prep",
                 "Tokens": ["with"]
               },
               {
                 "SyntaxType": "mod",
                 "InputMatch": "Straße",
                 "LexCat": "noun",
                 "Tokens": ["straße"]
               }],
             "PhraseStartPos": "8",
             "PhraseLength": "18",
             "Candidates": [],
             "Mappings": [
               {
                 "MappingScore": "-770",
                 "MappingCandidates": [
                   {
                     "CandidateScore": "-770",
                     "CandidateCUI": "C0022885",
                     "CandidateMatched": "Laboratory procedures",
                     "CandidatePreferred": "Laboratory Procedures",
                     "MatchedWords": ["test"],
                     "SemTypes": ["lbpr"],
                     "MatchMaps": [
                       {
                         "TextMatchStart": "2",
                         "TextMatchEnd": "2",
                         "ConcMatchStart": "1",
                         "ConcMatchEnd": "1",
                         "LexVariation": "0"
                       }],
                     "IsHead": "yes",
                     "IsOverMatch": "no",
                     "Sources": ["SNOMEDCT_US"],
                     "ConceptPIs": [
                       {
                         "StartPos": "10",
                         "Length": "4"
                       }],
                     "Status": "0",
                     "Negated": "0"
                   }]
               },
               {
                 "MappingScore": "-770",
                 "MappingCandidates": [
                   {
                     "CandidateScore": "-770",
                     "CandidateCUI": "C0392366",
                     "CandidateMatched": "Tests (qualifier value)",
                     "CandidatePreferred": "Tests (qualifier value)",
                     "MatchedWords": ["test"],
                     "SemTypes": ["inpr"],
                     "MatchMaps": [
                       {
                         "TextMatchStart": "2",
                         "TextMatchEnd": "2",
                         "ConcMatchStart": "1",
                         "ConcMatchEnd": "1",
                         "LexVariation": "0"
                       }],
                     "IsHead": "yes",
                     "IsOverMatch": "no",
                     "Sources": ["SNOMEDCT_US"],
                     "ConceptPIs": [
                       {
                         "StartPos": "10",
                         "Length": "4"
                       }],
                     "Status": "0",
                     "Negated": "0"
                   }]
               },
               {
                 "MappingScore": "-770",
                 "MappingCandidates": [
                   {
                     "CandidateScore": "-770",
                     "CandidateCUI": "C0456984",
                     "CandidateMatched": "Test finding",
                     "CandidatePreferred": "Test Result",
                     "MatchedWords": ["test"],
                     "SemTypes": ["lbtr"],
                     "MatchMaps": [
                       {
                         "TextMatchStart": "2",
                         "TextMatchEnd": "2",
                         "ConcMatchStart": "1",
                         "ConcMatchEnd": "1",
                         "LexVariation": "0"
                       }],
                     "IsHead": "yes",
                     "IsOverMatch": "no",
                     "Sources": ["SNOMEDCT_US"],
                     "ConceptPIs": [
                       {
                         "StartPos": "10",
                         "Length": "4"
                       }],
                     "Status": "0",
                     "Negated": "0"
                   }]
               }]
           }]
       }]
   }
 }
]}

Can this be fixed by adjusting the MetaMap API to match the procedure of the MetaMap Online version?

KimBenjaminTang commented 1 year ago

The same applies to other special characters, such as ü, ö, ä.

I don't know exactly how the strings are processed, but "Croé T" breaks it too, while "Croé" and "Croe T" both pass.


example_text = """Croé T"""
args = "-AI -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase -Z 2022AA"
inst = Submission(email, apikey)
inst.init_mm_interactive(example_text, args=args)
response = inst.submit()

"Breaking" here means that the JSON is incomplete and ends at "UttText": [

So this can also be worked around by removing the "é", but in some cases that may lose valuable information.
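
As a stopgap, the replacement workaround discussed in this thread can be applied to the whole input before submission. The sketch below is only an illustration under assumptions: the helper name to_ascii and the EXPLICIT table are hypothetical, and it assumes the breakage is triggered by non-ASCII characters such as the ones reported here. It strips diacritics (so 'ü' becomes 'u', not the German 'ue') and drops anything it cannot map, so it is lossy.

```python
import unicodedata

# 'ß' has no Unicode decomposition, so it needs an explicit replacement.
EXPLICIT = {"ß": "ss", "ẞ": "SS"}


def to_ascii(text: str) -> str:
    """Fold text to plain ASCII (lossy)."""
    replaced = "".join(EXPLICIT.get(ch, ch) for ch in text)
    # NFKD splits 'é' into 'e' plus a combining accent and maps '²' to '2'.
    decomposed = unicodedata.normalize("NFKD", replaced)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop any remaining characters that have no ASCII form.
    return stripped.encode("ascii", "ignore").decode("ascii")
```

Because 'ß' becomes two characters, positions reported by MetaMap (for example StartPos and UttLength) refer to the folded text and can differ from offsets in the original string.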

KimBenjaminTang commented 1 year ago

It also breaks with the string "m² T", because the character ² is followed by another character/word. If ² comes at the end of the string, followed by nothing but whitespace, the text is processed: