Gravitate-Health / mvp-issues

Gateway for issues/discussions/comments regarding MVPs development.

[AI] Automatic preprocessing of ePIs #34

Open joofio opened 7 months ago

joofio commented 7 months ago

"dumb" preprocessor - done @aalonsolopez can you help here? link?
"smarter" preprocessor - start to tackle this

aalonsolopez commented 7 months ago

So this is the first draft of the "dumb" automatic preprocessor. It uses a tree search algorithm to look for specific texts (terminology terms), which makes it much faster than the planned AI version. It has been deployed on the dev server for months, so you can use it now. You can see the available preprocessor here
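
For illustration only: the thread does not say which tree structure the preprocessor uses, so the sketch below assumes a word-level trie, one common way to look up many terminology terms in a single pass over the text. The class names and the codes `example-pregnancy` and `example-breastfeeding` are hypothetical placeholders, not real terminology codes.

```python
from typing import Dict, List, Optional, Tuple


class _Node:
    def __init__(self) -> None:
        self.children: Dict[str, "_Node"] = {}
        self.code: Optional[str] = None  # set only where a complete term ends


class TermTrie:
    """Word-level trie mapping (multi-word) terms to a code. Hypothetical sketch."""

    def __init__(self) -> None:
        self.root = _Node()

    def add(self, term: str, code: str) -> None:
        node = self.root
        for word in term.lower().split():
            node = node.children.setdefault(word, _Node())
        node.code = code

    def find_all(self, text: str) -> List[Tuple[str, str]]:
        # Naive tokenisation: split on whitespace, strip common punctuation.
        words = [w.strip(".,;:()").lower() for w in text.split()]
        matches = []
        for start in range(len(words)):
            node = self.root
            for end in range(start, len(words)):
                node = node.children.get(words[end])
                if node is None:
                    break  # no term continues with this word
                if node.code is not None:
                    matches.append((" ".join(words[start:end + 1]), node.code))
        return matches


trie = TermTrie()
trie.add("pregnancy", "example-pregnancy")            # placeholder code
trie.add("breast feeding", "example-breastfeeding")   # placeholder code
print(trie.find_all("Do not use during pregnancy or breast feeding."))
```

Each lookup walks the trie word by word, so the cost depends on the length of the text and of the longest term, not on how many terms are loaded.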

joofio commented 7 months ago

so, if i am not mistaken, this should work?

### preprocessing dumb

POST https://gravitate-health.lst.tfo.upm.es/focusing/focus/bundlepackageleaflet-es-da0fc2395ce219262dfd4f0c9a9f72e1?preprocessors=preprocessing-service-mvp2&lenses=lens-selector-mvp2_pregnancy&patientIdentifier=alicia-1

returning

HTTP/1.1 503 Service Unavailable
content-length: 145
content-type: text/plain
date: Mon, 15 Apr 2024 09:05:00 GMT
server: istio-envoy
x-envoy-upstream-service-time: 7229
connection: close

upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111

aalonsolopez commented 7 months ago

This is kinda weird, let me check

aalonsolopez commented 7 months ago

Sorry, I'm looking into this now

aalonsolopez commented 7 months ago

@joofio For testing purposes, it's better to use:

POST https://gravitate-health.lst.tfo.upm.es/focusing/preprocessing/bundlepackageleaflet-es-da0fc2395ce219262dfd4f0c9a9f72e1?preprocessors=preprocessing-service-mvp2
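
A small sketch of how the two requests in this thread could be built from a script, using only the Python standard library. The base URL, paths and query parameters are copied from the requests above; the helper names `build_url` and `post` are made up for this example, and nothing here is an official client.

```python
import urllib.request
from urllib.parse import urlencode

BASE = "https://gravitate-health.lst.tfo.upm.es/focusing"
BUNDLE = "bundlepackageleaflet-es-da0fc2395ce219262dfd4f0c9a9f72e1"


def build_url(endpoint: str, bundle_id: str, **params: str) -> str:
    """Build a focusing URL; `endpoint` is "focus" or "preprocessing"."""
    return f"{BASE}/{endpoint}/{bundle_id}?{urlencode(params)}"


def post(url: str, accept: str = "application/json"):
    """Send an empty POST. urlopen raises HTTPError for 4xx/5xx such as the 503 above."""
    request = urllib.request.Request(url, method="POST", headers={"Accept": accept})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status, dict(response.headers), response.read()


# Preprocessing only (the endpoint suggested for testing)
preprocess_url = build_url(
    "preprocessing", BUNDLE, preprocessors="preprocessing-service-mvp2"
)

# Full focusing request (preprocessor + lens + patient)
focus_url = build_url(
    "focus",
    BUNDLE,
    preprocessors="preprocessing-service-mvp2",
    lenses="lens-selector-mvp2_pregnancy",
    patientIdentifier="alicia-1",
)
print(preprocess_url)
print(focus_url)
```

Calling `post(preprocess_url)` would perform the actual request; it is left out of the example so the snippet runs without network access.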

joofio commented 7 months ago

This works for every raw ePI, right? And it doesn't work with "Accept: application/json", correct?

aalonsolopez commented 6 months ago

It's working now as expected, with extensions included.

aalonsolopez commented 6 months ago

POST https://gravitate-health.lst.tfo.upm.es/focusing/preprocessing/bundlepackageleaflet-es-da0fc2395ce219262dfd4f0c9a9f72e1?preprocessors=preprocessing-service-mvp2

Same endpoint

joofio commented 6 months ago

So I still have some questions:

1. I tried with a raw ePI (bundlepackageleafletxyntha) and it returns gh-focusing-warnings: {"preprocessingWarnings":[{"serviceName":"preprocessing-service-mvp2","error":"Preprocessed version of ePI could not be handled by preprocessor."}],"lensesWarnings":[]}
2. What categories is the preprocessor applying? I see a lot of codes for pregnancy, but just that (among the ones I could test).
3. When the error occurs, is the "Accept:" header taken into account? I removed it and still received JSON.
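
For anyone scripting these tests: the gh-focusing-warnings value quoted above is plain JSON, so it can be checked programmatically. A minimal sketch, assuming the header always has the shape shown in this thread; the function name `preprocessing_errors` is made up for the example.

```python
import json
from typing import List, Tuple


def preprocessing_errors(header_value: str) -> List[Tuple[str, str]]:
    """Return (serviceName, error) pairs from a gh-focusing-warnings value."""
    warnings = json.loads(header_value)
    return [
        (item.get("serviceName"), item.get("error"))
        for item in warnings.get("preprocessingWarnings", [])
    ]


# Header value copied from the response quoted above
header = (
    '{"preprocessingWarnings":[{"serviceName":"preprocessing-service-mvp2",'
    '"error":"Preprocessed version of ePI could not be handled by preprocessor."}],'
    '"lensesWarnings":[]}'
)
for service, error in preprocessing_errors(header):
    print(f"{service}: {error}")
```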

aalonsolopez commented 6 months ago

Answers:

1. I'll check it and get back with an answer ASAP.
2. A (very) short version of some SNOMED codes (https://github.com/Gravitate-Health/terminology-service/blob/testing-simplified-terminologies/controllers/db/Simplification.csv)
3. No, the Accept header is only checked if everything goes well.

aalonsolopez commented 6 months ago

PS: preprocessing with bundlepackageleafletxyntha works for me

aalonsolopez commented 6 months ago

(Sorry, I closed this by mistake.)

joofio commented 4 months ago

New list of requirements for this, based on what was discussed today (4/7/2024):

use the base words/codes already built into the preprocessor, but enlarge and improve them (to be done by TS and me)
check the codes against the terminology server for translations and synonyms
possibly check other places and formats for other synonyms
check for the resulting words, expressions and acronyms in the text (regex, text distance, etc.)
tag the whole sentence, paragraph and/or section
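
To make the last two points concrete, here is a rough sketch of matching by text distance and then tagging the whole sentence. It assumes single-word terms and uses `difflib` from the Python standard library as the similarity measure; the function name, the 0.85 cutoff and the naive sentence splitter are arbitrary choices for the example, not anything agreed in the discussion.

```python
import difflib
import re
from typing import List, Tuple


def tag_sentences(
    text: str, terms: List[str], cutoff: float = 0.85
) -> List[Tuple[str, List[str]]]:
    """Return (sentence, matched_terms) for every sentence with a close match."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tagged = []
    for sentence in sentences:
        tokens = re.findall(r"[^\W\d_]+", sentence.lower())  # letters only
        hits = set()
        for token in tokens:
            # Closest known term for this token, if similar enough.
            hits.update(difflib.get_close_matches(token, terms, n=1, cutoff=cutoff))
        if hits:
            tagged.append((sentence, sorted(hits)))
    return tagged


leaflet = "Tell your doctor if you are pregnent. Take one tablet daily."
print(tag_sentences(leaflet, ["pregnant", "pregnancy"]))
```

Multi-word expressions and acronyms would need a different tokenisation, and the cutoff trades missed typos against false positives.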

From this, I can envision the following list of requirements, in order of importance:

check all words/codes in the text
produce a compliant preprocessed ePI

Note: equal concepts can be stored inside the same CodeableConcept, so everything related to pregnancy is attached to a single class name and one or more codes.
check performance (use a cache, or whatever is needed to make the preprocessor fast enough, roughly under 5 s)
check the terminology server for synonyms and translations
use the codes stated in the CSV on demand and live (or mostly live): when I update the CSV, the preprocessor takes the changes into account
use another method for synonyms and acronyms
use a method for deciding whether a sentence, paragraph or section should be highlighted
logs
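
One possible way to combine the "live CSV" and "cache" requirements, sketched under assumptions: the CSV sits on local disk, and the parsed rows are cached until the file's modification time or size changes. The class name `CsvTermStore` and the column names in the usage note are hypothetical; the real Simplification.csv layout is not described in this thread.

```python
import csv
import os
from typing import Dict, List, Optional, Tuple


class CsvTermStore:
    """Cache parsed CSV rows and reload them only when the file changes."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._signature: Optional[Tuple[int, int]] = None
        self._rows: List[Dict[str, str]] = []

    def rows(self) -> List[Dict[str, str]]:
        stat = os.stat(self.path)
        signature = (stat.st_mtime_ns, stat.st_size)
        if signature != self._signature:
            # File is new or has changed since the last read: parse it again.
            with open(self.path, newline="", encoding="utf-8") as handle:
                self._rows = list(csv.DictReader(handle))
            self._signature = signature
        return self._rows
```

With a file whose header row is, say, `code,display` (an assumed layout), `CsvTermStore("terms.csv").rows()` returns one dict per row and re-reads the file only after it has been edited, so the per-request cost is a single `os.stat` call.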

@aalonsolopez @amedranogil anything I might have forgotten? I haven't tried the current preprocessor yet, but will do so ASAP and update this if needed.

amedranogil commented 4 months ago

we can turn this list into Issues in the preprocessor repo, so we can track the progress and discuss each point.

joofio commented 4 months ago

i wanted to test the current preproc before that. give me a day or so

joofio commented 4 months ago

So I tested the current preprocessor. It falls short in the terms it finds (and some terms don't seem useful, like "Possible"?), and it adds a lot of entries for the same concept. Example:

            {
              "url": "elementClass",
              "valueString": "Pregnancy"
            },
            {
              "url": "concept",
              "valueCodeableReference": {
                "concept": {
                  "coding": [
                    {
                      "system": "http://snomed.info/sct",
                      "code": "11082009",
                      "display": "Pregnancy"
                    }
                  ]
                }
              }
            }
          ],
          [
            {
              "url": "elementClass",
              "valueString": "Pregnancy"
            },
            {
              "url": "concept",
              "valueCodeableReference": {
                "concept": {
                  "coding": [
                    {
                      "system": "http://snomed.info/sct",
                      "code": "416413003",
                      "display": "Pregnancy"
                    }
                  ]
                }
              }
            }
          ],

This not only creates a ton of separate extensions for no reason, but the display also doesn't match the one in the code system. In FHIR, CodeableConcept.coding has cardinality 0..*, which means it can hold several codes for the same concept.

So, taking the example above, it should look like

            {
              "url": "elementClass",
              "valueString": "Pregnancy"
            },
            {
              "url": "concept",
              "valueCodeableReference": {
                "concept": {
                  "coding": [
                    {
                      "system": "http://snomed.info/sct",
                      "code": "11082009",
                      "display": "Pregnancy"
                    },
                    {
                      "system": "http://snomed.info/sct",
                      "code": "416413003",
                      "display": "Pregnancy"
                    }
                  ]
                }
              }
            }
          ],

And this assumes they are the same concept. In the case above, for example, code 11082009 is "abnormal pregnancy" and not "pregnancy" (which is quite different): https://www.findacode.com/snomed/11082009--abnormal-pregnancy.html Also, 416413003 is "Advanced maternal age gravida" (which is better, but still not OK): https://www.findacode.com/snomed/416413003--advanced-maternal-age-gravida.html

So, for starters, we need to correct the codes and adopt the idea that equal concepts can be stored inside the same CodeableConcept.
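
As a sketch of the merging idea (not the preprocessor's actual code): the function below groups extension entries by their elementClass and collects all codings into one concept, dropping duplicate system/code pairs. It assumes every group has exactly the two sub-extensions shown in the excerpts above, and the function name is made up. Grouping by elementClass is only valid once the codes really denote the same concept, which is the first thing to fix.

```python
from typing import Dict, List


def merge_by_element_class(groups: List[List[dict]]) -> List[List[dict]]:
    """Merge extension groups sharing an elementClass into one group each."""
    merged: Dict[str, List[dict]] = {}
    for group in groups:
        fields = {item["url"]: item for item in group}
        element_class = fields["elementClass"]["valueString"]
        codings = fields["concept"]["valueCodeableReference"]["concept"]["coding"]
        target = merged.setdefault(element_class, [])
        seen = {(c["system"], c["code"]) for c in target}
        for coding in codings:
            key = (coding["system"], coding["code"])
            if key not in seen:  # skip codes already attached to this class
                seen.add(key)
                target.append(coding)
    return [
        [
            {"url": "elementClass", "valueString": element_class},
            {
                "url": "concept",
                "valueCodeableReference": {"concept": {"coding": codings}},
            },
        ]
        for element_class, codings in merged.items()
    ]
```

Fed the two "Pregnancy" groups from the first excerpt, this returns a single group whose coding list holds both codes, matching the shape of the second excerpt.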

Other minor stuff:

"code": "1,07605810001191E+016", (this code looks like it was mangled into scientific notation)
"coding": [ { "system": "http://snomed.info/sct", "code": "60001007", "display": "Not" might not be very useful on its own.

joofio commented 4 months ago

todos from my previous point:

1. Reduce the number of codes related to pregnancy (and others) to just 1 (2 or 3 max), and enlarge the number of words it looks for (pregnancy, pregnancies, pregnant, etc.); the original description of the code may help with that.
   a) I would prefer to have a small number of codes, but tag a lot of usable things.
2. Enlarge the code base and make it core: I want to be able to change the data there and have the preprocessor change its behaviour ASAP (how quickly depends on how long it takes to develop).
   a) Here I want to add several code systems (like SNOMED, GSRS, ICPC-2, etc.).
3. Have a terminology to enlarge codes, translations and relationships between codes.
   a) Related to 1., I would like to have a key concept with one code (like pregnancy), which would then expand into another set of codes, texts and words to look for (feasible? achievable performance-wise?).
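
To make the "key concept with one code, many words" idea a bit more tangible, here is a rough sketch. The `Concept` class, the `find_concepts` helper and the code `example-code` are all hypothetical; in practice the term list would come from the terminology server rather than being typed in by hand.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Concept:
    name: str
    code: str  # the single key code for this concept
    terms: List[str] = field(default_factory=list)  # surface forms to look for

    def pattern(self) -> "re.Pattern[str]":
        # Longest alternatives first, so "pregnancies" wins over shorter forms.
        alternatives = sorted((re.escape(t) for t in self.terms), key=len, reverse=True)
        return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def find_concepts(text: str, concepts: List[Concept]) -> Dict[str, List[str]]:
    """Map concept name -> surface forms found in the text."""
    found = {}
    for concept in concepts:
        if not concept.terms:
            continue  # nothing to search for
        hits = [m.group(0) for m in concept.pattern().finditer(text)]
        if hits:
            found[concept.name] = hits
    return found


pregnancy = Concept("Pregnancy", "example-code", ["pregnancy", "pregnancies", "pregnant"])
print(find_concepts("If you are pregnant or planning a pregnancy, ask your doctor.", [pregnancy]))
```

Compiling one regex per concept keeps the tag count low (one code per concept) while the word list can grow freely; for very large term lists the pattern could be compiled once and cached.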

Helpful, @aalonsolopez? Let me know.