averbis / averbis-python-api

Conveniently access the REST API of Averbis products using Python
Apache License 2.0

HTML analysis broken with AHD 6.18.1? (NoClassDefFound) #146

Open makampf opened 1 year ago

makampf commented 1 year ago

Describe the bug

Calling pipeline.analyse_html() from the client side fails with the error below. (Endpoint used: health-discovery/rest/v1/textanalysis/projects/test/pipelines/discharge/analyseHtml)

Error message

Server side log:

2023-07-07T15:25:40,928 ERROR [qtp1416806921-76] [d.a.i.t.l.TextAnalysisLeadServiceImpl]  Callable threw exception!
2023-07-07 15:25:40 java.lang.NoClassDefFoundError: Could not initialize class org.htmlparser.Parser
2023-07-07 15:25:40     at org.apache.uima.ruta.engine.HtmlAnnotator.process(HtmlAnnotator.java:72)
2023-07-07 15:25:40     at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:50)
2023-07-07 15:25:40     at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.lambda$callProcessMethod$3(AnalysisEngineImplBase.java:669)
2023-07-07 15:25:40     at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.withContexts(AnalysisEngineImplBase.java:688)
2023-07-07 15:25:40     at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.callProcessMethod(AnalysisEngineImplBase.java:668)
2023-07-07 15:25:40     at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:387)
2023-07-07 15:25:40     at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:299)
2023-07-07 15:25:40     at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:590)
2023-07-07 15:25:40     at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:422)
2023-07-07 15:25:40     at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:352)
2023-07-07 15:25:40     at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:276)
2023-07-07 15:25:40     at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:295)
2023-07-07 15:25:40     at de.averbis.integration.textanalysis.processor.util.HtmlTagsAnnotationUtil.convertHtmlToText(HtmlTagsAnnotationUtil.java:33)
2023-07-07 15:25:40     at de.averbis.integration.textanalysis.processor.service.CasPreparationService.prepareCasForHtmlProcessing(CasPreparationService.java:81)
2023-07-07 15:25:40     at de.averbis.integration.textanalysis.processor.PipelineRunnerImpl.lambda$processHtml$5(PipelineRunnerImpl.java:188)
2023-07-07 15:25:40     at de.averbis.integration.textanalysis.processor.PipelineRunnerImpl.processWithPipeline(PipelineRunnerImpl.java:262)
2023-07-07 15:25:40     at de.averbis.integration.textanalysis.processor.PipelineRunnerImpl.lambda$processHtml$6(PipelineRunnerImpl.java:189)
2023-07-07 15:25:40     at de.averbis.integration.textanalysis.processor.PipelineRunnerImpl.process(PipelineRunnerImpl.java:246)
2023-07-07 15:25:40     at de.averbis.integration.textanalysis.processor.PipelineRunnerImpl.processHtml(PipelineRunnerImpl.java:189)
2023-07-07 15:25:40     at de.averbis.integration.textanalysis.leadservice.TextAnalysisLeadServiceImpl.lambda$createProcessHtmlCallable$2(TextAnalysisLeadServiceImpl.java:128)
2023-07-07 15:25:40     at de.averbis.integration.textanalysis.leadservice.TextAnalysisLeadServiceImpl.handleResult(TextAnalysisLeadServiceImpl.java:195)
2023-07-07 15:25:40     at de.averbis.integration.textanalysis.leadservice.TextAnalysisLeadServiceImpl.lambda$createProcessHtmlCallable$3(TextAnalysisLeadServiceImpl.java:136)
2023-07-07 15:25:40     at de.averbis.integration.textanalysis.service.util.PriorityCallable.call(PriorityCallable.java:40)
2023-07-07 15:25:40     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2023-07-07 15:25:40     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2023-07-07 15:25:40     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2023-07-07 15:25:40     at java.lang.Thread.run(Thread.java:750)
2023-07-07 15:25:49 2023-07-07T15:25:49,061 ERROR [qtp1416806921-97] [d.a.i.t.l.TextAnalysisLeadServiceImpl]  Callable threw exception!
2023-07-07 15:25:49 java.lang.NoClassDefFoundError: Could not initialize class org.htmlparser.Parser
2023-07-07 15:25:49     at org.apache.uima.ruta.engine.HtmlAnnotator.process(HtmlAnnotator.java:72)
2023-07-07 15:25:49     at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:50)
2023-07-07 15:25:49     at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.lambda$callProcessMethod$3(AnalysisEngineImplBase.java:669)
2023-07-07 15:25:49     at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.withContexts(AnalysisEngineImplBase.java:688)
2023-07-07 15:25:49     at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.callProcessMethod(AnalysisEngineImplBase.java:668)
2023-07-07 15:25:49     at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:387)
2023-07-07 15:25:49     at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:299)
2023-07-07 15:25:49     at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:590)
2023-07-07 15:25:49     at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:422)
2023-07-07 15:25:49     at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:352)
2023-07-07 15:25:49     at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:276)
2023-07-07 15:25:49     at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:295)
2023-07-07 15:25:49     at de.averbis.integration.textanalysis.processor.util.HtmlTagsAnnotationUtil.convertHtmlToText(HtmlTagsAnnotationUtil.java:33)
2023-07-07 15:25:49     at de.averbis.integration.textanalysis.processor.service.CasPreparationService.prepareCasForHtmlProcessing(CasPreparationService.java:81)
2023-07-07 15:25:49     at de.averbis.integration.textanalysis.processor.PipelineRunnerImpl.lambda$processHtml$5(PipelineRunnerImpl.java:188)
2023-07-07 15:25:49     at de.averbis.integration.textanalysis.processor.PipelineRunnerImpl.processWithPipeline(PipelineRunnerImpl.java:262)
2023-07-07 15:25:49     at de.averbis.integration.textanalysis.processor.PipelineRunnerImpl.lambda$processHtml$6(PipelineRunnerImpl.java:189)
2023-07-07 15:25:49     at de.averbis.integration.textanalysis.processor.PipelineRunnerImpl.process(PipelineRunnerImpl.java:246)
2023-07-07 15:25:49     at de.averbis.integration.textanalysis.processor.PipelineRunnerImpl.processHtml(PipelineRunnerImpl.java:189)
2023-07-07 15:25:49     at de.averbis.integration.textanalysis.leadservice.TextAnalysisLeadServiceImpl.lambda$createProcessHtmlCallable$2(TextAnalysisLeadServiceImpl.java:128)
2023-07-07 15:25:49     at de.averbis.integration.textanalysis.leadservice.TextAnalysisLeadServiceImpl.handleResult(TextAnalysisLeadServiceImpl.java:195)
2023-07-07 15:25:49     at de.averbis.integration.textanalysis.leadservice.TextAnalysisLeadServiceImpl.lambda$createProcessHtmlCallable$3(TextAnalysisLeadServiceImpl.java:136)
2023-07-07 15:25:49     at de.averbis.integration.textanalysis.service.util.PriorityCallable.call(PriorityCallable.java:40)
2023-07-07 15:25:49     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2023-07-07 15:25:49     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2023-07-07 15:25:49     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2023-07-07 15:25:49     at java.lang.Thread.run(Thread.java:750)

Client side log:

"500 Server Error: 'Server Error' for url: 'http://health-discovery-hd:8080/health-discovery/rest/v1/textanalysis/projects/test/pipelines/discharge/analyseHtml?annotationTypes=de.averbis.types.health.Diagnosis%2Cde.averbis.types.health.Medication%2Cde.averbis.types.health.DocumentAnnotation%2Cde.medunifreiburg.imbi.mds.extraction.types.Smoking%2Cde.uklfr.KidneyStoneAnnotator.KidneyStoneInfo&language=de'.\nEndpoint error message is: 'The text analysis finished with an error, the reason is 'Could not initialize class org.htmlparser.Parser'.'") document_id='DocumentReference/DOC-NUM-1234567' logger='ahd2fhir.utils.resource_handler' exception='Traceback (most recent call last):\n  File "/app/ahd2fhir/utils/resource_handler.py", line 374, in _perform_text_analysis\n    return self.pipeline.analyse_html(text, **analyse_args)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/local/lib/python3.11/site-packages/averbis/core/_rest_client.py", line 392, in analyse_html\n    return self.project.client._analyse_html(\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/local/lib/python3.11/site-packages/averbis/core/_rest_client.py", line 2387, in _analyse_html\n    response = self.__request_with_json_response(\n               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/local/lib/python3.11/site-packages/averbis/core/_rest_client.py", line 1775, in __request_with_json_response\n    self.__handle_error(raw_response)\n  File "/usr/local/lib/python3.11/site-packages/averbis/core/_rest_client.py", line 2957, in __handle_error\n    raise RequestException(error_msg)\nrequests.exceptions.RequestException: 500 Server Error: \'Server Error\' for url: \'http://health-discovery-hd:8080/health-discovery/rest/v1/textanalysis/projects/test/pipelines/discharge/analyseHtml?annotationTypes=de.averbis.types.health.Diagnosis%2Cde.averbis.types.health.Medication%2Cde.averbis.types.health.DocumentAnnotation%2Cde.medunifreiburg.imbi.mds.extraction.types.Smoking%2Cde.uklfr.KidneyStoneAnnotator.KidneyStoneInfo&language=de\'.\nEndpoint error message is: \'The text analysis finished with an error, the reason is \'Could not initialize class org.htmlparser.Parser\'.\''


UWinch commented 1 year ago

Hi, thanks for reporting this problem. We will look into it and get back to you as soon as possible.

dbuerkle-averbis commented 1 year ago

If you haven't already done so, you can always parse the HTML yourself and send the relevant text to pipeline.analyse_text(...). I hope this is a feasible workaround for you.
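A minimal sketch of that workaround, assuming the pipeline object accepts plain text via `analyse_text(text, **kwargs)`. The helper names `html_to_text` and `analyse_html_as_text` are hypothetical and not part of averbis-python-api. The extraction uses only the standard library's `html.parser`; a dedicated HTML library may handle malformed markup better.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    _SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def analyse_html_as_text(pipeline, html, **analyse_args):
    """Hypothetical helper: strip the markup client-side, then analyse as text.

    Assumption: `pipeline` exposes analyse_text(text, **kwargs).
    """
    return pipeline.analyse_text(html_to_text(html), **analyse_args)
```

Any annotation offsets returned this way would presumably refer to the extracted text rather than to the original HTML, so they cannot be mapped back onto the markup without extra bookkeeping.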

chgl commented 11 months ago

Simply using BeautifulSoup as a workaround did work to some extent: https://github.com/miracum/ahd2fhir/blob/master/ahd2fhir/utils/resource_handler.py#L375 but we are now running into the following error:

The number of characters [XXXXX] exceeded the configured threshold of [100000]

This isn't entirely related to the issue at hand, but just in case you happen to know where to change that setting :).

As a side note, the above error results in a 500 HTTP status code. I think one of the 4xx status codes would be more appropriate (and would allow us to simply discard the attempt instead of retrying).
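As long as the server answers with a 500 here, a client could decide whether to retry by inspecting the error message instead of the status code alone. The sketch below is one possible approach, not part of averbis-python-api. The function names are hypothetical. The message format ("500 Server Error: ...") is taken from the client log in this issue, and the "Client Error" variant for 4xx responses is an assumption.

```python
import re

# Message fragments that indicate a problem with the input itself, so that
# retrying the same request cannot succeed. The fragment is taken from the
# threshold error quoted in this thread.
NON_RETRYABLE_PATTERNS = (
    re.compile(r"exceeded the configured threshold"),
)

# Assumption: error messages start with "<status> Server Error" or
# "<status> Client Error", optionally preceded by a double quote.
_STATUS_RE = re.compile(r'^\s*"?(\d{3}) (?:Server|Client) Error')


def status_code_from_message(message):
    """Extract the HTTP status code from the error message, or None."""
    match = _STATUS_RE.match(message)
    return int(match.group(1)) if match else None


def should_retry(message):
    """Return True if repeating the request might succeed."""
    if any(p.search(message) for p in NON_RETRYABLE_PATTERNS):
        return False  # input-related error reported as 500: discard
    status = status_code_from_message(message)
    if status is None:
        return True  # unknown format: treat as transient
    return status >= 500  # 4xx errors are not retried
```

Other deterministic server-side failures, such as the NoClassDefFoundError reported at the top of this issue, would need their own entries in `NON_RETRYABLE_PATTERNS`.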