aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

Repeated data in medical-insights-entities.csv and medical-insights-phi.json #23

Closed (crashlurks closed 7 months ago)

crashlurks commented 3 years ago

I am planning to use Comprehend Medical in production in a new biomedical research product we are working on. I used Textractor to process a 1143-page PDF of a single patient's medical records, and it worked. Amazing!

However, I noticed that the entire set of extracted entities is repeated in the {}-medical-insights-entities.csv file that is generated for each page of the PDF. The file contents look like this ...

Text,Type,Category,Score,BeginOffset,EndOffset
"Some random person",NAME,PROTECTED_HEALTH_INFORMATION,0.9943187832832336,0,18
68,AGE,PROTECTED_HEALTH_INFORMATION,0.22106069326400757,26,28
Penn Medicine,ADDRESS,PROTECTED_HEALTH_INFORMATION,0.1475761979818344,37,50
11/18/2015,DATE,PROTECTED_HEALTH_INFORMATION,0.9999758005142212,51,61
917356281,ID,PROTECTED_HEALTH_INFORMATION,0.9997710585594177,163,172
MYELOMA,DX_NAME,MEDICAL_CONDITION,0.8934065699577332,173,180

... there are lots of valid entries and then the entire set repeats.

"Some random person",NAME,PROTECTED_HEALTH_INFORMATION,0.9943187832832336,0,18
68,AGE,PROTECTED_HEALTH_INFORMATION,0.22106069326400757,26,28
Penn Medicine,ADDRESS,PROTECTED_HEALTH_INFORMATION,0.1475761979818344,37,50
11/18/2015,DATE,PROTECTED_HEALTH_INFORMATION,0.9999758005142212,51,61
917356281,ID,PROTECTED_HEALTH_INFORMATION,0.9997710585594177,163,172
MYELOMA,DX_NAME,MEDICAL_CONDITION,0.8934065699577332,173,180

... etc. The same pattern occurs in the {}-medical-insights-phi.json file. "Id": 0 through "Id": 48 occur as expected and then they repeat - all within the same enclosing list. This pattern occurs for each of the 1143 pages in "Some random person's" medical record. I have not noticed any other obvious fubars yet.
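For anyone who needs to check for this duplication or strip it after the fact, here is a minimal Python sketch. It assumes the data really is one block of rows repeated whole, back to back, as described above. `collapse_repeats` and `read_unique_entities` are hypothetical helper names for illustration, not part of Textractor, and the code uses only the standard library.

```python
import csv
import io


def collapse_repeats(items):
    """Return the shortest leading block that, repeated whole, reproduces `items`.

    If `items` is not an exact repetition of a shorter block, return it unchanged.
    Assumption: a genuine (non-duplicated) list is never itself periodic.
    """
    n = len(items)
    for size in range(1, n // 2 + 1):
        # A block of length `size` can only tile the list if it divides its length.
        if n % size == 0 and all(items[i] == items[i % size] for i in range(n)):
            return items[:size]
    return items


def read_unique_entities(csv_text):
    """Parse entity CSV text and drop wholesale repeats of the data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return [], []
    header, data = rows[0], rows[1:]
    return header, collapse_repeats(data)


# Two rows from the sample above, written out twice to mimic the problem.
sample = (
    "Text,Type,Category,Score,BeginOffset,EndOffset\n"
    '"Some random person",NAME,PROTECTED_HEALTH_INFORMATION,0.9943187832832336,0,18\n'
    "MYELOMA,DX_NAME,MEDICAL_CONDITION,0.8934065699577332,173,180\n"
    '"Some random person",NAME,PROTECTED_HEALTH_INFORMATION,0.9943187832832336,0,18\n'
    "MYELOMA,DX_NAME,MEDICAL_CONDITION,0.8934065699577332,173,180\n"
)
header, unique_rows = read_unique_entities(sample)
```

The same `collapse_repeats` helper should also work on the entity list loaded from the phi.json file with `json.load`, since it only compares items for equality. The exact layout of that JSON file is not shown here, so locating the list inside it is left as an assumption.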

I would attach the file(s) but (a) it really is PHI and (b) there's A LOT of data.

joshlevy89 commented 3 years ago

+1

Belval commented 7 months ago

As this issue is over three years old and impacts the previous version of Textractor, I will be closing it. Feel free to reopen if the issue persists with Textractor 1.7.5.