I am planning to use Comprehend Medical in production in a new biomedical research product we are working on. I used Textractor to process a 1143-page PDF of a single patient's medical records. And it worked - amazing!
However, I noticed that the entire set of extracted entities is repeated in the {}-medical-insights-entities.csv file that is generated for each page of the PDF. The file contents look like this ...
Text,Type,Category,Score,BeginOffset,EndOffset
"Some random person",NAME,PROTECTED_HEALTH_INFORMATION,0.9943187832832336,0,18
68,AGE,PROTECTED_HEALTH_INFORMATION,0.22106069326400757,26,28
Penn Medicine,ADDRESS,PROTECTED_HEALTH_INFORMATION,0.1475761979818344,37,50
11/18/2015,DATE,PROTECTED_HEALTH_INFORMATION,0.9999758005142212,51,61
917356281,ID,PROTECTED_HEALTH_INFORMATION,0.9997710585594177,163,172
MYELOMA,DX_NAME,MEDICAL_CONDITION,0.8934065699577332,173,180
... there are lots of valid entries and then the entire set repeats.
"Some random person",NAME,PROTECTED_HEALTH_INFORMATION,0.9943187832832336,0,18
68,AGE,PROTECTED_HEALTH_INFORMATION,0.22106069326400757,26,28
Penn Medicine,ADDRESS,PROTECTED_HEALTH_INFORMATION,0.1475761979818344,37,50
11/18/2015,DATE,PROTECTED_HEALTH_INFORMATION,0.9999758005142212,51,61
917356281,ID,PROTECTED_HEALTH_INFORMATION,0.9997710585594177,163,172
MYELOMA,DX_NAME,MEDICAL_CONDITION,0.8934065699577332,173,180
... etc. The same pattern occurs in the {}-medical-insights-phi.json file. "Id": 0 through "Id": 48 occur as expected and then they repeat - all within the same enclosing list. This pattern occurs for each of the 1143 pages in "Some random person's" medical record. I have not noticed any other obvious fubars yet.
I would attach the file(s) but (a) it really is PHI and (b) there's A LOT of data.
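Since the real files cannot be shared, here is a minimal sketch of how the repetition could be confirmed and worked around on a single entities CSV. It is not part of Textractor; `dedupe_entities` is a hypothetical helper, and the sample text is a shortened copy of the rows shown above. A row counts as a duplicate only when every column matches, including BeginOffset and EndOffset, so the same entity appearing at a different position on the page is kept.

```python
import csv
import io


def dedupe_entities(csv_text):
    """Return (header, unique_rows, duplicate_count) for an entities CSV.

    Hypothetical helper for illustration; a row is a duplicate only if
    all of its columns (text, type, category, score, offsets) match an
    earlier row exactly.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    seen = set()
    unique_rows = []
    duplicate_count = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        key = tuple(row)
        if key in seen:
            duplicate_count += 1
        else:
            seen.add(key)
            unique_rows.append(row)
    return header, unique_rows, duplicate_count


# Shortened sample based on the rows in this report.
sample = """Text,Type,Category,Score,BeginOffset,EndOffset
"Some random person",NAME,PROTECTED_HEALTH_INFORMATION,0.9943187832832336,0,18
68,AGE,PROTECTED_HEALTH_INFORMATION,0.22106069326400757,26,28
"Some random person",NAME,PROTECTED_HEALTH_INFORMATION,0.9943187832832336,0,18
68,AGE,PROTECTED_HEALTH_INFORMATION,0.22106069326400757,26,28
"""

header, rows, dup_count = dedupe_entities(sample)
print(f"{len(rows)} unique rows, {dup_count} duplicates")
```

If the duplicate count equals the unique row count, the whole set was written exactly twice, which matches what I am seeing.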
As this issue is over three years old and affects a previous version of Textractor, I will be closing it. Feel free to reopen if the issue persists with Textractor 1.7.5.