forTEXT / catma

Computer Assisted Text Markup and Analysis
https://www.catma.de
GNU General Public License v3.0

Upload problem epub files #201

Open jjacke opened 4 years ago

jjacke commented 4 years ago

I was surprised to see that I could upload epub files to CATMA – I didn't know that! However, the text seems to be incomplete – it is cut off somewhere in the middle. What could be the problem here?

mpetris commented 4 years ago

For all but XML documents, we use the Apache Tika parser to extract the text. It claims to support a lot of formats (https://tika.apache.org/1.24.1/formats.html), but I suspect it has a problem with epub. It also obviously cannot handle DRM-protected documents. I would need to take a closer look to see what the problem is. When I tested it, some documents were handled fine while others were not.
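
To narrow down which files are affected, it might help to inspect a problem epub before uploading it. The following is only a rough diagnostic sketch in Python using the standard library, not part of CATMA or Tika. The function name `inspect_epub` is hypothetical. It relies on the EPUB container layout (a ZIP archive with `META-INF/container.xml` pointing to the package document) and treats the presence of `META-INF/encryption.xml` as a hint of DRM, which is an assumption: that file can also appear when only fonts are obfuscated.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container file and the package (OPF) document
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def inspect_epub(source):
    """Report whether an EPUB has an encryption marker and list its spine items.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())
        # Assumption: encryption.xml hints at DRM, but it is also used
        # for font obfuscation, so this is not conclusive.
        has_encryption_xml = "META-INF/encryption.xml" in names

        # container.xml points to the package document (the .opf file)
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")

        # The manifest maps item ids to hrefs; the spine gives the reading order
        opf = ET.fromstring(zf.read(opf_path))
        manifest = {
            item.get("id"): item.get("href")
            for item in opf.findall("opf:manifest/opf:item", OPF_NS)
        }
        # hrefs are relative to the location of the package document
        spine = [
            manifest.get(ref.get("idref"))
            for ref in opf.findall("opf:spine/opf:itemref", OPF_NS)
        ]

    return {
        "has_encryption_xml": has_encryption_xml,
        "opf_path": opf_path,
        "spine": spine,
    }
```

Calling `inspect_epub("book.epub")` (with `book.epub` as a placeholder path) returns the spine in reading order. Comparing that list with the point where the uploaded text ends may show whether the extraction stops at a particular content file.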