Open jjacke opened 4 years ago
For all but XML documents we use the Apache Tika parser to extract the text. It claims to support a lot of formats: https://tika.apache.org/1.24.1/formats.html. However, I suspect it has a problem with epub, and it obviously cannot handle DRM-protected documents. I would need to take a closer look to see what the problem is. When I tested it, some documents were handled fine while others were not.
I was surprised to see that I could upload epub files to CATMA; I didn't know that! However, the text seems to be incomplete: it is cut off somewhere in the middle. What could be the problem here?
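To narrow down where the text gets cut off, it may help to measure how much text the epub actually contains and compare that with what ends up in CATMA. The following is only a rough diagnostic sketch, not how CATMA or Tika processes the file. It assumes the epub is an ordinary, DRM-free ZIP container whose content files end in `.xhtml`, `.html` or `.htm` and are UTF-8 encoded. The function name `epub_text_lengths` is made up for this illustration.

```python
import io
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the character data of an (X)HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags (this includes <title> text).
        self.chunks.append(data)


def epub_text_lengths(epub):
    """Return {entry name: number of text characters} per content file.

    `epub` is a file path or a binary file-like object.
    Hypothetical helper for diagnosis only; assumes UTF-8 content files.
    """
    lengths = {}
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            # Assumption: content documents are recognisable by their extension.
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            collector = _TextCollector()
            collector.feed(archive.read(name).decode("utf-8", errors="replace"))
            collector.close()
            lengths[name] = len("".join(collector.chunks).strip())
    return lengths


if __name__ == "__main__":
    import sys

    per_file = epub_text_lengths(sys.argv[1])
    for entry, count in per_file.items():
        print(f"{count:>10}  {entry}")
    print(f"{sum(per_file.values()):>10}  total characters")
```

If the total is much larger than the text that shows up in CATMA, the character count at which the uploaded text stops can hint at the cause: a cut-off at a suspiciously round number would point to a size limit somewhere in the extraction, while a cut-off at a chapter boundary would point to a content file the parser could not read.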