KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0

Extract non-ASCII/Unicode text from PDF #148

Open AlmogEinstein opened 3 years ago

AlmogEinstein commented 3 years ago

Hey! I'm trying to extract text from this file using tikaondotnet.extraction. The code is really basic:

    public static string Extract(string path)
    {
        // Run Tika's auto-detecting extractor on the file and return only the plain text.
        var te = new TextExtractor();
        return te.Extract(path).Text;
    }

When I get to the Arabic text part in the attached PDF, I get a lot of warnings like the following:

    WARN No Unicode mapping for behini (112) in font NSIEBX+OmegaSerifArabicOne
    WARN No Unicode mapping for seenmed (148) in font NSIEBX+OmegaSerifArabicOne
    WARN No Unicode mapping for meemfin (205) in font NSIEBX+OmegaSerifArabicOne
    WARN No Unicode mapping for alifiso (109) in font NSIEBX+OmegaSerifArabicOne
    WARN No Unicode mapping for lamini (191) in font NSIEBX+OmegaSerifArabicOne

This is the extracted text

I was wondering if there's an option to add a decoding specification when extracting the text, or an option to convert all the text to a different font that is supported in Tika?

P.S. the English text is extracted fine :)