Hey! I'm trying to extract text from this file using tikaondotnet.extraction. The code is really basic:
public static string Extract(string path) { var te = new TextExtractor(); return te.Extract(path).Text; }
When I get to the Arabic text part in the attached PDF, I get a lot of warnings like the following:
WARN No Unicode mapping for behini (112) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for seenmed (148) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for meemfin (205) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for alifiso (109) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for lamini (191) in font NSIEBX+OmegaSerifArabicOne
This is the extracted text
I was wondering if there's an option to add a decode specification when extracting the text, or an option to convert all the text to a different font that is supported in Tika?
P.S. the English text is extracted fine :)