aspose-pdf / Aspose.PDF-for-Java

Aspose.PDF for Java examples, plugins and showcases
https://products.aspose.com/pdf/java
MIT License

TextFragmentAbsorber can't match phone number by pattern #55

Open liurupeng821 opened 1 year ago

liurupeng821 commented 1 year ago

The text in this PDF uses embedded fonts, so I can't match the phone number with a regex pattern.

https://lagou-zhaopin-fe.lagou.com/activities/20221229/1672295126482.pdf


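import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;

import org.apache.commons.io.FileUtils;

import com.aspose.pdf.Document;
import com.aspose.pdf.Page;
import com.aspose.pdf.PageCollection;
import com.aspose.pdf.TextEditOptions;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextSearchOptions;

// Matches an 11-digit mobile number (digits may be separated by '-' or whitespace)
// or a landline number with an area code; must not be followed by another digit.
public static final String PHONE_REG = "(?:(?:1[-\\s]*[3456789][-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1})|(?:0[1-9]\\d{1,2}[-\\s]*\\d{7,8}))(?!\\d)";

// getLicense() and logger are defined elsewhere in the class and are not shown here.
public static void main(String[] args) throws Exception {
    byte[] source = FileUtils.readFileToByteArray(new File("/1672295126482.pdf"));
    if (!getLicense()) {
        throw new Exception("com.aspose.pdf lic ERROR!");
    }
    try (ByteArrayInputStream searchInputStream = new ByteArrayInputStream(source);
         ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
        Document pdfDoc = new Document(searchInputStream);

        // true = treat the search phrase as a regular expression
        TextSearchOptions textSearchOptions = new TextSearchOptions(true);
        TextEditOptions textEditOptions = new TextEditOptions(0, TextEditOptions.LanguageTransformation.class);
        TextFragmentAbsorber phoneTextFragmentAbsorber = new TextFragmentAbsorber(
                PHONE_REG,
                textSearchOptions,
                textEditOptions);

        // Search the first page only (page indexes are 1-based)
        PageCollection pages = pdfDoc.getPages();
        Page page = pages.get_Item(1);
        page.accept(phoneTextFragmentAbsorber);

        // Log every fragment the pattern matched; nothing is logged for this file
        for (TextFragment textFragment : phoneTextFragmentAbsorber.getTextFragments()) {
            String text = textFragment.getText();
            logger.info("phone: " + text);
        }

    } catch (Exception e) {
        e.printStackTrace();
    }
}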
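
To narrow down whether the problem lies in the pattern or in the text extracted from the PDF, it may help to run the same pattern against plain strings with java.util.regex, without Aspose.PDF involved at all. The following is only a minimal sketch: the class name PhonePatternCheck, the helper findPhones, and the sample strings are made up for illustration and do not come from the PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhonePatternCheck {
    // Same pattern as PHONE_REG in the post above.
    static final String PHONE_REG = "(?:(?:1[-\\s]*[3456789][-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1}[-\\s]*\\d{1})|(?:0[1-9]\\d{1,2}[-\\s]*\\d{7,8}))(?!\\d)";

    // Hypothetical helper: returns every substring of text that the pattern matches.
    static List<String> findPhones(String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = Pattern.compile(PHONE_REG).matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample inputs, not taken from the PDF in question.
        System.out.println(findPhones("Tel: 138-1234 5678"));
        System.out.println(findPhones("Office: 010-12345678"));
        System.out.println(findPhones("no phone number here"));
    }
}
```

If the pattern matches plain strings like these but not the PDF, the text that the library extracts from the page probably differs from what is visible on screen, which would be consistent with the embedded-font explanation.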

asadalikhan90 commented 1 year ago

@liurupeng821

We are unable to download the linked file here. Can you please create a post in our official support forum along with the sample file? We will definitely test the scenario in our environment and address it accordingly.