ajrcarey / pdfium-render

A high-level idiomatic Rust wrapper around Pdfium, the C++ PDF library used by the Google Chromium project.
https://crates.io/crates/pdfium-render

`PdfPageText::search` returns 0 results for `text_search` example (WASM) #154

Closed synoet closed 1 month ago

synoet commented 1 month ago

I'm encountering an issue with the PdfPageText::search method when compiling to WASM and linking against recent WASM builds of pdfium-lib (I've tried releases from V5668 to 6276).

Steps to reproduce:

  1. Compile pdfium-render to WASM, linked against any recent WASM pdfium-lib release.
  2. Copy the logic from the text_search example; the only modification is that the PDF buffer is passed in from JavaScript.
  3. The search returns 0 results.

When compiled on Linux and linked against pdfium-binaries (from the Arch Linux AUR), the text_search example returns the correct number of results.

I'm not sure this is the best place to ask; it might be an issue with the recent WASM pdfium-lib releases rather than with pdfium-render. I thought I would bring it up in case you have any ideas as to why it's not working as expected.

Any insights or suggestions would be greatly appreciated!

ajrcarey commented 1 month ago

Hi @synoet, thank you for reporting the issue. I can reproduce the problem using the following WASM function based on the text_search example:

use pdfium_render::prelude::*;
use wasm_bindgen::prelude::*;

#[cfg(target_arch = "wasm32")]
#[wasm_bindgen]
pub async fn text_search_test(url: String) {
    // For general comments about pdfium-render and binding to Pdfium, see export.rs.

    let search_term = "French";

    let search_options = PdfSearchOptions::new()
        // Experiment with how the search results change when uncommenting
        // the following search options.

        // .match_whole_word(true)
        // .match_case(true)
        ;

    // Find the position of all occurrences of the search term
    // on the first page of the target document.

    let pdfium = Pdfium::default();

    let document = pdfium.load_pdf_from_fetch(url, None).await.unwrap();

    let page = document.pages().first().unwrap();

    let search_results_bounds = page
        .text()
        .unwrap()
        .search(search_term, &search_options)
        .iter(PdfSearchDirection::SearchForward)
        .enumerate()
        .flat_map(|(index, segments)| {
            segments
                .iter()
                .map(|segment| {
                    log::info!(
                        "Search result {}: `{}` appears at {:#?}",
                        index,
                        segment.text(),
                        segment.bounds()
                    );

                    segment.bounds()
                })
                .collect::<Vec<_>>()
        })
        .collect::<Vec<_>>();

    log::info!("{} search results", search_results_bounds.len());
}

When compiled to WASM and executed from a web page, the output "0 search results" is logged to the JavaScript console. Based on the output of the text_search example, I would expect to see 5 search results.

ajrcarey commented 1 month ago

The problem is a mishandling of the FPDF_WIDESTRING pointer type in the WASM implementation of the FPDFText_FindStart() function. When corrected, the output is:

Search result 0: `French ` appears at PdfRect {
    bottom: PdfPoints {
        value: 329.72476,
    },
    left: PdfPoints {
        value: 361.7285,
    },
    top: PdfPoints {
        value: 337.8637,
    },
    right: PdfPoints {
        value: 394.82318,
    },
}
Search result 1: `French ` appears at PdfRect {
    bottom: PdfPoints {
        value: 288.63016,
    },
    left: PdfPoints {
        value: 249.18015,
    },
    top: PdfPoints {
        value: 296.7691,
    },
    right: PdfPoints {
        value: 282.27484,
    },
}
Search result 2: `French ` appears at PdfRect {
    bottom: PdfPoints {
        value: 268.03287,
    },
    left: PdfPoints {
        value: 269.67044,
    },
    top: PdfPoints {
        value: 276.1718,
    },
    right: PdfPoints {
        value: 302.76514,
    },
}
Search result 3: `French` appears at PdfRect {
    bottom: PdfPoints {
        value: 247.53555,
    },
    left: PdfPoints {
        value: 222.95956,
    },
    top: PdfPoints {
        value: 255.6745,
    },
    right: PdfPoints {
        value: 256.05423,
    },
}
Search result 4: `French ` appears at PdfRect {
    bottom: PdfPoints {
        value: 103.65446,
    },
    left: PdfPoints {
        value: 166.09702,
    },
    top: PdfPoints {
        value: 111.79339,
    },
    right: PdfPoints {
        value: 199.19168,
    },
}
5 search results

This gives the correct number of search results and output that matches the text_search example.
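
For readers unfamiliar with the type: Pdfium's headers describe FPDF_WIDESTRING as a pointer to a UTF-16LE encoded string terminated by a two-byte NUL. The sketch below only illustrates that byte layout for a Rust search term. The helper name to_utf16le_with_nul is hypothetical, and this is not the code from the actual patch to the WASM bindings.

```rust
/// Encodes `text` as UTF-16LE bytes followed by a two-byte NUL terminator,
/// the layout Pdfium expects to find behind an FPDF_WIDESTRING pointer.
/// (Hypothetical helper for illustration; not part of pdfium-render's API.)
pub fn to_utf16le_with_nul(text: &str) -> Vec<u8> {
    let mut bytes: Vec<u8> = text
        .encode_utf16() // Rust strings are UTF-8; re-encode as UTF-16 code units
        .flat_map(|unit| unit.to_le_bytes()) // low byte first (little-endian)
        .collect();

    // Terminate with one zero code unit, i.e. two zero bytes.
    bytes.extend_from_slice(&[0, 0]);

    bytes
}
```

Calling to_utf16le_with_nul("French") yields 14 bytes: six two-byte code units plus the terminator. If the callee reads a pointer that does not refer to a buffer in this form, it will see a different search term than the one intended, which would be consistent with a search that silently returns no matches.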

Preparing a patch now.

ajrcarey commented 1 month ago

Pushed fix to FPDF_WIDESTRING handling in WASM bindings. Fix will be included in crate release 0.8.24. In the meantime, you can take pdfium-render as a git dependency in your Cargo.toml to test the fix.
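
A git dependency entry might look like the following sketch. The repository URL is assumed from the project name shown at the top of this page; add a branch or rev key if you need a reproducible build.

```toml
[dependencies]
# Assumed repository location; replaces the crates.io version until the release is published.
pdfium-render = { git = "https://github.com/ajrcarey/pdfium-render" }
```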

synoet commented 1 month ago

Thank you very much! It works as expected now.

And thank you so much for your work on this project!