J-F-Liu / lopdf

A Rust library for PDF document manipulation.
MIT License

Implement decoding of Unicode characters #125

Closed JohnAZoidberg closed 1 month ago

JohnAZoidberg commented 3 years ago

Possible duplicate of #86 if decoding and encoding would need to be implemented together.

I've got a PDF with the text 打尼爾 and

println!("{:?}", doc.extract_text(&vec![1]).unwrap());

yields:

"?Identity-H Unimplemented??Identity-H Unimplemented??Identity-H Unimplemented?\n"
KoStard commented 3 years ago

Hello, any updates here?

enzingerm commented 3 years ago

Hey, for a personal project I needed text extraction from OCR'd PDFs which use Identity-H encoding and a ToUnicode CMap. I implemented the basic functionality in my fork of lopdf; it can be found here: https://github.com/enzingerm/lopdf/tree/unicode_cmap. It works for the PDFs I work with, but I'm quite sure it won't work for other kinds of PDFs due to the complexity of the standard and my basic implementation. Maybe anyone wants to give it a try. Feedback is appreciated :)

dkaluza commented 4 months ago

Updated @enzingerm's code to the recent main branch on my fork: https://github.com/dkaluza/lopdf/tree/unicode-cmap

Also added some tests and hotfixed an issue where parsing failed due to non-space whitespace characters in the ToUnicode CIDSystemInfo definition. (I'm not an expert in the PDF standard, so I only made enough improvements to allow the fonts in my PDFs to be parsed.) It probably still needs a nom parser implementation before a merge request to main.

jackpot51 commented 4 months ago

@dkaluza I think you need to add pom to Cargo.toml dependencies

dkaluza commented 4 months ago

@jackpot51

I believe it is specified as an optional dependency and can be enabled with the pom_parser feature. See: https://github.com/J-F-Liu/lopdf/blob/26d8380b2b1a92bcbda058dec81076bdfa335a5d/Cargo.toml#L31

Features can be passed to a cargo command; e.g., the tests in this repo can be run with:

cargo test --no-default-features --features "chrono_time pom_parser rayon"

Or, probably more usefully, specified in the Cargo.toml of the repository depending on lopdf, see: https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#choosing-features
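For illustration, a dependent crate's Cargo.toml could select the fork and its features like this (a sketch; the git URL and branch here just point at the fork mentioned above, and the exact feature set is an assumption):

```toml
[dependencies]
# Hypothetical example: pull the fork and enable the pom-based parser
# alongside the other non-default features used by the test command above.
lopdf = { git = "https://github.com/dkaluza/lopdf", branch = "unicode-cmap", default-features = false, features = ["chrono_time", "pom_parser", "rayon"] }
```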

I would prefer not to change the main dependencies of the repository. It would also be great to add a nom parser implementation so that this runs with the default feature configuration.

jackpot51 commented 4 months ago

I had tried to compile your fork with default features and had an issue with pom not being found. I will add the pom_parser feature, thanks.

dkaluza commented 4 months ago

If I understand the dependencies of lopdf correctly, the pom and nom parser features are intended to be interchangeable: you should be able to use the whole library with either one. (Although, as mentioned, there is no nom implementation for the Unicode decoding yet...)

Anyway, for now I will make pom_parser a default feature on my fork to avoid future confusion.

enzingerm commented 4 months ago

I'm pleased to see that my old work might be of some use. I don't remember exactly, but I just chose whichever parser seemed simpler to implement and ignored the other variant. I think you're right, though: one should choose either the pom or the nom parser.

wmeints commented 3 months ago

I found this issue because I ran into some PDF files encoded as described here. I tried the fork by @dkaluza, but I'm getting different results. My PDF has "Hello, world!" on a single line, but when I run extract_text against that page, I get back "Hello,\nworld!\n" as the output. I'm not sure if that's related to the encoding.

dkaluza commented 3 months ago

I have also encountered this problem in one of the tests.

In my case it looked like the newline was added by the "ET" operator here: https://github.com/dkaluza/lopdf/blob/unicode-cmap/src/parser_aux.rs#L94. But since the same code exists on the main branch (https://github.com/J-F-Liu/lopdf/blob/master/src/parser_aux.rs#L94), I suspect additional investigation into the PDF specification is needed to handle this correctly.

Heinenen commented 2 months ago

I'm not very experienced with PDFs yet, but from reading the spec, it doesn't seem that there is a reliable way to determine line breaks. The closest thing to line breaks in PDFs are probably the Td, TD, and T* operators.

From what I have seen in Chrome and Adobe, it rather seems like they try to apply some heuristic to determine if a \n should be inserted. A simple heuristic that comes to my mind is checking the y-coordinate of the text: if the y-coordinate is the same between two different BT/ET blocks, we don't need to insert a newline character.
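That heuristic can be sketched in a few lines (this is an invented helper for illustration, not lopdf code; the function name and tolerance are assumptions):

```rust
/// Decide whether a newline should be inserted between two text runs,
/// based on the y-coordinate of the text position (hypothetical helper,
/// not part of lopdf). Runs on roughly the same baseline stay on one line.
fn needs_newline(prev_y: f64, cur_y: f64, tolerance: f64) -> bool {
    (prev_y - cur_y).abs() > tolerance
}

fn main() {
    // Same baseline across two BT/ET blocks: no newline needed.
    assert!(!needs_newline(700.0, 700.0, 1.0));
    // Text moved down the page: insert a newline.
    assert!(needs_newline(700.0, 688.0, 1.0));
    println!("ok");
}
```

A real implementation would also need to account for rotated text and for the text rise, but the y-comparison captures the basic idea.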

If we want to discuss this further, I think we should put it into its own issue to not hijack this thread further.

PS: Tagged PDFs are probably capable of conveying where a line break is intended, but they are simply not used often in the real world (to my knowledge).

Heinenen commented 2 months ago

@dkaluza, do you mind me trying to bring your fork into a mergeable state (resolving conflicts, implementing the nom parser)? And/or do you have anything that I should be aware of?

Also, do any of you know why we have two parsers (apart from historical reasons)? From my naive perspective, the biggest difference between the two is that the nom parser is a lot faster (according to the original PR, https://github.com/J-F-Liu/lopdf/pull/60). In my mind, it would thus make sense to remove the pom parser completely to reduce code complexity and maintenance work.

dkaluza commented 2 months ago

@Heinenen I will try to implement the nom version in the next week.

I have just resolved the conflicts and pushed the updated version to my fork. I have no knowledge of why both parsers are currently supported, but I agree that it would be easier to support one, although that should probably be discussed with the maintainers.

Earlier I had some concerns that the current solution doesn't fully cover the specification with regard to Unicode character encoding or other encodings. But perhaps it can be merged as an MVP and improved further as uncovered cases are found.

For example, it currently looks like PDFDocEncoding is not properly handled in annotations and other text strings.

Below is a test built from a fragment of the Updating Example in the spec, which shows that the ellipsis is currently not decoded properly:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn get_annotations_retrieves_pdf_doc_encoded_text() {
        // annotation with subtype text from pdf documentation section Updating Example
        let data = r#"%PDF-1.4
1 0 obj
<< /Type /Catalog
/Outlines 2 0 R
/Pages 3 0 R
>>
endobj

2 0 obj
<< /Type /Outlines
/Count 0
>>
endobj

3 0 obj
<< /Type /Pages
/Kids [4 0 R]
/Count 1
>>
endobj

4 0 obj
<< /Type /Page
/Parent 3 0 R
/MediaBox [0 0 612 792]
/Contents 5 0 R
/Resources << /ProcSet 6 0 R >>
/Annots 7 0 R
>>
endobj

5 0 obj
<< /Length 35 >>
stream
…Page-marking operators…
endstream
endobj

6 0 obj
[/PDF]
endobj

7 0 obj
[
 8 0 R
]
endobj

8 0 obj
<< /Type /Annot
/Subtype /Text
/Rect [44 253 473 337]
/Contents (New Text #3\203a longer text annotation which we will continue \
onto a second line)
/Open true
>>
endobj

xref
0 9
0000000000 65535 f
0000000009 00000 n
0000000075 00000 n
0000000121 00000 n
0000000179 00000 n
0000000313 00000 n
0000000392 00000 n
0000000415 00000 n
0000000442 00000 n

trailer
<< /Size 9
/Root 1 0 R
>>
startxref
622
%%EOF"#;

        let document = Document::load_mem(data.as_bytes()).unwrap();

        let page_id = document.get_pages().get(&1).unwrap().to_owned();
        let annotations = document.get_page_annotations(page_id);
        use crate::Object::Integer;
        let expected_rect = Object::Array(vec![Integer(44), Integer(253), Integer(473), Integer(337)]);
        assert_eq!(
            annotations,
            vec![&dictionary! {
                "Type" => "Annot",
                "Subtype" => "Text",
                "Rect" => expected_rect,
                "Contents" => Object::string_literal("New Text #3…a longer text annotation which we will continue onto a second line"),
                "Open" => Object::Boolean(true)
            }]
        );
    }
}
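As a side note, the decoding step the test above expects can be sketched roughly as follows. This is an invented helper, not the lopdf API; only the 0x83 → U+2026 (ellipsis) mapping exercised by the test is included, while the PDF spec's Appendix D table defines the full special range:

```rust
use std::collections::HashMap;

/// Hypothetical sketch of PDFDocEncoding decoding (not lopdf code).
/// Bytes below 0x80 match ASCII; bytes in 0x80..=0x9F map to special
/// Unicode characters per the spec's table (only the ellipsis entry
/// from the test above is shown here).
fn decode_pdf_doc_encoding(bytes: &[u8]) -> String {
    let specials: HashMap<u8, char> = [(0x83u8, '\u{2026}')].into_iter().collect();
    bytes
        .iter()
        .map(|&b| match b {
            0x00..=0x7F => b as char,
            // Unknown high bytes fall back to the Unicode replacement char.
            _ => *specials.get(&b).copied().as_ref().unwrap_or(&'\u{FFFD}'),
        })
        .collect()
}

fn main() {
    assert_eq!(decode_pdf_doc_encoding(b"New Text #3\x83a"), "New Text #3\u{2026}a");
    println!("ok");
}
```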
Heinenen commented 2 months ago

Alright, then good luck with that :)

Earlier I had some concerns that the current solution doesn't fully cover the specification with regard to Unicode character encoding or other encodings. But perhaps it can be merged as an MVP and improved further as uncovered cases are found.

I think this is a larger problem and should be in a separate PR. Also see the issues #86 and #110. Essentially, lopdf does not decode/encode the text strings at all (no matter if it's PDFDocEncoding, UTF-16BE or UTF-8). You can take a look at my comment https://github.com/J-F-Liu/lopdf/issues/86#issuecomment-2267501122 which describes this with a little more detail.

I just wanted to nitpick your test case that "Contents" should be a string literal instead of a Name, but you were faster! 😉

Heinenen commented 2 months ago

@dkaluza I read through #217, which discusses the same things as this one. I tested whether, with your fork, the attached PDFs work as intended. There are 3 PDFs in there:

Links to comments that contain the PDFs: [0] https://github.com/J-F-Liu/lopdf/issues/217#issuecomment-1502360657 [1] https://github.com/J-F-Liu/lopdf/issues/217#issuecomment-1657266012 [2] https://github.com/J-F-Liu/lopdf/issues/217#issuecomment-1457367413

dkaluza commented 2 months ago

@Heinenen thanks for the references; I will look into those 3 PDFs and see what I can do during implementation.

Update: Indeed, it looks like for csnt44-2023 [1] the pom parser does not extract pages. I got the same result on master with the pom parser, though, so this is probably an unrelated issue and might work out of the box after the nom parser implementation. Anyway, it is still worth considering whether the pom parser should be fixed to address this issue or considered deprecated.

dkaluza commented 1 month ago

@Heinenen

FYI, I have implemented the nom parser version of the ToUnicode CMap extraction; you can currently find it on my fork.

The mentioned PDFs csnt44-2023 [1] and dkp [2] look good to me (judging by the first page).

stallman [0] still does not work; I will try to investigate it this week. I also see that I have some fresh conflicts, which I will also try to resolve this week before submitting an MR.

Update: After investigation, stallman [0] uses a TrueType font whose ToUnicode entry maps single bytes to Unicode characters. Also handling single-byte fonts requires a significant extension of the current approach. I will first fix the conflicts and then see whether this can be handled in some elegant way.
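To illustrate why mixed code lengths complicate things, here is a rough sketch of a ToUnicode lookup that handles both one-byte and two-byte character codes. All names are invented for illustration, and a real CMap determines code length from its codespace ranges (and also supports bfrange entries) rather than by trial lookup:

```rust
use std::collections::HashMap;

/// Invented sketch (not the lopdf implementation): a ToUnicode map that
/// holds both one-byte and two-byte character codes.
struct ToUnicodeMap {
    one_byte: HashMap<u8, String>,
    two_byte: HashMap<u16, String>,
}

impl ToUnicodeMap {
    /// Walk the raw string bytes, trying two-byte codes first (as used by
    /// Identity-H fonts) and falling back to one-byte codes (as seen in
    /// some TrueType fonts).
    fn decode(&self, bytes: &[u8]) -> String {
        let mut out = String::new();
        let mut i = 0;
        while i < bytes.len() {
            if i + 1 < bytes.len() {
                let code = u16::from_be_bytes([bytes[i], bytes[i + 1]]);
                if let Some(s) = self.two_byte.get(&code) {
                    out.push_str(s);
                    i += 2;
                    continue;
                }
            }
            if let Some(s) = self.one_byte.get(&bytes[i]) {
                out.push_str(s);
            }
            i += 1;
        }
        out
    }
}

fn main() {
    // Made-up mappings: a two-byte CID and a one-byte TrueType code.
    let map = ToUnicodeMap {
        one_byte: HashMap::from([(0x41u8, "A".to_string())]),
        two_byte: HashMap::from([(0x0123u16, "打".to_string())]),
    };
    assert_eq!(map.decode(&[0x01, 0x23, 0x41]), "打A");
    println!("ok");
}
```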

dkaluza commented 1 month ago

Submitted pull request to handle this issue.

Currently the pull request does not handle one-byte range mappings, which occur for example in TrueType fonts. To be considered: whether this issue should be left open until that is also implemented, or handled as a separate issue.

Tasty213 commented 1 month ago

Hi, I tried running the code in your PR but couldn't manage to get it to decode my PDF properly (don't let that stop you merging, etc.; I'm sure it's a step in the right direction). Just wondering if anyone knows what might be the issue? example.pdf

dkaluza commented 1 month ago

Worked quite well for me; extracted text:

\nWalk Name:\nRLA5\nPolling District:\nRLA\nDoors:\n102\nLeaflets:\n0\nDeliver By:\nCodes\nNotes\nAddresses\nStreet:Haigh Gardens: 1-15, 2-28\nHaigh Terrace: 1-27, 2-28\nSt Georges Avenue: 1-21, 2-22\nSt Georges Crescent: 1-11, 2-16\nWood Lane: 185-251\n

Am I missing something?

Tasty213 commented 1 month ago

Hi,

Ahh, very interesting. I must be running it wrong? What command did you use?

Best wishes George


dkaluza commented 1 month ago

@Tasty213

Take a look here: https://github.com/J-F-Liu/lopdf/blob/5859443ae4423eb0af6c7e40551e089595ba7fcf/tests/unicode.rs#L131

Basically, this test loads a PDF and extracts text from the first page of the file.

There is also a text extraction example, but it is quite complex. I recall it had some sort of font filtering built in, so I am not sure whether it works with Unicode fonts. I will check later whether it extracts Unicode correctly.

Let me know if you need more help with running it.