caj2pdf / caj-convert

command-line caj conversion utility using the genuine/original libreaderex, code originally posted at https://github.com/caj2pdf/caj2pdf/issues/63

Blank pages created #1

Closed ghost closed 3 years ago

ghost commented 3 years ago

See https://github.com/caj2pdf/caj2pdf/issues/63#issuecomment-880253362.

ghost commented 3 years ago

Users may encounter this issue because the Resource directory is missing.

Along with ReaderEx_x64.dll (and its dependencies) or libreaderex_x64.so, the Resource directory is required. Maybe we should edit this line.

https://github.com/caj2pdf/caj-convert/blob/6705e300bcaeebe77efc86697bb95f7a4de62747/caj-convert.c#L10

[comment]: # (BTW: The code only works on x64 architecture. x86 version ReaderEx.dll takes only three flags instead of four. https://github.com/caj2pdf/caj-convert/blob/6705e300bcaeebe77efc86697bb95f7a4de62747/caj-convert.c#L44 and this should be param.flag[2] = 0x26;. https://github.com/caj2pdf/caj-convert/blob/6705e300bcaeebe77efc86697bb95f7a4de62747/caj-convert.c#L104)

HinTak commented 3 years ago

You have tested this? I seem to remember some caj's simply don't display correctly with the whole original app (on Linux). When it does not correctly display it is likely conversion is buggy too.

I'll give it a try at some point.

ghost commented 3 years ago

Yes. The sample you mentioned can be converted. Result attached.

THE PROBLEM IS, as you mentioned here, fonts are missing.

3,6c3,12
< HGHT_CNKI                            CID TrueType      Identity-H       no  no  no      31  0
< B4+HGFX_CNKI                         CID TrueType      Identity-H       yes no  yes     34  0
< B5+HGBZ_CNKI                         CID TrueType      Identity-H       yes no  yes     37  0
< B6+HGHZ_CNKI                         CID TrueType      Identity-H       yes no  yes     40  0
---
> B3+SimSun                            CID TrueType      Identity-H       yes no  yes     45  0
> B4+华光楷体_CNKI                     CID TrueType      Identity-H       yes no  yes     48  0
> B5+HGFX_CNKI                         CID TrueType      Identity-H       yes no  yes     51  0
> B6+HGBZ_CNKI                         CID TrueType      Identity-H       yes no  yes     54  0
> B7+华光仿宋_CNKI                     CID TrueType      Identity-H       yes no  yes     57  0
> B8+华光黑体_CNKI                     CID TrueType      Identity-H       yes no  yes     60  0
> B9+HGHZ_CNKI                         CID TrueType      Identity-H       yes no  yes     63  0
> B10+华光书宋_CNKI                    CID TrueType      Identity-H       yes no  yes     66  0
> B11+SimSun                           CID TrueType      Identity-H       yes no  yes     69  0
> B12+SimSun                           CID TrueType      Identity-H       yes no  yes     72  0
ghost commented 3 years ago

As you can see, ReaderEx_x64.dll or ReaderEx.dll on Windows do produce nicer results. The file produced by libreaderex_x64.so has fewer fonts embedded.

Visual comparison below.

[Images: ReaderEx_x64 output, libreaderex_x64 output]

ghost commented 3 years ago

We may further investigate why fonts are not embedded in files produced by libreaderex_x64.so (this happens especially when the sources are pure-text HN). Otherwise, the converted PDF is still readable, so it is only text style that matters...

Currently, if one wishes a nicer conversion, I suggest running this on Windows.

HinTak commented 3 years ago

Whether fonts are embedded is less important than whether the encoding information is there (most PDF viewers can substitute a suitable replacement font if the encoding information is present). This means the "Resource/Adobe-GB1.bin" file is needed at least, and the encoding should show as "Adobe-GB1-H" instead of Identity-H.

One thing one can do is to trace the open system call with strace on Linux to see what files the process tries (and fails) to open.

HinTak commented 3 years ago

> Yes. The sample you mentioned can be converted. Result attached.

Your files appear to have been changed after production?

HinTak commented 3 years ago

I tried running both the 32-bit windows and the 64-bit windows version under wine. The result is even more interesting.

Both treat SimSun specially, but differently. The Windows version does not draw any text at all if it is missing (as under wine); the Linux version draws with it by name if present, but draws with HGHT_CNKI if it is missing, and does not embed it in either case.

To run the Windows version with wine you need, besides the "Resource" directory, to copy simsun to wine's c:\windows\fonts and also extract freetype.dll, ImageCodec.dll and libcrypto-1_1.dll (or the -x64.dll/_x64.dll versions), which ReaderEx.dll depends on.

ghost commented 3 years ago

Great! Thank you for the research. The files I uploaded were stripped, thus, things like file metadata and file time are not present. I am curious about how you can run the 32-bit Windows version. Have you fixed the Parameter structure?

I examined the PDF file downloaded directly from CNKI. No, no Adobe-GB1-H.

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
B3+SimSun                            CID TrueType      Identity-H       yes no  yes      3  0
B4+楷体                              CID TrueType      Identity-H       yes no  yes      4  0
B5+CAJSymbolA                        CID TrueType      Identity-H       yes no  yes      5  0
B6+CAJ FNT00                         TrueType          WinAnsi          yes no  yes      6  0
B7+仿宋                              CID TrueType      Identity-H       yes no  yes      7  0
黑体                                 CID TrueType      Identity-H       no  no  no       8  0
B9+CAJ FNT04                         TrueType          WinAnsi          yes no  yes      9  0
SimSun                               CID TrueType      Identity-H       no  no  no      10  0

On Windows (whether x86 or x64):

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
B3+SimSun                            CID TrueType      Identity-H       yes no  yes     45  0
B4+华光楷体_CNKI                     CID TrueType      Identity-H       yes no  yes     48  0
B5+HGFX_CNKI                         CID TrueType      Identity-H       yes no  yes     51  0
B6+HGBZ_CNKI                         CID TrueType      Identity-H       yes no  yes     54  0
B7+华光仿宋_CNKI                     CID TrueType      Identity-H       yes no  yes     57  0
B8+华光黑体_CNKI                     CID TrueType      Identity-H       yes no  yes     60  0
B9+HGHZ_CNKI                         CID TrueType      Identity-H       yes no  yes     63  0
B10+华光书宋_CNKI                    CID TrueType      Identity-H       yes no  yes     66  0
B11+SimSun                           CID TrueType      Identity-H       yes no  yes     69  0
B12+SimSun                           CID TrueType      Identity-H       yes no  yes     72  0

On Linux:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
HGHT_CNKI                            CID TrueType      Identity-H       no  no  no      31  0
B4+HGFX_CNKI                         CID TrueType      Identity-H       yes no  yes     34  0
B5+HGBZ_CNKI                         CID TrueType      Identity-H       yes no  yes     37  0
B6+HGHZ_CNKI                         CID TrueType      Identity-H       yes no  yes     40  0

On Linux (after copying simsun.ttc):

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
SimSun                               CID TrueType      Identity-H       no  no  no       3  0
B4+HGFX_CNKI                         CID TrueType      Identity-H       yes no  yes      4  0
B5+HGBZ_CNKI                         CID TrueType      Identity-H       yes no  yes      5  0
B6+HGHZ_CNKI                         CID TrueType      Identity-H       yes no  yes      6  0
ghost commented 3 years ago

On 32bit Windows, this should be uint32_t flag[3];. https://github.com/caj2pdf/caj-convert/blob/963e16e519e8948557025318460483fa467320e2/caj-convert.c#L47 And this should be param.flag[2] = 0x26;. https://github.com/caj2pdf/caj-convert/blob/963e16e519e8948557025318460483fa467320e2/caj-convert.c#L107

HinTak commented 3 years ago

Yes, I had your earlier message on 32-bit Windows in an email, although you deleted it on GitHub. It needs a 3rd line of change: the DLL name is "Readerex.dll" without the "x64" part. I have the changes/additions locally but haven't committed them yet. It needs some rather ugly "ifdef _WIN64"... Also the moved/relocated flags seem to suggest it is just a padding difference between 32-bit and 64-bit. I wonder if struct param should be written differently.

HinTak commented 3 years ago

I don't mean that. What I meant was that since the next item in param is "*src", which is a pointer, the compiler would align it to 64-bit and pad the struct before it to make sure it is on an 8-byte boundary. I wonder if there is a way of writing the struct such that the compiler would automatically move/pad the previous item into the locations you need, without any explicit "#ifdef". Putting one of the values before it as "long" or "long long" (which differ in size between 32-bit and 64-bit) might do it. We still have the Linux case, which again is different regarding the size of "long" and "long long".

HinTak commented 3 years ago

> I examined the PDF file downloaded directly from CNKI. No, no Adobe-GB1-H.
>
> name                                 type              encoding         emb sub uni object ID
> ------------------------------------ ----------------- ---------------- --- --- --- ---------
> B3+SimSun                            CID TrueType      Identity-H       yes no  yes      3  0
> B4+楷体                              CID TrueType      Identity-H       yes no  yes      4  0
> B5+CAJSymbolA                        CID TrueType      Identity-H       yes no  yes      5  0
> B6+CAJ FNT00                         TrueType          WinAnsi          yes no  yes      6  0
> B7+仿宋                              CID TrueType      Identity-H       yes no  yes      7  0
> 黑体                                 CID TrueType      Identity-H       no  no  no       8  0
> B9+CAJ FNT04                         TrueType          WinAnsi          yes no  yes      9  0
> SimSun                               CID TrueType      Identity-H       no  no  no      10  0

It seems that pdffonts is buggy (or has become buggy since it got adopted by the poppler folks). If you run strings ... | more on the same PDF you posted and scroll down a bit, you find object 34 and object 8:

34 0 obj
/Type /Font
/Subtype /CIDFontType2
/BaseFont /#ba#da#cc#e5
/FontDescriptor 33 0 R
/CIDSystemInfo<</Registry (Adobe)/Ordering (GB1) /Supplement 0>>
/DW 1000
/W[ 
endobj
8 0 obj
/Type/Font
/Subtype/Type0
/DescendantFonts [34 0 R]
/BaseFont /#ba#da#cc#e5
/Name /F5
/Encoding /Identity-H
endobj

#ba#da#cc#e5 is the PostScript-style hex-escaped name "黑体" in GB2312 encoding. This snippet says "黑体" is a Type0 font derived from the CIDFontType2 "黑体" with Adobe-GB1-0 encoding. (The Identity-H in the Type0 entry is simply uninformative: it just means the encoding is unchanged from the ancestor's Adobe-GB1-0.)
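For anyone who wants to check the escaped name by hand, here is a minimal sketch of undoing the #xx escapes. The helper names hex_value and decode_pdf_name are made up for illustration and are not part of caj-convert or poppler; the output is raw bytes, which still have to be interpreted as GB2312 by whatever displays them.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
int hex_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = tolower(c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/*
 * Decode "#xx" escapes in a PDF name (given without the leading '/') into
 * raw bytes. Writes at most cap bytes to out and returns the number written.
 * Hypothetical helper for illustration only.
 */
size_t decode_pdf_name(const char *name, unsigned char *out, size_t cap)
{
    size_t n = 0;

    assert(name != NULL && out != NULL);
    while (*name != '\0' && n < cap) {
        int hi, lo;

        if (name[0] == '#' &&
            (hi = hex_value((unsigned char)name[1])) >= 0 &&
            (lo = hex_value((unsigned char)name[2])) >= 0) {
            out[n++] = (unsigned char)(hi * 16 + lo);  /* one escaped byte */
            name += 3;
        } else {
            out[n++] = (unsigned char)*name++;         /* literal character */
        }
    }
    return n;
}
```

Calling decode_pdf_name("#ba#da#cc#e5", buf, sizeof buf) yields the four bytes BA DA CC E5, i.e. the two GB2312 double-byte codes for "黑体".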

It appears that ReaderEx (or the official conversion) often defaults to not embedding fonts shipped by Microsoft, like "黑体" (SimHei) and SimSun.

ghost commented 3 years ago

Try this.

struct Parameter
{
    // The size of the structure, in bytes.
    size_t cb;
    uint32_t flag[2];
    char *src;
    char *extname;
    // Function pointers for open, read, seek, tell, eof, and close.
    void *pfnFILE[6];
    char *dest;
    // Function pointers.
    void *pfnoss[4];
};

Then modify the code accordingly. Are you wondering why we used the magic value 0x78? It is just the size of the (padded) structure on (most) x64 systems. Of course you can update it to param.cb = sizeof param;. Several hours ago I was sleep-coding, so... Sorry for my mistake and thanks for making me aware of it.
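A small sketch to check the layout claim, repeating the structure above so that it compiles on its own. The helpers padding_before_src and parameter_init are hypothetical names, not functions from caj-convert. Assuming 8-byte size_t and pointers, the members add up to 8 + 8 + 13 × 8 = 120 = 0x78 bytes; assuming 4-byte size_t and pointers, they add up to 4 + 8 + 13 × 4 = 64 bytes, with no padding needed in either case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct Parameter
{
    size_t cb;           /* size of the structure, in bytes */
    uint32_t flag[2];
    char *src;
    char *extname;
    void *pfnFILE[6];    /* open, read, seek, tell, eof, close */
    char *dest;
    void *pfnoss[4];
};

/* Bytes of compiler-inserted padding between flag[] and src. */
size_t padding_before_src(void)
{
    return offsetof(struct Parameter, src)
         - (offsetof(struct Parameter, flag)
            + sizeof(((struct Parameter *)0)->flag));
}

/* Zero the structure and set cb with sizeof instead of hard-coding 0x78. */
void parameter_init(struct Parameter *param)
{
    assert(param != NULL);
    *param = (struct Parameter){0};
    param->cb = sizeof *param;
}
```

Printing sizeof(struct Parameter) and padding_before_src() on each target is a quick way to confirm that the same source gives the layout each library expects.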

HinTak commented 3 years ago

Okay, that looks much better. But it means there is a 4th difference of 32-bit vs 64-bit - that sizeof value. This likely means the value is not read/checked/used at all - now I wonder what happens if we put garbage (like zero...) to it :) .

ghost commented 3 years ago

I do not get the point. What are the other three differences? Usually putting garbage into it does not affect the behavior. However, one should never try this, for it is unsafe. Making the first member of a structure the size of the structure itself is a common practice.

Though the standard gives no guarantee, most compilers make size_t 4 bytes on x86 targets and 8 bytes on x64 targets. Problem fixed.

ghost commented 3 years ago

> I don't mean that. What I meant was that since the next item in param is "*src", which is a pointer, the compiler would align it to 64-bit and pad the struct before it to make sure it is on an 8-byte boundary. I wonder if there is a way of writing the struct such that the compiler would automatically move/pad the previous item into the locations you need, without any explicit "#ifdef". Putting one of the values before it as "long" or "long long" (which differ in size between 32-bit and 64-bit) might do it. We still have the Linux case, which again is different regarding the size of "long" and "long long".

Yes. In the earlier comment I have rewritten the structure. size_t is what you need.

I guess the "4 differences" you mentioned are as follows:

1. different structure (solved by applying size_t);
2. different flag assignment (no longer a problem now that the structure is unified);
3. different library filename;
4. different cb value (just apply sizeof).

Have a nice weekend :-)

HinTak commented 3 years ago

Made the change, https://github.com/caj2pdf/caj-convert/commit/3bd6634c8c25e050c2617d901ab49cef6b287e88

BTW, the macros and checks for CAJ2PDF_OS_WINDOWS / CAJ2PDF_OS_LINUX are a bit redundant. Macros starting with `__` (and `_`) are supposed to be internal to the compiler's preprocessor and not user-defined. For that reason there is no "standard" for them, but since everything on Linux is gcc and everything on Windows is Visual Studio, and even clang and mingw gcc try to be compatible with them on the internal platform macros, `__linux__` and `_WIN64`, `_WIN32` (stupidly, Microsoft's 64-bit compiler also defines _WIN32/WIN32, so this is not useful!) are just fine. I'd say that just using `__linux__` and `_WIN64` is okay, and there is no need to check for conflicts either, since you are not supposed to define these yourself. I don't feel too strongly about them to change though.

It is also slightly better to do two consecutive and independent 'ifdef endif's, rather than one ifdef elif end, because the former allows one to re-order/re-arrange/delete code easier. In fact arguably it is better to split the two platforms into two mains, instead of scattering a lot of ifdef in the middle. But - I don't feel strongly enough about it to modify code which is already working, just for "style" or ideology.
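As a rough illustration of the "independent ifdef/endif" style, here is a sketch that picks the library file name using only compiler-predefined macros. The function name reader_library_name is hypothetical; the file names are the ones mentioned in this thread, and the extra check for 32-bit Windows is my assumption about how one would separate it from _WIN64.

```c
#include <stddef.h>

/*
 * Pick the ReaderEx library file name for the current target.
 * Each platform gets its own independent #if/#endif block, so a block can
 * be moved or deleted without touching the others.
 * Returns NULL on a platform not covered here.
 */
const char *reader_library_name(void)
{
    const char *name = NULL;

#if defined(_WIN32) && !defined(_WIN64)
    name = "ReaderEx.dll";        /* 32-bit Windows */
#endif

#ifdef _WIN64
    name = "ReaderEx_x64.dll";    /* 64-bit Windows (also defines _WIN32) */
#endif

#ifdef __linux__
    name = "libreaderex_x64.so";  /* Linux; only the x64 build is discussed here */
#endif

    return name;
}
```

The trade-off is that nothing stops two blocks from both matching; with these three macros that cannot happen on the compilers mentioned above, which is why no conflict check is included.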

HinTak commented 3 years ago

Oh, and the windows calling by ordinal is very flaky... that means one needs to keep a specific version of dlls.

ghost commented 3 years ago

Hah. You said it. Initially, I did not expect there to be many differences between platforms. So I did not split the code.

> It is also slightly better to do two consecutive and independent 'ifdef endif's, rather than one ifdef elif end, because the former allows one to re-order/re-arrange/delete code easier.

Yes. My fault. Tell me to split it if you wish. You are always welcome.

> Oh, and the windows calling by ordinal is very flaky... that means one needs to keep a specific version of dlls.

AFAIK, if CNKI's programmers do not add functions into or remove them from the source code, the ordinal should not change.

> BTW, the macros and checks for CAJ2PDF_OS_WINDOWS / CAJ2PDF_OS_LINUX are a bit redundant.

Well, this is for compatibility with most compilers. The code should be almost the same as Boost's. You can use Boost.Predef instead, which hides details and is more accurate.

lelandyang commented 3 years ago

> Hah. You said it. Initially, I did not expect there to be many differences between platforms. So I did not split the code.
>
> > It is also slightly better to do two consecutive and independent 'ifdef endif's, rather than one ifdef elif end, because the former allows one to re-order/re-arrange/delete code easier.
>
> Yes. My fault. Tell me to split it if you wish. You are always welcome.
>
> > Oh, and the windows calling by ordinal is very flaky... that means one needs to keep a specific version of dlls.
>
> AFAIK, if CNKI's programmers do not add functions into or remove them from the source code, the ordinal should not change.
>
> > BTW, the macros and checks for CAJ2PDF_OS_WINDOWS / CAJ2PDF_OS_LINUX are a bit redundant.
>
> Well, this is for compatibility with most compilers. The code should be almost the same as Boost's. You can use Boost.Predef instead, which hides details and is more accurate.

In fact, the ReaderEx library was upgraded frequently and some functions were added. It is therefore recommended to attach a copy of the library that has ordinal 216. :smile:

HinTak commented 3 years ago

Boost needs to support a dozen OSes. There are only 3 for ReaderEx. Just __linux__ / _WIN64 / __APPLE__ and the CPU flags are enough. Why make it longer than necessary?

The ordinal issue is a concern.

lelandyang commented 3 years ago

By chance I discovered that the libraries shipped with CNKI_express v0.0.3 and CNKI_express v0.0.11 have the same release number but different sizes and hashes. File ReaderEx_library.zip attached. I did not explore whether this was due to compilation optimization and compiler parameters, or whether they simply edited the library and kept the rc file unchanged.

[Screenshots: 捕获 (Capture), 无标题 (Untitled)]

ghost commented 3 years ago

Almost the same, I think, based on entropy. [Images: ReaderEx_x64-11.dll, ReaderEx_x64-03.dll]

HinTak commented 3 years ago

The Entropy graph is somewhat meaningless, for analysing differences between dlls, since the bulk of entropy is in the common parts.

Anyway, I happened to have unpacked the earlier Readex.dll from 7.3.133 (current is 7.3.141) and it seems that the dependency on libcrypto-1_1.dll is new, as the 9-month-old dll from 7.3.133 did not depend on it. So I suspect the ordinals may change quite frequently.

Since the windows dll seems to work out under wine (with "Resource" copied and simsun.tff copied), and since it uses freetype.dll, it should be interesting to look at what calls it makes to freetype (either with a customized freetype, or via wine's debug relay function for studying communications between dll's).

HinTak commented 3 years ago

One more thing - I thought the software might respect the TrueType "can embed" flag (it is in the OS/2 table), but all the fonts in cajfonts and simsun allow editable embedding, so that's not it. SimSun itself seems to be special within Readex.dll (associated with GB1).

HinTak commented 3 years ago

> Yes. The sample you mentioned can be converted. Result attached.
>
> THE PROBLEM IS, as you mentioned here, fonts are missing.
>
> 3,6c3,12
> < HGHT_CNKI                            CID TrueType      Identity-H       no  no  no      31  0
> < B4+HGFX_CNKI                         CID TrueType      Identity-H       yes no  yes     34  0
> < B5+HGBZ_CNKI                         CID TrueType      Identity-H       yes no  yes     37  0
> < B6+HGHZ_CNKI                         CID TrueType      Identity-H       yes no  yes     40  0
> ---
> > B3+SimSun                            CID TrueType      Identity-H       yes no  yes     45  0
> > B4+华光楷体_CNKI                     CID TrueType      Identity-H       yes no  yes     48  0
> > B5+HGFX_CNKI                         CID TrueType      Identity-H       yes no  yes     51  0
> > B6+HGBZ_CNKI                         CID TrueType      Identity-H       yes no  yes     54  0
> > B7+华光仿宋_CNKI                     CID TrueType      Identity-H       yes no  yes     57  0
> > B8+华光黑体_CNKI                     CID TrueType      Identity-H       yes no  yes     60  0
> > B9+HGHZ_CNKI                         CID TrueType      Identity-H       yes no  yes     63  0
> > B10+华光书宋_CNKI                    CID TrueType      Identity-H       yes no  yes     66  0
> > B11+SimSun                           CID TrueType      Identity-H       yes no  yes     69  0
> > B12+SimSun                           CID TrueType      Identity-H       yes no  yes     72  0

I wonder whether the difference may be due to case sensitivity. Resource/fontmap.xml has file names all in upper case. The three that appear in the Linux version do not appear in either lower or upper case in the file. Maybe one thing to try is to symlink / copy uppercase versions of all the font files.

lelandyang commented 3 years ago

> One more thing - I thought the software might respect the TrueType "can embed" flag (it is in the OS/2 table), but all the fonts in cajfonts and simsun allow editable embedding, so that's not it. SimSun itself seems to be special within Readex.dll (associated with GB1).

The editable embedding you are referring to means embedding the whole TrueType font data; embedding a subset of a TTF makes the PDF not editable. SimSun and Heiti are two CIDFonts, different from TrueType fonts. Only matrix data was stored in the resulting PDF. What is more interesting is that there are a couple of Type 1 fonts shipped in the Resource folder: just unzip fonts.bin and you will find dozens of them.

lelandyang commented 3 years ago

> The Entropy graph is somewhat meaningless, for analysing differences between dlls, since the bulk of entropy is in the common parts.
>
> Anyway, I happened to have unpacked the earlier Readex.dll from 7.3.133 (current is 7.3.141) and it seems that the dependency on libcrypto-1_1.dll is new, as the 9-month-old dll from 7.3.133 did not depend on it. So I suspect the ordinals may change quite frequently.
>
> Since the windows dll seems to work out under wine (with "Resource" copied and simsun.tff copied), and since it uses freetype.dll, it should be interesting to look at what calls it makes to freetype (either with a customized freetype, or via wine's debug relay function for studying communications between dll's).

libreaderex was updated frequently, and I suppose there was even a refactor: several map files were removed from the DLL and shipped standalone, with extra flower rims added as bitmaps.

HinTak commented 3 years ago

I have extracted all the windows dlls I have - https://github.com/caj2pdf/ReaderEx-Archive , plus the current linux ones too, and the Resource files. (and two more linux ones from the cnkiexpress). That's 6 32-bit ones, 2 64-bit ones, and 3 linux ones.

Regarding the two versions of 64-bit 2.3.3982.0 - I know one difference: the shipped ImageCodec_x64.dll are identical, but freetype.dll and libcrypto-1_1-x64.dll were unsigned in 0.0.3 but have digital signatures in 0.0.11 (otherwise identical). The dependent libraries being signed would affect how ReaderEx_x64.dll is linked. I am not able to go any further, but it would be interesting to know if they differ besides the two dependent libraries being unsigned in the earlier one and signed in the later one (the signatures would affect memory offsets in linking, at least...).

I have only used the cnkiexpress 0.0.11 version of the 64-bit dlls (3982) and the Viewer 7.3.141 version of the 32-bit dlls (3983). Apparently they switched from depending on libeay32 to libcrypto very recently, between 2.3.3981.0 and 2.3.3982.0 (just one build earlier!). That's a fairly major change, quite recently.

> > One more thing - I thought the software might respect the TrueType "can embed" flag (it is in the OS/2 table), but all the fonts in cajfonts and simsun allow editable embedding, so that's not it. SimSun itself seems to be special within Readex.dll (associated with GB1).
>
> The editable embedding you are referring to means embedding the whole TrueType font data; embedding a subset of a TTF makes the PDF not editable. SimSun and Heiti are two CIDFonts, different from TrueType fonts. Only matrix data was stored in the resulting PDF. What is more interesting is that there are a couple of Type 1 fonts shipped in the Resource folder: just unzip fonts.bin and you will find dozens of them.

You need to read the opentype specification (the ISO standard for fonts). Editable Embedding is fsType = 0x0008 in the OS/2 Table. It can take other values such as Installable Embedding, Restricted License embedding, Preview & Print embedding, Editable embedding, No subsetting, Bitmap embedding only; but so far none of the fonts concerned have any other value than 0x0008.
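For reference, a minimal sketch of reading and decoding that field. It assumes you already have the raw bytes of a font's OS/2 table in memory (locating the table in the font file is not shown); fsType is the big-endian 16-bit value at byte offset 8 of that table. The function and enum names are my own, not from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* OS/2 fsType values as named in the OpenType specification. */
enum {
    FSTYPE_INSTALLABLE   = 0x0000,
    FSTYPE_RESTRICTED    = 0x0002,
    FSTYPE_PREVIEW_PRINT = 0x0004,
    FSTYPE_EDITABLE      = 0x0008,
    FSTYPE_NO_SUBSETTING = 0x0100,
    FSTYPE_BITMAP_ONLY   = 0x0200
};

/* Read fsType from a raw OS/2 table; returns -1 if the table is too short. */
long os2_fs_type(const unsigned char *os2, size_t len)
{
    if (os2 == NULL || len < 10)
        return -1;
    return ((long)os2[8] << 8) | os2[9];
}

/* Name of the usage permission held in the low four bits. */
const char *fstype_usage_name(uint16_t fs_type)
{
    switch (fs_type & 0x000F) {
    case FSTYPE_INSTALLABLE:   return "Installable embedding";
    case FSTYPE_RESTRICTED:    return "Restricted License embedding";
    case FSTYPE_PREVIEW_PRINT: return "Preview & Print embedding";
    case FSTYPE_EDITABLE:      return "Editable embedding";
    default:                   return "invalid or multiple usage bits";
    }
}

/* Non-zero when the "No subsetting" bit is clear. */
int fstype_allows_subsetting(uint16_t fs_type)
{
    return (fs_type & FSTYPE_NO_SUBSETTING) == 0;
}
```

For the fonts discussed here, os2_fs_type would be expected to return 0x0008, which fstype_usage_name reports as "Editable embedding" with subsetting allowed.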

Yes, "fonts.bin" contains the URW clones of the standard 14 fonts. URW released them under a free license about 20 years ago, as free replacements of the copyrighted Times / Helvetica / Courier (x 4 styles: Roman / Italic / Bold / Bold-Italic), Symbol and ZapfDingbats. So there are exactly 14 of them. On typical Linux systems, you can find them under /usr/share/ghostscript/Resource/Font/, or /usr/share/X11/fonts/Type1/, or both locations.

Redhat commissioned the "Liberation" family of fonts a few years ago, which are better replacements to recent Times / Helvetica / Courier .

HinTak commented 3 years ago

If you have other hidden urls under https://download.cnki.net/ (like you can still blindly download cnkiexpress 0.0.3, although it is not listed / listable), I'd be happy to extract them and put them up too; but I'd prefer not to do so from other sources.

HinTak commented 3 years ago

There is no improvement changing cases in Resource/fontmap.xml to match the on-disk cases.

lelandyang commented 3 years ago

> I have extracted all the windows dlls I have - https://github.com/caj2pdf/ReaderEx-Archive , plus the current linux ones too, and the Resource files. (and two more linux ones from the cnkiexpress). That's 6 32-bit ones, 2 64-bit ones, and 3 linux ones.
>
> Regarding the two versions of 64-bit 2.3.3982.0 - I know one difference: the shipped ImageCodec_x64.dll are identical, but freetype.dll and libcrypto-1_1-x64.dll were unsigned in 0.0.3 but have digital signatures in 0.0.11 (otherwise identical). The dependent libraries being signed would affect how ReaderEx_x64.dll is linked. I am not able to go any further, but it would be interesting to know if they differ besides the two dependent libraries being unsigned in the earlier one and signed in the later one (the signatures would affect memory offsets in linking, at least...).
>
> I have only used the cnkiexpress 0.0.11 version of the 64-bit dlls (3982) and the Viewer 7.3.141 version of the 32-bit dlls (3983). Apparently they switched from depending on libeay32 to libcrypto very recently, between 2.3.3981.0 and 2.3.3982.0 (just one build earlier!). That's a fairly major change, quite recently.
>
> > > One more thing - I thought the software might respect the TrueType "can embed" flag (it is in the OS/2 table), but all the fonts in cajfonts and simsun allow editable embedding, so that's not it. SimSun itself seems to be special within Readex.dll (associated with GB1).
> >
> > The editable embedding you are referring to means embedding the whole TrueType font data; embedding a subset of a TTF makes the PDF not editable. SimSun and Heiti are two CIDFonts, different from TrueType fonts. Only matrix data was stored in the resulting PDF. What is more interesting is that there are a couple of Type 1 fonts shipped in the Resource folder: just unzip fonts.bin and you will find dozens of them.
>
> You need to read the opentype specification (the ISO standard for fonts). Editable Embedding is fsType = 0x0008 in the OS/2 Table. It can take other values such as Installable Embedding, Restricted License embedding, Preview & Print embedding, Editable embedding, No subsetting, Bitmap embedding only; but so far none of the fonts concerned have any other value than 0x0008.
>
> Yes, "fonts.bin" contains the URW clones of the standard 14 fonts. URW released them under a free license about 20 years ago, as free replacements of the copyrighted Times / Helvetica / Courier (x 4 styles: Roman / Italic / Bold / Bold-Italic), Symbol and ZapfDingbats. So there are exactly 14 of them. On typical Linux systems, you can find them under /usr/share/ghostscript/Resource/Font/, or /usr/share/X11/fonts/Type1/, or both locations.
>
> Redhat commissioned the "Liberation" family of fonts a few years ago, which are better replacements to recent Times / Helvetica / Courier .

I misunderstood your point; I thought you were talking about font embedding in PDF documents, which is what I was talking about. I tested these fonts and yes, they can be embedded, but it is likely that, due to licensing issues and PDF size optimization, CNKI chose not to embed certain fonts.

I personally advise choosing a library from 7.3 or above, because those versions have many improvements in rendering pure-text HN files, such as flower rims and more font support.

lelandyang commented 3 years ago

:laughing: I made failed attempts on Linux. [Screenshot: 2021-07-19 16-56-37]

HinTak commented 3 years ago

> 😆 I made failed attempts on Linux. [Screenshot: 2021-07-19 16-56-37]

Source and dest path are both file names; extname does not matter.

Sorry, I meant to quote-reply instead of edit. Anyway, as I wrote, those two are not directories but just file names. Extname is ignored or something.

HinTak commented 3 years ago

I.e. "/someplace/in.caj", "", "/tmp/out.pdf" for the three inputs.

ghost commented 3 years ago

> If you have other hidden urls under https://download.cnki.net/ (like you can still blindly download cnkiexpress 0.0.3, although it is not listed / listable), I'd be happy to extract them and put them up too; but I'd prefer not to do so from other sources.

Windows, x86, 2.0.3949.0 Windows, x86, 2.3.3982.0

> Extname is ignored or something.

CAJ is not a single file type. It is a sort of "file type system". For HN, KDH, CAJ (here this refers to the file type, rather than the file extension), and TEB, magic numbers already indicate the file type. However, for files with magic number C9A8B6FEB4F3B1B1 (GB2312-1980 扫二大北? Does anybody know its meaning?), the file extension is used together with it to determine the document type.

lelandyang commented 3 years ago

> I.e. "/someplace/in.caj", "", "/tmp/out.pdf" for the three inputs.

Thank you, now it works. :smile: Perhaps it is better to use CAJ FileName and Destination FileName to indicate the input and output.

HinTak commented 3 years ago

I am not too bothered by either issue. The viewer is capable of auto-detect, so I know extname is not used for input type. It is just a windows programming convention for some GUI file dialog to only show certain files, I think. If it were up to me, I would test if (argc > 2) and just pass argv[1] and argv[2] along, and not prompt and wait for user to type anything.
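Something along these lines is what I have in mind; a sketch only, with struct Paths and paths_from_args being made-up names rather than anything in caj-convert.c. The caller would copy src and dest into the Parameter structure and leave extname as an empty string.

```c
#include <stddef.h>

struct Paths {
    const char *src;   /* input file name, e.g. "/someplace/in.caj" */
    const char *dest;  /* output file name, e.g. "/tmp/out.pdf" */
};

/*
 * Take the input and output file names straight from the command line.
 * Returns 0 and fills *paths when both names are present, -1 otherwise,
 * so the caller can print a usage message instead of prompting.
 */
int paths_from_args(int argc, char **argv, struct Paths *paths)
{
    if (argc <= 2 || argv == NULL || paths == NULL)
        return -1;
    paths->src = argv[1];
    paths->dest = argv[2];
    return 0;
}
```

In main this would be called as paths_from_args(argc, argv, &paths), exiting with an error if it returns -1.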

Let's not forget this is just a way to study how it works, to improve caj2pdf. There is no need to make it user-friendly. What might be more interesting is to see if the windows ordinal changes at all; and how the result changes across the different versions.

HinTak commented 3 years ago

I'd prefer not to look at random attached dlls; and neither should you. It must come from an official location.

lelandyang commented 3 years ago

> I'd prefer not to look at random attached dlls; and neither should you. It must come from an official location.

OK, as you wish. But you have to understand what is a digital signature. B.T.W. It IS from the official location, not "random attached", mind your wordings.

HinTak commented 3 years ago

It is "authenticode". As I commented above, the two versions of 64-bit 2.3.3982.0, differ by at least the two dependent dlls not being signed in one case (0.0.3) but signed in 0.0.11. So cnki does not always sign their software; and I prefer not to check such. The git log of the archive contains where each of the dlls come from, and the installer's md5sum too; for the same reason - you shouldn't trust dlls I upload either, but you should be able to get the same dlls from the recorded location and verify that they are the same.

HinTak commented 3 years ago

I am not sure what is the purpose of deleting a comment - its content is also e-mailed out.

HinTak commented 3 years ago

I found a lot of past versions still under cnki.net, by searching forums for posted URLs. These URLs worked a few minutes ago:

http://viewer.d.cnki.net/CAJViewer%207.0.1.sfx.exe
http://viewer.d.cnki.net/CAJViewer%207.0.2_Eng.self.exe
http://viewer.d.cnki.net/CAJViewer%207.0.2.self.exe
http://viewer.d.cnki.net/CAJViewer%207.1.2.self.exe
http://viewer.d.cnki.net/CAJViewer%207.2.self.exe
http://viewer.d.cnki.net/CAJViewer%207.3.self.exe

%20 is a URL-escaped space, so these are 7.0.1, 7.0.2; I do not know how 7.2 / 7.3 compare with the 7.2.0.117 and 7.3.131/7.3.133 from cajviewer.cnki.net .

On a different note, I would like to delete https://github.com/caj2pdf/ReaderEx-Archive , or at least put it under my own account (it seems although I could create it under the caj2pdf org, I don't have enough privilege to remove / move repos out of this org). I do not want to accept any DLLs from anybody else. And vice versa - it is for my private reference only, and considered read-only or checksum-comparison for everybody else . And you should download your own from official sources, and unpack yourself.

JeziL commented 3 years ago

@HinTak The member privileges settings have been adjusted. Members with admin permissions for the repo will be able to delete or transfer public and private repos.

> it seems although I could create it under the caj2pdf org, I don't have enough privilege to remove / move repos out of this org

lelandyang commented 3 years ago

> I am not sure what is the purpose of deleting a comment - its content is also e-mailed out.

I do not mind whether it is e-mailed out or not; it is not a felony. I just removed 3 comments related to the reply concerning the "random dlls".

HinTak commented 3 years ago

@JeziL Thanks. I'll move it out soon.

I think we, or at least I personally, do not want to encourage people to use DLLs downloaded from GitHub. It is not a technically safe practice (virus/malware etc.), and legally they are not supposed to be redistributed either. So they are really only for checksum comparison against copies downloaded independently, for the sake of discussion.

HinTak commented 3 years ago

> Still remember pfnFILE in Parameter structure? It is used for debugging, I think. Use

Thanks for the code snippets. The file-ops hooks need to be used within their code (instead of direct fread/etc.) for this to be useful, though. But definitely worth having.

HinTak commented 3 years ago

> However, for files with magic number C9A8B6FEB4F3B1B1 (GB2312-1980 扫二大北? Does anybody know its meaning?), the file extension is used together with it to determine the document type.

Is that not obvious? It is "北大二扫" ("Peking University Second Scan") in little-endian, so it is reversed.
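To spell the reversal out: the magic is four GB2312 double-byte characters, and it is the order of the characters that is flipped, not the bytes inside each character. A small sketch (reverse_double_bytes is a made-up helper name):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Reverse a buffer in 2-byte units (one GB2312 character each), leaving the
 * byte order inside each unit untouched. len must be even, and in and out
 * must not overlap.
 */
void reverse_double_bytes(const unsigned char *in, unsigned char *out, size_t len)
{
    size_t units = len / 2;

    assert(in != NULL && out != NULL && len % 2 == 0);
    for (size_t i = 0; i < units; i++) {
        size_t j = units - 1 - i;   /* mirrored character position */

        out[2 * j]     = in[2 * i];
        out[2 * j + 1] = in[2 * i + 1];
    }
}
```

Feeding it C9 A8 B6 FE B4 F3 B1 B1 gives B1 B1 B4 F3 B6 FE C9 A8, which reads as 北大二扫 when interpreted as GB2312.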

JeziL commented 3 years ago

Second scan (二扫) is a term used in the field of typography. The first scan (一扫) checks whether the manuscripts (小样) are entered correctly according to the specification, and the second scan (二扫) converts the manuscript into a file (大样) with output commands that the program can understand (kind of like the concept of PDF). As for Peking University (北大), it actually refers to Peking University Founder Group Co. Ltd (北大方正集团有限公司), which is the assignee of the patent of this format, not the university itself.