buda-base / digitization-guidelines

Digitization guidelines and tutorials for the BUDA project
https://buda-base.github.io/digitization-guidelines/

jpeg vs progressive jpeg #4

Open ngawangtrinley opened 4 years ago

ngawangtrinley commented 4 years ago

@eroux and @TBRC-JimK, the scanners we use in China don't have Zip compression for Tiff, while LZW is only available for 8 bits.

I believe we want to scan in Tiff 24 bits rather than 8 bits (16 is a no-no), or do we:

TIFF 24 bits presents the following compression options, which one do we prefer:

We did analyse the pros and cons of various options and J2K seemed to be the best option at the time, check the documentation here.

eroux commented 4 years ago

24bit total = 8 bit per channel. Is the LZW compression available for it? (it's not super clear in your comment)

For remarkable artefacts, I still think we could use 48bit = 16 bit per channel, but that can be considered optional. 24 bit per channel is quite excessive.

If LZW and Zip are not available, then none. I suppose lossless jpeg2000 is not unreasonable if there's no other option...

ngawangtrinley commented 4 years ago

8 bits and 24 bits are two different options, so that must be 24 per channel. 16 isn't on the menu. LZW is not available for 24 bits.

eroux commented 4 years ago

ok, I suppose 8bit = one channel and 24 bit = three channels then. What are the options available outside of tiff?

(also, to answer one of your questions, progressive is better for large images but it's really a detail)

ngawangtrinley commented 4 years ago

@jeehuajian will post some screenshots of the setting menus tomorrow. Progressive JPEG is nearly half the size of regular JPEG and it visually looks clearer, so if there isn't any issue with progressive we might want to go for it.

eroux commented 4 years ago

ok thanks! half the size raises eyebrows... it's supposed to be only slightly smaller... what's the problem with producing uncompressed tiffs that can then be zip-compressed with xnview?

jimk-bdrc commented 4 years ago

Before you set a standard, run the XnView results through audittool. It can’t read everything, and I’ve seen NT create files audittool can’t read.

If you can figure out how to duplicate that problem, please let me know.

jimk-bdrc commented 4 years ago

This is a scoping question which may be too late: If you are using scanners, does this mean you are scanning printed material? If so, why not retain the old standard of binary TIFF with LZW compression for the vast majority of the pages, and keep archival quality TIFF for the front and back material, and any other color illustrations? In a 100 page book, it really doesn’t matter if 5 or 10 pages are uncompressed.

eroux commented 4 years ago

I agree, for generally black and white stuff, gray tiff in lzw is the best option. I think we're missing too much information to give a reasonable answer... what problem are we trying to solve? what's the context? what are the limitations of the users, of their machines, of the software, etc.?

ngawangtrinley commented 4 years ago

Scanners are used for both modern prints and pechas. They are used for everything as long as the paper isn't cardboard style. The staff on the ground has cameras but prefers Fujitsu scanners by far. For black and white material they used to scan straight to G4, which we then replaced by j2k as a single lossless color format for all archive images. Web images were then derived into G4 or JPEG depending on the content.

The problem we're trying to solve now is deciding what we replace j2k with. We could go back to 3 formats/compression for color, grayscale and BW. This matters since the scanner and software tutorials will cover the scanning-time settings. The constraints are simplicity, file size, and processing time.

eroux commented 4 years ago

So they discriminate between three cases (color, gray and bitonal), that's interesting... is lzw available in tiff for bitonal or just G4? Would something like j2k for color and lzw for gray and bitonal be simple enough?

Drongbulobsang commented 4 years ago


Screenshots of the scanner setting

eroux commented 4 years ago

just to be sure, can you send me a j2k that the scanner produces? I want to check if it's lossless or if they encode it in a lossy way... thanks!
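One quick way to run that check without decoding the image is to read the wavelet transform flag from the codestream header. The sketch below assumes the standard JPEG 2000 Part 1 layout, where the COD marker segment (0xFF52) in the main header ends its fixed fields with a transformation byte (1 for the reversible 5-3 filter, 0 for the irreversible 9-7 filter). The function name is hypothetical. A reversible filter is necessary for lossless encoding but not sufficient, since an encoder can still discard data after the transform, so a pixel-by-pixel comparison against an uncompressed scan remains the real test.

```python
import struct

# Marker codes from the JPEG 2000 codestream syntax.
SIZ, COD, SOT = 0xFF51, 0xFF52, 0xFF90


def j2k_wavelet_is_reversible(data: bytes):
    """Return True for the 5-3 reversible filter, False for 9-7 irreversible,
    or None when no COD segment is found in the main header."""
    # The codestream starts with SOC (FF4F) immediately followed by SIZ (FF51).
    # Searching for both also works for .jp2 files, where boxes come first.
    start = data.find(b"\xff\x4f\xff\x51")
    if start < 0:
        return None
    pos = start + 2  # skip SOC, which has no length field
    while pos + 4 <= len(data):
        marker, length = struct.unpack_from(">HH", data, pos)
        if marker == SOT:  # first tile-part reached: main header is over
            return None
        if marker == COD:
            # marker(2) + Lcod(2) + Scod(1) + SGcod(4) + levels, code-block
            # width, height, style (4) puts the transformation at offset 13.
            if pos + 13 >= len(data):
                return None
            return data[pos + 13] == 1
        pos += 2 + length  # the length field counts itself, not the marker
    return None
```

Typical use would be `j2k_wavelet_is_reversible(open("sample.jp2", "rb").read())`, where `sample.jp2` stands in for a file produced by the scanner.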

eroux commented 4 years ago

After exploring many possibilities, it seems the only one that is:

  • usable with the Fujitsu scanner
  • usable by people on the field
  • best quality

is color jpeg2000 for archives (then usual stuff for web)

ngawangtrinley commented 4 years ago

After a few days of intense testing, here's what decided the final winner: https://github.com/buda-base/digitization-guidelines/wiki/J2K-vs-Tiff-no-compression

The final decision for images produced with Fujitsu scanners is:

Resizing is based on ། size (for OCR, the minimum character height is 20 pixels and the optimal is 40 pixels):

། heights are measured on the archive images and they inform the resizing % ratio:

image Here the height is 100 pixels, which means that the resizing ratio should be 60%.
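As a rough sketch of how a measured ། height could be turned into a resizing ratio: the thread gives one data point (100 pixels maps to 60%) without the formula behind it, so the target height below is a parameter and not a fixed project value. The function name and the default of 40 pixels (the OCR optimum quoted above) are assumptions for illustration.

```python
def resize_ratio(measured_height_px: float, target_height_px: float = 40.0) -> float:
    """Scale factor that brings a measured character height to the target.

    Assumption: the ratio is simply target / measured, and images are never
    enlarged, so the result is capped at 1.0.
    """
    if measured_height_px <= 0 or target_height_px <= 0:
        raise ValueError("heights must be positive")
    return min(1.0, target_height_px / measured_height_px)
```

With a measured height of 100 pixels, a target of 40 pixels gives a ratio of 0.4 and a target of 60 pixels gives 0.6; which of these matches the project's table depends on the target it actually uses.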

ngawangtrinley commented 4 years ago

Elie's intuition for uncompressed tifs and everyone's lack of enthusiasm for j2k win!

Please refer to: https://github.com/buda-base/digitization-guidelines/issues/4#issuecomment-595316008

Using a unique scan-time format avoids a lot of issues so it should definitely be kept.

Jim you might be happy to learn that we now only have two variables for the web image derivation:

This should allow the audit tool to generate the derivatives from the scanner output. The ། height represents the mean char height and can be replaced by any other frequent character for other languages.

Thanks for your input! NT

On Wed, Mar 4, 2020 at 9:30 PM Elie Roux notifications@github.com wrote:

After exploring many possibilities, it seems the only one that is:

  • usable with the Fujitsu scanner
  • usable by people on the field
  • best quality

is color jpeg2000 for archives (then usual stuff for web)


ngawangtrinley commented 4 years ago

An easy way to make a derivation script would be to ask field staff to add a suffix to images that need to be converted to color, something like image123x.tif for image123.tif. With this and a command line interface script that takes in the source images path + the ། height in pixels we would be good to go.
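A minimal sketch of what such a script's planning step might look like, using only the standard library. It does no image conversion; it only lists each source file with the mode implied by its name and the resize ratio implied by the ། height. Everything here is an assumption for illustration: the names `plan_derivatives` and `main`, the `*.tif` glob, the 40 pixel target, and the "non-color" label for files without the suffix. Note that a plain suffix test also flags any file whose name happens to end in "x", so a real convention may need a less ambiguous marker.

```python
import argparse
from pathlib import Path

COLOR_SUFFIX = "x"  # from the example above: image123x.tif marks a color page


def plan_derivatives(source_dir: Path, shad_height_px: float,
                     target_height_px: float = 40.0):
    """Return (file name, mode, resize ratio) for every .tif in source_dir."""
    if shad_height_px <= 0:
        raise ValueError("the measured height must be positive")
    # Assumed formula: shrink so the measured height reaches the target,
    # never enlarging the image.
    ratio = min(1.0, target_height_px / shad_height_px)
    plan = []
    for tif in sorted(source_dir.glob("*.tif")):
        mode = "color" if tif.stem.endswith(COLOR_SUFFIX) else "non-color"
        plan.append((tif.name, mode, ratio))
    return plan


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="List web derivatives to generate from scanner output.")
    parser.add_argument("source_dir", type=Path,
                        help="folder containing the archive .tif images")
    parser.add_argument("shad_height_px", type=float,
                        help="measured height of the shad character in pixels")
    args = parser.parse_args(argv)
    for name, mode, ratio in plan_derivatives(args.source_dir,
                                              args.shad_height_px):
        print(f"{name}\t{mode}\t{ratio:.0%}")
```

Calling `main()` from the usual `if __name__ == "__main__":` guard would turn this into a command line tool taking the two arguments described above; the actual conversion step would replace the `print`.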

(I know, we should have figured this out 10 years ago)