Data Science Tools: NIR images

AdamDimech commented 1 year ago

I need some help, but I am not really sure what I am doing:

We have a hyperspectral camera in our LemnaTec system. I am able to use Date Science Tools (LT-db-extractor.py) to download the files, but they come out as a series of blob files, when I'd be expecting files such as:

darkReference
darkReference.hdr
frameIndex.txt
image.jpg
raw
raw.hdr
settings.txt
whiteReference
whiteReference.hdr

I have some questions:

Does LT-db-extractor only download files from a LemnaTec database as blobs, or should it also be decrypting the files into a useful format (like it can convert blobs into PNG images for visible-spectrum images)?
If the answer to 1 is no, then how would I determine what format the blobs are in to decode them.

I have attempted 2: the blobs will unzip into a folder containing a file called data. The problem then arises as to what to do with data because running file /grop/path/to/data on Linux says that data is "Arhangel archive data". I have no idea what to do with this.

This part of LT-db-extractor.py seems to be where the magic happens, but at least in my case it doesn't seem to be doing anything at all:

    raw_rescale = None
    if raw_images[image]['dataformat'] == 4:
        # New NIR camera data format (16-bit)
        if len(img_str) == (db['nir_height'] * db['nir_width']) * 2:
            raw = np.fromstring(img_str, dtype=np.uint16,
                                count=db['nir_height'] * db['nir_width'])
            if np.max(raw) > 4096:
                print("Warning: max value for image {0} is greater than 4096.".format(image))
            raw_rescale = np.multiply(raw, 16)
        else:
            print("Warning: File {0} containing image {1} seems corrupted.".format(local_file, image))

How are others managing this?

BTW I am using a customised version of LT-db-extractor. Any help would be most appreciated.

nfahlgren commented 1 year ago

Hi @AdamDimech, the program initially downloads the blob files but then should extract the raw image out and convert it to a PNG and removes the blob. When that fails it leaves the blob file behind.

Over time we have had RGB, 8-bit grayscale, and 16-bit grayscale data types on our system, so the program likely isn't doing the correct kind of conversion for your data.

I'm happy to help work out the protocol (if I can). If you can do the SQL query on line 108 of your modified program and save the results to a file we could use some of the information in the results to help. Really we just need the line corresponding to one blob file.

The things we need to figure out are the height, width, and depth (number of wavelengths/bands), the data type (8-bit, 16-bit, etc.), and the interleave type (BIL, BIP, BSQ). These assume the data are stored as a single data cube, versus individual grayscale images.

We can also get a sense of what is in the blob file by reading in the data file after you unzip it.

import numpy as np

# You can try different data types np.uint8, np.uint16, etc.
raw_data = np.fromfile("data", np.uint16, -1)

# The length should be evenly divisible by height*width*depth
print(len(raw_data))

AdamDimech commented 1 year ago

Thanks for the help @nfahlgren

I am a bit uncertain about the depth. I have attached the file as requested: H2023015-Output.csv

I know that the output files are supposed to be 16-bit. But I cannot get the mathematics to work as you describe in your piece of code given that width is 320 and height is 247 and there seem to be 229 bands, excluding the colour preview and meta information.

The specific SQL query that I ran was this: SELECT *, image_file_table.path as FilePath FROM snapshot INNER JOIN tiled_image ON snapshot.id = tiled_image.snapshot_id INNER JOIN tile ON tiled_image.id = tile.tiled_image_id INNER JOIN image_file_table ON tile.raw_image_oid = image_file_table.id WHERE measurement_label = 'H2023015' AND dataformat!=9 AND time_stamp<='2023-01-01'

The dataformat!=9 is to omit visible-spectrum images. I hope this isn't an issue?

nfahlgren commented 1 year ago

Hi @AdamDimech, the CSV file is definitely helpful. What I can see in it is that each wavelength is stored as a separate file.

Assuming it is a 16-bit image, this should work:

from plantcv import plantcv as pcv
import numpy as np

height = 247
width = 229
with open("data") as fp:
    # Raw image data as a string
    img_str = fp.read()
    raw = np.fromstring(img_str, dtype=np.uint16, count= height * width)
    img = raw.reshape((height, width))
    pcv.plot_image(img)  # alternatively use matplotlib

If you get an error from np.fromstring about not enough data, then that would suggest it might be np.uint8 instead. Or something else.

If it processes the raw image but the plot looks jumbled, then the image may still be interlaced and might need to be reshaped in a different way that we can test.

Once we figure this part out, we would need to make a PlantCV function that can read the multiple images and create a hyperspectral data structure so that you can do downstream analysis with the data cube.

AdamDimech commented 1 year ago

Thanks for this @nfahlgren.

This isn't working. An error is created at the following line: img_str = fp.read()

The error reads: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 280: character maps to <undefined>

I played with the code and this version seemed to work (insofar as I was able to get some sort of image), but note that I had to change np.fromstring() to np.frombuffer() to eradicate an error that said "DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead"

from plantcv import plantcv as pcv
import numpy as np

height = 247
width = 229
with open("c:/data/projects/hyperspec/img/unzip/data", 'rb') as fp:
    img_str = fp.read()
    raw = np.frombuffer(img_str, dtype=np.uint16, count= height * width)
    img = raw.reshape((height, width))
    pcv.plot_image(img)

Here is the result:

If I use this code, with the old np.fromstring() and "utf-16-le" encoding, I get this:

from plantcv import plantcv as pcv
import numpy as np

height = 247
width = 229
with open("c:/data/projects/hyperspec/img/unzip/data", encoding="utf-16-le") as fp: #utf-16-le
    img_str = fp.read()
    raw = np.fromstring(img_str, dtype=np.uint16, count= height * width)
    img = raw.reshape((height, width))
    pcv.plot_image(img)

Changing the dtype in the above code to uint8 results in:

This is becoming tricky. I have attached a blob file which may be of help: blob23211067.zip

You'll need to remove the .zip extension from it. Without it, I was unable to share it on GitHub.

nfahlgren commented 1 year ago

Thanks @AdamDimech, I got a bit closer using what you posted above. I realized in the SQL query results that the image dimensions vary. This must be a line scan camera so the height is variable depending on the length of the scan.

img_str has a length of 240640. Width is fixed at 320px. So from that I deduce that the height of this image must be 376, anything larger is larger than the buffer size when using uint16.

I can see the plant in the resulting image. It still looks noisy. Something is either still wrong or it's possible this individual frame actually looks like this (I think some channels can look odd on their own).

AdamDimech commented 1 year ago

Hi @nfahlgren

What an adventure. It turns out that your dataformat = 4 is my dataformat = 5 🤷

If you can shed any light on what all the various numbers mean, that's be most helpful. So far I have determined: 0 = 8-bit Hyperspectral images from your (old) system (maybe?) 2 = Colour preview (I am yet to generate this via Data Science Tools), but presumably RGB 4 = Danforth Centre's hyperspectral images 5 = My hyperspectral images 9 = Visible spectrum images 12 = Meta information (which I also need to decode, possibly)

This matters because LT-db-extractor.py was unable to grab any images at all (for obvious reasons). There is also something different with our LemnaTec database which means that LT-db-extractor.py was collecting width and height measurements from RGB images for some reason, so of course nothing worked.

Anyway, I made some changes to collect the correct heights and widths:

raw_images[image_name] = {'raw_image_oid': row['raw_image_oid'], 'rotate_flip_type': row['rotate_flip_type'], 'dataformat': row['dataformat'], 'blobpath': file_path, 'nir_h': row['height'], 'nir_w': row['width']}

So this is how my code for downloading the hyperspectral images now looks:

elif 'NIR' in image or 'nir' in image:
    raw_rescale = None
    if raw_images[image]['dataformat'] == 5:
        # New NIR camera data format (16-bit)
        nir_w = raw_images[image]['nir_w']
        nir_h = raw_images[image]['nir_h']
        if len(img_str) == (nir_h * nir_w) * 2:
            raw = np.frombuffer(img_str, dtype=np.uint16,
                                count=nir_h*nir_w)
            if np.max(raw) > 4096:
                print("Warning: max value for image {0} is greater than 4096.".format(image))
            raw_rescale = np.multiply(raw, 16)
        else:
            print("Warning: File {0} containing image {1} seems corrupted.".format(local_file,
                                                                                    image))
    elif raw_images[image]['dataformat'] == 0:
        # Old NIR camera data format (8-bit)
        if len(img_str) == (nir_h * nir_w):
            raw_rescale = np.frombuffer(img_str, dtype=np.uint8,
                                        count=nir_h*nir_w)
        else:
            print("Warning: File {0} containing image {1} seems corrupted.".format(local_file,
                                                                                    image))
    if raw_rescale is not None:
        raw_img = raw_rescale.reshape((nir_h, nir_w))
        if raw_images[image]['rotate_flip_type'] != 0:
            raw_img = rotate_image(raw_img)
        cv2.imwrite(os.path.join(snapshot_dir, image + ".png"), raw_img)
        os.remove(local_file)

The problem(s) that I am now facing are:

Some of the images seem to be 'saturated'.
I am getting a warning on about half of them saying "Warning: max value for image [name] is greater than 4096."

Here are some samples: H2023015c20r5n100_2023-06-09 02-40-58_H2023015_NIR-HS-0_100399700_1117_38

I am uncertain whether this is the same issue or not. A colleague (who is far more experienced with hyperspectral images) told me that the "saturation" is direct radiance and that I need to convert DN to reflectance. Is the warning caused by the pixel values exceeding what is permitted in uint16, or is something else at play?

AdamDimech commented 1 year ago

I have also noticed that the first two rows of pixels seem to have either come from another image or are taken from the bottom rows of the images. Any ideas what might be causing this? unordered_rows

AdamDimech commented 1 year ago

And one final question: I know that our images should be in ENVI format, but what I seem to end-up with is 16-bit PNGs. Is there a method for recovering the files from the database in ENVI format via PlantCV? That is what I really want to be able to achieve.

LemnaTec have provided a horrible proprietary tool for this, but I'd prefer to use PlantCV or an open-source tool if I can.

danforthcenter / plantcv

Data Science Tools: NIR images #1296