feature request: handle fits.gz files

teuben commented 4 years ago

Some archives deliver their data in fits.gz format, in the case I was using, they were able to get the sizes down to 50%. Both fv and ds9 handle this transparently, and we're going to miss the boat of community acceptance if carta doesn't handle this transparently. Yes, this means I had a case where I think carta does a better job than ds9, but I resorted to ds9 because I had too many files and didn't want to uncompress them. In ds9 it takes 8 steps (clicking/selecting) to get a spectrum on the screen. In carta it takes 1 step. (Arguably I would love to see that go to 0: if you are loading a cube, why give people an X and Y scan on the right but not the more obvious Z scan? Or give all 3.)

kswang1029 commented 4 years ago

You can set the default layout to “cube view”, which has x, y and z scans. Or you can design your own layout, save it, and make it the default via the preferences dialogue.

kswang1029 commented 4 years ago

In addition, we have a plan to implement an option called “smart” layout which loads user-defined layouts based on image dimension/type.

veggiesaurus commented 4 years ago

> Some archives deliver their data in fits.gz format, in the case I was using, they were able to get the sizes down to 50%. Both fv and ds9 handle this transparently, and we're going to miss the boat of community acceptance if carta doesn't handle this transparently.

I agree that we should support this. Unless I'm mistaken, compressed FITS files are handled correctly by cfitsio. So we might just need to investigate what's stopping casacore from supporting this. @teuben can you send us an example file? I remember trying some fits.gz files that worked properly in CARTA, and others that didn't.

teuben commented 4 years ago

OK, here's a good challenge. I have two files in my data directory, https://www.astro.umd.edu/~teuben/data/. The cube is hidden as an image in the first extension; the m0 file is a true classic FITS file. Salient detail: ds9 will not correctly overlay (match WCS) these two, so I will need to contact the ds9 author about this. These data are from Herschel, which has a pretty complex hierarchy of fits.gz files, where some FITS files are tables referring to the data FITS files. Nutty.

pford commented 4 years ago

Moved issue from carta to carta-backend

pford commented 4 years ago

Giving an estimate of 0 story points since it is a duplicate of #384, but keeping this ticket for test images.

veggiesaurus commented 4 years ago

@pford I'm not too clued up on all the FITS conventions, but as far as I know, there are actually two conventions here: one is literally a FITS file that has been gzipped, and another is a fits.gz file, which is slightly different, based on the examples that we have been provided by users.

teuben commented 4 years ago

@veggiesaurus they really better be exactly the same. I cannot imagine there being two types of those.

pford commented 4 years ago

@veggiesaurus @teuben the CASA image opener recognizes .fz files created with fpack as FITS, but .gz is UNKNOWN type so CARTA will have to handle these file types differently. I do not see anything in the casacore code which recognizes and supports gzipped files. Boost supports this but is not currently a required library. I will investigate C++ STL next, perhaps std::filesystem.
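One way to tell these cases apart without trusting the file extension is to look at the leading bytes. The sketch below is illustrative Python, not CARTA or casacore code, and the function name `sniff_image_type` is hypothetical. It relies on two facts: every gzip stream starts with the bytes 0x1f 0x8b, and a FITS primary header starts with the `SIMPLE` keyword.

```python
GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip member (RFC 1952)
FITS_MAGIC = b"SIMPLE  ="  # start of the first card of a FITS primary header


def sniff_image_type(path):
    """Classify a file as 'gzip', 'fits' or 'unknown' from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(len(FITS_MAGIC))
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head == FITS_MAGIC:
        return "fits"
    return "unknown"
```

A file reported as "gzip" would still need its decompressed content checked for the FITS signature before being treated as an image.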

Kechil commented 3 years ago

Removed duplicate to prevent the issue being closed and because it has some use.

kswang1029 commented 3 years ago

test images are available via #682

kswang1029 commented 3 years ago

Based on @pford's findings, it appears that we will need to decompress the fits.gz file into RAM before we can access the image data. This implies there is no scalability for large images. Also, based on @veggiesaurus's comments, there appears to be a way to get the decompressed size of a fits.gz without actually decompressing it. The info may be available from the gzip trailer of the fits.gz. So a possible solution is to set a threshold on the decompressed size, so that only fits.gz files whose decompressed size is below the threshold can be loaded in CARTA. This should still support the use case to some extent without messing up the resource management of a server.

This might be useful

$ gunzip -l syslog.1.gz
     compressed        uncompressed  ratio uncompressed_name
        4465670            33295551  86.6% syslog.1

Would this work? @veggiesaurus @pford @teuben @Jordatious @jott3077
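For reference, the number in the "uncompressed" column has traditionally come from the ISIZE field, which the gzip format stores in the last four bytes of the file as a little-endian integer. A minimal Python sketch of the threshold idea follows; the helper names `gzip_isize` and `within_threshold` are made up for illustration and this is not backend code. Two caveats apply: ISIZE is the size modulo 2^32, so it wraps around for data of 4 GiB or more, and it is simply whatever the file's creator wrote, so it cannot be trusted as a safety check on its own.

```python
import struct


def gzip_isize(path):
    """Return ISIZE: the uncompressed size of the last gzip member, modulo 2**32."""
    with open(path, "rb") as f:
        f.seek(-4, 2)  # ISIZE occupies the final 4 bytes of the file
        (isize,) = struct.unpack("<I", f.read(4))
    return isize


def within_threshold(path, max_bytes):
    """True if the size recorded in the trailer does not exceed max_bytes."""
    return gzip_isize(path) <= max_bytes
```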

jott3077 commented 3 years ago

Is there any option to write to temporary disk space when the file is too big for RAM?

veggiesaurus commented 3 years ago

How is this different from loading up a very large 2D image?

kswang1029 commented 3 years ago

I guess we have no idea on how many channels the fits.gz file contains. For example, if it is a single channel image with 16000x16000 pixels, it is ~1GB in size (after decompression) but if it is 16000x16000x1000(channel) then it is 1TB.
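The arithmetic behind those two figures, written out as a small Python sketch. It assumes 32-bit floating-point pixels (BITPIX = -32), which the comment above does not state, and the function name `fits_data_bytes` is only illustrative.

```python
from math import prod


def fits_data_bytes(bitpix, *axes):
    """Size in bytes of a FITS data array: |BITPIX| / 8 * NAXIS1 * NAXIS2 * ..."""
    return abs(bitpix) // 8 * prod(axes)


single_channel = fits_data_bytes(-32, 16000, 16000)        # 1_024_000_000 bytes, ~1 GB
full_cube = fits_data_bytes(-32, 16000, 16000, 1000)       # 1_024_000_000_000 bytes, ~1 TB
```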

It is true that at the moment we can try to load a large single-channel FITS/CASA/HDF5 image whose file size is larger than the physical memory. So maybe the backend should check the available free memory (or total system memory) before loading an image and show warnings when appropriate, as a general treatment for loading extremely large images. 🤔 If we don't set an upper limit, then I can make a gz file of a large cube and load it into CARTA (all channels in RAM) and enjoy best performance 😂
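A rough sketch of such a check, in Python for brevity rather than as backend code. The function names and the 0.8 fraction are assumptions chosen for illustration. `os.sysconf` with these names is available on Linux and most Unix systems but not everywhere, so the sketch falls back to "unknown" and warns in that case.

```python
import os


def total_physical_memory():
    """Total RAM in bytes, or None where os.sysconf cannot report it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


def should_warn(image_bytes, memory_bytes, fraction=0.8):
    """True when the image would use more than `fraction` of the given memory."""
    if memory_bytes is None:
        return True  # memory size unknown: err on the side of warning
    return image_bytes > fraction * memory_bytes
```

For example, `should_warn(fits_size, total_physical_memory())` could be evaluated before the image is opened.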

veggiesaurus commented 3 years ago

The only thing I really worry about is maliciously created "zip bombs" that decompress from very small files into enormously large ones.
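One common defence is to cap the output while decompressing instead of trusting any size recorded in the file. The following is a minimal Python sketch using the standard `zlib` module; the names `bounded_gunzip` and `DecompressedSizeExceeded` are hypothetical, and it handles only the first gzip member of a stream.

```python
import zlib


class DecompressedSizeExceeded(Exception):
    """Raised when the decompressed output grows past the allowed limit."""


def bounded_gunzip(stream, max_bytes, chunk_size=65536):
    """Decompress a gzip stream, refusing to produce more than max_bytes."""
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # +16 selects the gzip wrapper
    out = bytearray()
    while not decomp.eof:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        while chunk:
            # Ask for at most one byte beyond the remaining budget, so an
            # oversized stream is detected without inflating it fully.
            budget = max_bytes - len(out) + 1
            out += decomp.decompress(chunk, budget)
            if len(out) > max_bytes:
                raise DecompressedSizeExceeded(f"output exceeds {max_bytes} bytes")
            chunk = decomp.unconsumed_tail
    if not decomp.eof:
        raise EOFError("gzip stream ended before its end-of-stream marker")
    return bytes(out)
```

Because the limit is enforced on the actual output, a tiny file that expands enormously is rejected after at most `max_bytes + 1` bytes have been produced.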

veggiesaurus commented 3 years ago

Actually, it would be possible to simply read the first channel of the image. It's just random access that is tricky. We could read the channels sequentially if necessary.

The process would be:

Decompress the first N bytes to read and parse the header. Then determine the slice size, and decompress the rest of the file only up to the size of one slice. If a user wanted to do anything involving random access, we'd need to decompress the whole thing.
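The process above can be sketched in Python with the standard `gzip` module, which decompresses sequentially as bytes are requested. This is an illustration under assumptions rather than the backend's implementation: the function names are hypothetical, the image is assumed to be in the primary HDU with the channel axis third, and the header parser handles only simple numeric keywords.

```python
import gzip

BLOCK = 2880  # FITS headers and data are padded to 2880-byte blocks
CARD = 80     # each header card is 80 ASCII characters


def read_header(stream):
    """Read header blocks up to the END card; return {keyword: raw value string}."""
    cards = {}
    while True:
        block = stream.read(BLOCK)
        if len(block) < BLOCK:
            raise ValueError("truncated FITS header")
        for i in range(0, BLOCK, CARD):
            card = block[i:i + CARD].decode("ascii")
            key = card[:8].strip()
            if key == "END":
                return cards
            if card[8:10] == "= ":
                # Drop the trailing comment; adequate for numeric values only.
                cards[key] = card[10:].split("/")[0].strip()


def read_first_channel(path):
    """Decompress only the header and the first 2D plane of a gzipped FITS cube."""
    with gzip.open(path, "rb") as stream:
        header = read_header(stream)
        bytes_per_pixel = abs(int(header["BITPIX"])) // 8
        plane_bytes = int(header["NAXIS1"]) * int(header["NAXIS2"]) * bytes_per_pixel
        plane = stream.read(plane_bytes)
        if len(plane) != plane_bytes:
            raise ValueError("file ends before the first channel is complete")
    return header, plane
```

Reading channel k the same way means decompressing everything before it, which is why random access remains the expensive case.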

teuben commented 3 years ago

A few considerations that may affect decisions: 1) I just learned this week that there is a funky way to compress FITS files using the fpack/funpack tool, which on e.g. Ubuntu is in a package called libcfitsio-bin. This gives the FITS file a new extension, ".fz". I tried a 32 MB FITS cube, where gzip made it 30 MB; their fpack program made it 5.1 MB!!! I noticed that ds9 supports this format. fpack/funpack don't seem to work in pipes, so at best memory maps can be used? [edited: I see the .fz extension is in the sample set now]

2) If you decide to use a temporary directory when decompression doesn't work in memory, there should be some method by which the user can override where this happens on disk. MIRIAD uses $TMPDIR (often used in other packages), falling back to /tmp if it is not set. Or maybe carta has some .carta.rc files somewhere. As a user I would also appreciate it if this policy were made visible in a log or on-screen, so we have a teachable moment (git is good at that).

3) Our own CentOS system is kind of nasty: /tmp is tmpfs and part of memory. On Ubuntu /tmp is part of /, and memory is /dev/shm. I happen to prefer that, so the user has a choice of stealing from memory, root, or their own drive for $TMPDIR. I don't know if that's the default CentOS (Red Hat?) philosophy now, or if that's our sysmgr. I recall that on Solaris we also had /tmp as part of memory, as it was a tmpfs.
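Point 2 could look roughly like the Python sketch below. It is only an illustration: the function name `decompress_to_scratch` and the log wording are made up, and nothing here reflects an existing CARTA setting. The scratch directory comes from an explicit argument, then $TMPDIR, then /tmp, and the chosen location is logged so the user can see where the data went.

```python
import gzip
import logging
import os
import shutil
import tempfile

logger = logging.getLogger(__name__)


def decompress_to_scratch(gz_path, scratch_dir=None):
    """Decompress gz_path into a temporary .fits file and return its path."""
    # Precedence: explicit argument, then $TMPDIR, then /tmp.
    scratch_dir = scratch_dir or os.environ.get("TMPDIR") or "/tmp"
    fd, out_path = tempfile.mkstemp(suffix=".fits", dir=scratch_dir)
    logger.info("Decompressing %s to %s (set TMPDIR to change the location)",
                gz_path, out_path)
    with os.fdopen(fd, "wb") as dst, gzip.open(gz_path, "rb") as src:
        shutil.copyfileobj(src, dst)  # streams in chunks, never the whole file in RAM
    return out_path
```

The caller is responsible for deleting the returned file once the image is closed.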

veggiesaurus commented 3 years ago

.fz images are often lossy-compressed (by default fpack quantizes floating-point data), so the impressive compression ratio is somewhat expected. They're very useful, though, and I'd imagine it would be easier to support these images, because cfitsio handles this all transparently.

CARTAvis / carta-backend

feature request: handle fits.gz files #648