STScI-Citizen-Science / MTPipeline

Pipeline to produce CR-rejected, astrodrizzled PNGs of HST WFPC2 solar system data.

Revise png creation and create previews for new inputs #129

Closed - ktfhale closed this 10 years ago

ktfhale commented 10 years ago

The png creation is a time-consuming step in the pipeline. We should see what we can make faster about it.

First, I'll replace the one-liner class methods with the functions they call. Then I'll look for spots in the code (especially doubly-nested for loops) where we could take advantage of built-in numpy features.
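
For instance (an illustrative sketch, not actual run_trim.py code), a doubly-nested loop over pixels can usually be collapsed into a single whole-array expression:

```python
import numpy as np

# Illustrative only: a per-pixel Python loop versus the equivalent numpy expression.
def scale_with_loops(data):
    peak = data.max()
    out = np.empty_like(data, dtype=float)
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            out[i, j] = data[i, j] / peak * 255.0
    return out

def scale_vectorized(data):
    return data / data.max() * 255.0  # one whole-array operation, no Python-level loop
```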

ktfhale commented 10 years ago

I've edited run_trim.py so that most of the one-liner class methods have been replaced by the functions they called. I didn't do this in every case: the class method trim(), which calls subarray(), was left unchanged, since subarray() is also called in another function to act on an array other than self.data.

As one would expect, these changes have had no performance impact. Both the master and revise_PNG branches take, with two cores on my laptop, 9 minutes and 33 seconds to produce all the PNGs for a single WFPC2 input (a `c0m.fits` and `c1m.fits` pair with all the attendant output files from CR rejection and astrodrizzling). I'll start looking for ways to speed this up.

Also, both versions encountered this error:

```
/Users/khale/git/MTPipeline/mtpipeline/imaging/run_trim.py:292: RuntimeWarning: invalid value encountered in divide
  self.data = (self.data / self.data.max()) * 255.
```

This seems kind of alarming, because you would hope that self.data.max() is never 0.
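
A guarded version of that scaling would sidestep the warning (just a sketch; I'm assuming the sensible fallback for an all-zero or all-NaN array is to leave it black):

```python
import numpy as np

def scale_to_255(data):
    """Sketch: scale to 0-255, returning an all-zero array if the peak is not positive."""
    peak = np.nanmax(data)
    if not np.isfinite(peak) or peak <= 0:
        return np.zeros_like(data, dtype=float)
    return (data / peak) * 255.0
```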

ktfhale commented 10 years ago

I'll note that we're still using pyfits here; do we want to think about eventually exchanging it for astropy.io.fits? The IO step isn't all that important at the moment, though. We first need to make sure that the numpy arrays are being handled correctly.

Also, we'll eventually need to update get_fits_data() anyway (and run_trim(), which calls it) to deal with data from multiple extensions.
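
If we do switch, astropy.io.fits is nearly a drop-in replacement for pyfits; a sketch of a multi-extension get_fits_data() might look like the following (the extension handling here is hypothetical, not the current function):

```python
from astropy.io import fits

def get_fits_data_sketch(filename, extensions=('SCI',)):
    """Hypothetical sketch: return the data arrays for the named extensions."""
    arrays = []
    with fits.open(filename) as hdulist:
        for hdu in hdulist:
            if hdu.name in extensions and hdu.data is not None:
                arrays.append(hdu.data)
    return arrays
```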

ktfhale commented 10 years ago

I have some questions about the goal of saturated_clip(). The function looks at the weight map (given by a _wht.fits file, an output of astrodrizzle?), and everywhere the weight map has a value of 0, it replaces the pixel value in the _sci.fits array with the 3x3 median of its surroundings.

The purpose of this seems to be to replace meaningless data with meaningful information from its surroundings, for the benefit of the PNG preview images? I don't know what these PNGs are for besides CosmoQuest and serving as preview images.
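
For what it's worth, here's a sketch of the same operation done in one pass with scipy/numpy (not the current implementation, which appears to take the averages one pixel at a time):

```python
import numpy as np
from scipy.ndimage import median_filter

def saturated_clip_sketch(sci, wht):
    """Sketch: replace zero-weight pixels with their 3x3 median."""
    local_median = median_filter(sci, size=3)     # 3x3 median of every pixel, computed once
    return np.where(wht == 0, local_median, sci)  # substitute only where the weight map is 0
```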

Anyway, this makes sense for astrodrizzle outputs like u5fz0707r_cr_c0m_center_single_sci.fits, and u5fz0707r_cr_c0m_center_single_wht.fits, viewed through ds9 below:

screen shot 2014-07-28 at 4 04 23 pm

This weight map actually doesn't have any 0 values. Its minimum value is 39. run_trim() produces the outputs for this image in about 6 seconds.

In contrast, run_trim() takes 2 minutes, 36 seconds for u3b10302p_cr_c0m_wide_single_sci.fits and u3b10302p_cr_c0m_wide_single_wht.fits. The profiler reveals that almost all of this time is spent in saturated_clip(). About a million averages are taken, because there are about a million pixels with a value of 0 in the weight map provided by _wht.fits:

screen shot 2014-07-28 at 4 14 09 pm

All of the pixels off the edge of the chip in both the _sci.fits and _wht.fits files are 0, so, to produce PNGs from this image, we spend two minutes averaging 0s. Obviously this is not desirable.

Getting rid of saturated_clip(), the 9-minute execution of the pipeline mentioned above now takes 18 seconds. Is this something we can get away with? What is the purpose of averaging the pixels around useless data just for the PNGs? How prevalent are these (to my eyes, kind of unusual) frames, in which huge swaths of the PNG have no CCD data?

ktfhale commented 10 years ago

This is really only a problem with the full _wide_ images (not the _wide_single_ images), but the saturated_clip() step for creating _wide_ images takes up the vast majority of PNG creation time. Not all of the _wht.fits weight maps are as bad as the one I happened to pick as my test case. I took a gander through the _wht.fits files produced by astrodrizzle in my 15-image WFPC2 test set.

The best case _wide_ weight maps are like this:

screen shot 2014-07-28 at 4 48 32 pm

There are still a fair number of black pixels along the edges of chip 1, as well as in the "pixels" between the chips. These all have values of 0, and so will be averaged over by saturated_clip(). The worst cases look like this:

screen shot 2014-07-28 at 4 48 15 pm

The majority of the weight map has a value of 0. Running run_trim() on this and its corresponding _sci.fits image alone takes ~6 minutes. Without the saturated_clip() processing, it takes 3 seconds.

ktfhale commented 10 years ago

Here's an example of the difference made by running saturated_clip():

u3b10302p_cr_c0m_wide_single_sci_saturated_clip_stat

I'm actually kind of surprised at the big million-count spike in the delta around 0. Averages of 0s should still yield 0, unless there's some weird floating-point math going on that's making them slightly nonzero? Doing this manually, that doesn't seem to be happening.
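
For reference, the comparison I mean is roughly this (a rough sketch, not the exact script behind the plot):

```python
import matplotlib.pyplot as plt

def plot_clip_delta(original, clipped):
    """Rough sketch: histogram the per-pixel change introduced by saturated_clip()."""
    delta = clipped.astype(float) - original.astype(float)
    plt.hist(delta.ravel(), bins=200)
    plt.xlabel('pixel delta (clipped - original)')
    plt.ylabel('pixel count')
    plt.show()
```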

ktfhale commented 10 years ago

One concern with removing the saturated_clip() step from PNG creation is that we won't be getting rid of high pixel values, and so, after the 0-255 scaling, only those pixels will end up as anything other than pitch black.

I'm not sure this will happen. Cosmic ray rejection should have already dealt with super-high pixel values that stand out from their surroundings. There could still be plenty of saturated pixels left over, but they should be similar to their surroundings (since CR rejection didn't catch them), in which case setting the pixel value to the average of its surroundings won't really help.

I'll take a look tomorrow at what happens, over all six detectors, if we eliminate saturated_clip() from png creation.

ktfhale commented 10 years ago

I'm rebasing the revise_PNG branch for this ticket onto more_drizzle, so that the pipeline can recognize the AstroDrizzle products from the new inputs.

ktfhale commented 10 years ago

We've decided to get rid of the saturated clipping, and entirely eliminate the center images from our pipeline. This will happily cut down on png creation time. I'll make the necessary changes to the filename handling, the tests, and the imaging_pipeline() to eliminate the center frame.

I'm hoping to use something like 16/24/32 cores on science4 to produce the pngs. With that, it shouldn't take more than a day or two.

At the moment, I'm testing whether the linear- and log-scaled PNGs together will suffice to provide at least one decent preview image for the vast majority of our data. If they don't, then I'll investigate ways to chop off, say, the 100 brightest and darkest pixels from the histogram and scale over the remaining range. That could potentially be expensive, however. Finding a single maximum is fairly cheap, but repeating that operation 100 times is less so. Obviously we don't want to sort the entire image; we only need the tails of the histogram.
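
If it comes to that, numpy can pull the tails out without sorting the whole image, e.g. with np.partition (a sketch, using the 100-pixel cut from above):

```python
import numpy as np

def clip_extremes_sketch(data, n=100):
    """Sketch: clip the n darkest and n brightest pixels without a full sort."""
    flat = data.ravel()
    low = np.partition(flat, n)[n]                                     # (n+1)-th smallest value
    high = np.partition(flat, flat.size - 1 - n)[flat.size - 1 - n]    # (n+1)-th largest value
    return np.clip(data, low, high)                                    # then scale as usual
```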

acviana commented 10 years ago

You broke the build. Pushups.

ktfhale commented 10 years ago

I'm running PNG creation on my test lump of images (which I've put in /astro/mtpipeline/testlump). It's occasionally hitting divide-by-zero errors when it takes the log of self.data. I'm guessing that happens when we try to take the log of a negative value. I know there's a positive-shift function in run_trim; I'll see if we're using it.
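
For reference, numpy raises different warnings for the two cases, which might help pin down which one we're actually hitting (plain numpy behavior, nothing specific to run_trim):

```python
import numpy as np

np.log(np.array([0.0]))   # RuntimeWarning: divide by zero encountered in log  -> -inf
np.log(np.array([-1.0]))  # RuntimeWarning: invalid value encountered in log   -> nan
```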

ktfhale commented 10 years ago

So far, I'm very happy with outputting both linear- and log-scaled images. For every image I've seen so far, at least one of the two always provides a good preview of the FITS file.

For the WFPC2 images, it's almost always the linear preview that provides the better fit. But for the other instruments it's more of a tossup. For our fainter fields, like those of objects such as Haumea or Pluto, the log-scaled preview is superior:

screen shot 2014-08-05 at 10 41 13 am

But even for a somewhat closer, yet still small object, like Ceres, the linear image is superior:

screen shot 2014-08-05 at 10 44 19 am

If it's easy for the archive to serve up both, then I don't have any significant objection to doing so. But I also think it might be pretty easy to use our target name parsing to make the choice for us. For everything in the inner system I've seen, and even for objects out to, say, Saturn, I haven't seen a case where the linear image is bad. For objects farther out, even Uranus or Neptune, I think the log preview has advantages.

Our planets-and-moons list is already divided up by primary body. I think it would be pretty easy to use the linear image for every planet (not Pluto), and the log image for everything else and for anything that can't be identified.
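
Something like this would cover it (a hypothetical sketch; the function name and planet set are made up for illustration):

```python
# Hypothetical decision logic: linear previews for the planets (excluding Pluto),
# log previews for everything else or anything that can't be identified.
PLANETS = {'mercury', 'venus', 'mars', 'jupiter', 'saturn', 'uranus', 'neptune'}

def preferred_scaling(primary_body):
    """Return 'linear' or 'log' for a parsed primary-body name (or None if unknown)."""
    if primary_body is not None and primary_body.lower() in PLANETS:
        return 'linear'
    return 'log'
```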

I think we could start the PNG creation and apply the target-name script that chooses which scaling to use as the preview image afterwards, when we're renaming files, although we would definitely want to put this decision logic into the pipeline proper eventually. But I got the impression last week that some of Wally's work is waiting on me to deliver the preview images.

Alternatives: we could just serve up both, or I could look into a clever way of improving the linear scaling, like lopping off a fixed number or a fixed percentage of the brightest pixels before scaling. But in the cases where the log scaling is better, I think any improvements we could make to the linear scaling would only approach the utility of the log scaling. The only reason to try to make the linear scaling more clever would be to keep all of our images going down the same "path" in the pipeline.

There are still those troublesome divide-by-zero errors, though, so I'm going to look into those.

ktfhale commented 10 years ago

We've decided that we'll produce both the linear and log scalings and serve both to MAST. We'll have to consult with them about how to handle the filename conventions.

Also, the divide by zero errors appear to be occurring on images with very negative minimum pixel values. We cut all those out when we do CR rejection, but they stay in the non-rejected AstroDrizzle outputs. For instance, ibr022dwq_flt.fits has a minimum pixel value of -40 billion, and a maximum of 18 billion. After drizzling, the minimum and maximum become a more modest -60 million and 140 million, respectively. There are actually a reasonable number of these incredible pixels in the image, as the histogram shows.

screen shot 2014-08-05 at 12 06 49 pm

These extremely negative and positive pixels all lie in a line one pixel below the chip gap. The pixels actually along the chip gap all have reasonably normal values.

screen shot 2014-08-05 at 12 08 31 pm

I'm going to try and find out how many of our images have these preposterous minimum and maximum values.

ktfhale commented 10 years ago

I've looked into how many of our _flt.fits images have these absurdly high pixel values. 1,145 images have either a minimum below -1e6 or a maximum above 1e6.
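
The check itself is quick to script (a sketch; the glob pattern and extension index are placeholders, not our actual archive layout):

```python
import glob
import numpy as np
from astropy.io import fits

# Sketch: flag _flt.fits files whose data fall outside +/- 1e6 anywhere.
bad = []
for path in glob.glob('/astro/mtpipeline/archive/*/*_flt.fits'):   # placeholder pattern
    data = fits.getdata(path, ext=1)                               # assumes the science array is in ext 1
    lo, hi = np.nanmin(data), np.nanmax(data)
    if lo < -1e6 or hi > 1e6:
        bad.append((path, lo, hi))
print(len(bad), 'files exceed the +/- 1e6 cut')
```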

It's clearly a systematic problem. Here's a plot of the minimums and maximums for each image, vs their image number:

bads

From the looks of things, we might have only a few bad eggs (albeit bad eggs with minimums and maximums in the trillions). Everything past the 5,000th file is more or less fine, in fact. But for the first 5,000 images, if we zoom in, and look at just the maximums:

bad_side1

This seems a reasonable spread, and most of the maximum values are less than a million. But if we zoom out by an order of magnitude along the y-axis,

bad_side2

We start to see some images with maximums greater than 1e6. But, if we zoom out again,

bad_side3

There's a whole spike of images which have maximums in the tens of millions. Going out once more by two orders of magnitude,

bad_side5

That spike collapses in the face of images with values in the billions. The story's basically the same for the minimums.

And I don't think this is a case of a single bad pixel. I think a lot of these are probably due to edge effects, which result in a whole lot of insane pixels per image. So it's almost certainly not worth trying to just take off a certain number, or percentage, of bad pixels. Rather, I think we should impose a cut that sets all pixels above or below a certain magnitude either to 0 or to that magnitude.

I've already tried it on the image mentioned in the previous comment using a 1e6 threshold, but that didn't have much effect. The new minimum was -960,000. I'd like to push the threshold lower. For the png creation, I'm going to try setting every pixel below -10 to -10, and any pixel above something like 200,000 to 200,000.

EDIT: Actually, rather than setting everything below -10 to -10, I might as well just set every negative value to 0 and save some processing time by obviating the positive() step.
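
In numpy terms that cut is just a clip, with the negatives floored at 0 per the edit above (sketch only):

```python
import numpy as np

def threshold_sketch(data, ceiling=2e5):
    """Sketch of the cut described above: floor negatives at 0 and cap the top at `ceiling`."""
    return np.clip(data, 0.0, ceiling)
```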

ktfhale commented 10 years ago

Using the -10,2e5 thresholding described above, the results on ibr022dwq were not inspiring.

The linear scale: ibr022dwq_cr_wide_single_sci_linear

The log scale: ibr022dwq_cr_wide_single_sci_log

Features are starting to appear in the log scale, although in both images the only things you can really see are the bad pixels along the chip gap.

I'm wary of adjusting the thresholding so that this image looks good, and then ruining everything else, but we'll see how it goes with a 0,1e5 threshold. I think I recall seeing images with pixel values over 10,000 in actual targets, so this would be a very bad idea for actual science images. But I'm going to see if it noticeably damages the preview images.

ktfhale commented 10 years ago

Things are looking good for ibr022dwq using a 0.001, 2e5 thresholding cut, followed by making linear- and log-scaled images. The linear image is still almost entirely black and useless, but you can at least see things in the log-scaled image.

ibr022dwq_cr_wide_single_sci_log
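
Roughly, the combination being tested looks like this (a sketch of one way to write it; run_trim may implement the details differently):

```python
import numpy as np

def log_preview_sketch(data, floor=1e-3, ceiling=2e5):
    """Sketch: clip to [floor, ceiling], log-scale, then stretch into the 0-255 range."""
    clipped = np.clip(data, floor, ceiling)
    logged = np.log10(clipped)
    logged -= logged.min()
    peak = logged.max()
    return logged / peak * 255.0 if peak > 0 else np.zeros_like(logged)
```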

I'm going to try running this on a large test set. If either the linear or the log images look okay for those, I'm going to run it on our entire archive.

ktfhale commented 10 years ago

The 0.001-to-200,000 thresholding, along with linear and log scaling, seems to produce adequate preview images. To summarize: if there's no planet in frame, you effectively never want the linear image. However, it turns out that even if there is a planet in frame, the log image is sometimes still preferable:

screen shot 2014-08-05 at 4 15 53 pm

In this WFC3/UVIS image, the chip border on the upper right has a lot of insane pixels. The thresholding helps, but only the log scale is able to visually capture the resulting range. Also, most of the insane pixels seem to be along the chip gaps of UVIS images.

I think I can run the pipeline over everything with these settings. Almost nobody's using science4 at the moment, so I think I can get away with 32 cores.

EDIT: Never mind, one of my fellow interns is using all of science4.

ktfhale commented 10 years ago

I left the pipeline running overnight with 16 cores on science4. Unfortunately, it looks like we may have hit our disk quota on central storage. The /astro/mtpipeline/ folder is currently 3.6 TB in size. I can get that down by deleting the redundant cr-rejected files I created in /mtpipeline/archive.

ktfhale commented 10 years ago

I've deleted those files. I also just realized that the cr-rejected files I created in /mtpipeline/archive/wfpc2 are not exactly redundant. They were produced by cosmicx, whereas the cr-rejected products in /mtpipeline/archive/wfpc2/ were produced before I arrived, by cosmics.py. So I've inadvertently deleted the cosmicx versions of the WFPC2 cr-rejection files.

I ought to have consulted with @acviana before I did that. But I had the impression that we're happy not reprocessing the WFPC2 outputs from the older version of the pipeline. If we wanted to reprocess them, so that cr-rejection is done with cosmicx along with everything else, we could.

... and that bought us only 0.3 TB. /astro/mtpipeline is currently at 3.3 TB.

acviana commented 10 years ago

How much space would we save by deleting all the center images?

ktfhale commented 10 years ago

Actually, I've just discovered the disk quota problems aren't coming from central storage. Rather, my science cluster disk quota was full, and my logging directory is on my science account, not central storage. So the log files couldn't be written.

Woops. This hasn't been one of my finer mornings.

ktfhale commented 10 years ago

With the exception of the 83 known bad files documented in Ticket #134, png preview images have been created for all of our input files. I think this ticket can be closed.