barnabytprowe / great3-public

Public repository for the Third Gravitational Lensing Accuracy Testing Challenge
BSD 3-Clause "New" or "Revised" License

Metric calculations for constant shear branches #13

Open barnabytprowe opened 10 years ago

barnabytprowe commented 10 years ago

This is an email received today; its authors very kindly said I could respond publicly here at great3-public:

We, the amalgam@iap team, have been spending a lot of time on the challenge playing around with weighting schemes and so on, in order to understand why we were doing so well on some fields and so badly on some others. We're not doing so badly after all, but I'd like some clarification on the way you calculate the c and m values before inferring scores, if this is not considered an unfair request.

Is the fit of the g_supplied - g_true regression performed in a least-squares sense, or is it a more robust mean absolute deviation minimization? Do you clip some outlier fields off? Why don't you allow us to provide errors (or weights) for each field, if it is done in a least-squares sense? Of course this all applies to the constant shear branches (either control or real_galaxy).

Responses below...

barnabytprowe commented 10 years ago

Is the fit of the g_supplied - g_true regression performed in a least-squares sense, or is it a more robust mean absolute deviation minimization?

It's done in a least-squares sense, with equal weight given to all 200 fields. As mentioned in the handbook, we rotate the shears for each field into the coordinate system in which positive g_1 is aligned with the direction of the PSF ellipticity (as estimated using weighted moments on the noise-free PSF images).
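To make this concrete, here is a minimal sketch of the kind of fit described above (the real metric code lives in great3-public; the array names, the use of numpy.polyfit, and the spin-2 rotation convention are illustrative assumptions):

```python
import numpy as np

def rotate_shear(g1, g2, psf_angle):
    """Rotate spin-2 shears so that positive g_1 lies along the PSF
    ellipticity direction (psf_angle = PSF position angle in radians)."""
    c2, s2 = np.cos(2.0 * psf_angle), np.sin(2.0 * psf_angle)
    return g1 * c2 + g2 * s2, -g1 * s2 + g2 * c2

def fit_m_c(g_true, g_sub):
    """Equally weighted least-squares fit of g_sub = (1 + m) * g_true + c,
    one (g_true, g_sub) pair per field, each field counting the same."""
    slope, intercept = np.polyfit(g_true, g_sub, 1)
    return slope - 1.0, intercept  # multiplicative bias m, additive bias c

# Illustrative use for one rotated shear component across the 200 fields:
# g1p_true, _ = rotate_shear(g1_true, g2_true, psf_angle)
# g1p_sub, _ = rotate_shear(g1_sub, g2_sub, psf_angle)
# m_plus, c_plus = fit_m_c(g1p_true, g1p_sub)
```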

Do you clip some outlier fields off?

No. There is no truly significant variation in the data properties from field to field within a branch, so outlier rejection never struck us as appropriate. As I understand it, removing outliers is sensible in real data because there can be a number of reasons why a sub-population of your data points doesn't match the assumptions you wish to make. That isn't the case here: the same basic assumptions hold for all the input data, so any strong outliers must come from some aspect of the treatment of the data. I think it's much less certain that outlier rejection is appropriate in that case.

(And so far, in the individual submissions we've looked at there hasn't been a high incidence of outlier fields.)

Why don't you allow us to provide errors (or weights) for each field, if it is done in a least-squares sense?

The reasons relate to the point I made above: strong variations in results from field to field are not expected, since the data properties are broadly very similar between fields. Although the PSF does vary in size and shape to some extent, the galaxy SNR and size distributions vary only slightly (in their lower-edge cutoff, as described in the handbook), and other properties of the images don't change (e.g. the noise model). Branch to branch this is not true, but that is part of the point of the experiment.

In addition, given that we have not varied the data properties strongly between fields, there is some motivation for rewarding methods that give consistently good performance across all 200 fields. Combined with the fact that there is no a priori reason to upweight some fields and downweight others (the assumptions you can make about the data are explicitly clear from the description in the handbook), it never occurred to us that fields ought to be allowed to be weighted.
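Purely for contrast with the equal-weight choice, this is roughly how per-field errors would enter a weighted least-squares fit if they were allowed; it is a hypothetical sketch, not anything the challenge actually does (numpy.polyfit expects weights of the form 1/sigma):

```python
import numpy as np

def weighted_fit_m_c(g_true, g_sub, sigma_per_field):
    """Hypothetical weighted variant of the linear fit: fields with larger
    reported errors would count for less.  The GREAT3 constant shear metric
    instead gives every field equal weight."""
    slope, intercept = np.polyfit(g_true, g_sub, 1, w=1.0 / sigma_per_field)
    return slope - 1.0, intercept  # (m, c)
```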

There might well be something we are overlooking though. If anyone does see big variations in the data properties between fields, please let us know.

@rmandelb do you have anything to add?

One further question...

in order to understand why we were doing so well on some fields and so badly on some others.

A question back to the author of the original question: how do you quantify how well you are doing in certain fields? We only provide metrics for the different branches (e.g. rgv, rgc, cgc, etc.), so is your identification of outlier fields based on looking at the catalogues or submissions themselves before you submit them? Rather than downweighting these fields, there might be an advantage in working out what it is about those images that makes the method behave more erratically than in other images...

rmandelb commented 10 years ago

I only have one thing to add to this very thorough answer:

When we are doing science (i.e. trying to understand the methods well enough to write a paper that clearly explains the results of the challenge), instead of just providing a single number to rank methods, we can explore questions like "which methods produce outliers that mess up the least-squares fitting?" It's not clear that we'll have any indicator in real data that those fields are outliers, so removing them here is not very well motivated, but it will at least be useful information that people who use those methods could use to improve them.

rgavazzi commented 10 years ago

Thanks guys for your quick answer. This is the best place to follow up on this issue, I reckon! I fully appreciate that there shouldn't be too much field-to-field variation in the accuracy if SNR is matched and the number of objects is kept constant. This stops being the case if for some reason there are not exactly 100x100 objects, no more no fewer, in a given field, or if the PSF is substantially larger or pathological. At least in our method, we report consistently larger errors on ellipticities for those fields affected by the latter issue, and we could also account for the fact that some broken pairs of rotated twins could lead to a noisy, imperfect shape noise cancellation in some other fields (fortunately, we are no longer facing this problem though!).

Chopping off fields we don't like is not something one can afford in real life (well... one could argue that some PSFs would just deserve going to the trash bin...), but I agree the dependence on the PSF is part of the experiment. I confirm we find noticeable variations for ground PSFs. This may not be at a level that could completely screw up a least-squares fit, but well...

What I meant by "doing well on some fields and badly on some others" is more related to the stability of our findings for each field: by "doing well" on a field, we could just as well have been consistently wrong on it. Sorry about this confusing statement.

Fully agree with your last sentence. This is why I wanted us to be on the same footing :) !

rmandelb commented 10 years ago

Hi Raphael - I just want to comment on two things:

(1) We do account for the larger PSF when choosing a noise variance for a given field, so in principle the S/N distributions should be similar regardless of the PSF size. (Yes, this is not like real data, but it seemed important for the test we're trying to do here.)
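A toy illustration of that idea, under a very crude aperture-photometry model (the function name, the aperture model, and all numbers here are assumptions, not the GREAT3 recipe):

```python
import numpy as np

def pixel_noise_for_target_snr(flux, psf_fwhm, target_snr, pixel_scale=0.2):
    """Toy model: for a roughly PSF-sized source, S/N ~ flux / (sigma_pix *
    sqrt(N_eff)), with N_eff growing with the PSF area, so a larger PSF
    requires a lower pixel noise to hit the same target S/N."""
    n_eff = np.pi * (psf_fwhm / pixel_scale) ** 2  # effective pixels under the PSF
    return flux / (target_snr * np.sqrt(n_eff))

print(pixel_noise_for_target_snr(1000.0, psf_fwhm=0.6, target_snr=20.0))
print(pixel_noise_for_target_snr(1000.0, psf_fwhm=1.2, target_snr=20.0))  # lower
```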

(2) One of the planned tests for after the challenge ends is to split the ground-based branches into the 50% best and worst seeing and see if the performance for each method differs in a statistically significant way for these two cases. This is one of the results we would want to include in the final paper.
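As a sketch of what such a test could look like (a guess at the procedure, not the actual analysis code; the fit errors here come straight from the least-squares covariance):

```python
import numpy as np

def m_c_with_errors(g_true, g_sub):
    """Equally weighted linear fit returning (m, c) and their 1-sigma errors."""
    coeffs, cov = np.polyfit(g_true, g_sub, 1, cov=True)
    m, c = coeffs[0] - 1.0, coeffs[1]
    m_err, c_err = np.sqrt(np.diag(cov))
    return m, c, m_err, c_err

def compare_seeing_halves(seeing, g_true, g_sub):
    """Split fields at the median seeing and fit each half separately, so the
    best-seeing and worst-seeing results can be compared for significance."""
    good = seeing <= np.median(seeing)
    return (m_c_with_errors(g_true[good], g_sub[good]),
            m_c_with_errors(g_true[~good], g_sub[~good]))
```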

barnabytprowe commented 10 years ago

At least in our method, we report consistently larger errors on ellipticities for those fields affected by the latter issue, and we could also account for the fact that some broken pairs of rotated twins could lead to a noisy, imperfect shape noise cancellation in some other fields (fortunately, we are no longer facing this problem though!).

I see, yes, that is a possibility. We tried to mitigate this by having cutoffs in the size and SNR distributions of galaxies that reflect the size of the PSF, e.g. so that there are no super-small objects. But imperfect shape noise cancellation is unavoidable if we want at least semi-realistic galaxy size and SNR distributions, and it makes total sense that this is happening for the big/ugly PSF cases.
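For readers unfamiliar with the rotated-pair trick, a toy example of why losing one member of a 90-degree rotated pair re-introduces shape noise (complex spin-2 ellipticities in the weak-shear approximation; the numbers are made up):

```python
# Toy spin-2 arithmetic: rotating a galaxy by 90 degrees multiplies its complex
# ellipticity by exp(2i * 90 deg) = -1, so averaging a rotated pair cancels the
# intrinsic shape and (to first order) leaves only the shear.
e_int = 0.2 + 0.1j   # intrinsic ellipticity (made-up value)
g = 0.02 + 0.0j      # constant shear applied to the field (made-up value)

e_obs_a = e_int + g          # original galaxy, weak-shear approximation
e_obs_b = -e_int + g         # its 90-degree rotated partner
print(0.5 * (e_obs_a + e_obs_b))  # (0.02+0j): shape noise cancels, shear remains
print(e_obs_a)                    # (0.22+0.1j): partner lost, shape noise leaks in
```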

If it makes you guys feel any better, for the science paper we do plan to split the ground-based simulation fields by something like seeing and compare results - it sounds like you may do significantly better in the good-seeing images!

I confirm we find noticeable variations for ground PSFs. This may not be at a level that could completely screw up a least-squares fit, but well...

It might screw it up; that depends on the strength of the outlier, since each field gets equal weight. We will have to see when we split further for the science paper. We did try to make the variations in seeing and PSF properties realistic, motivated by real atmosphere and optics models. The handbook goes into quite a bit of detail about this. I think it's not unfair to implicitly reward methods that can handle this sort of variation, which I think the current setup does.

Still, even though we have tried to be realistic about the PSF model variation, there are occasional freaks when you have this much data split into so many fields and PSFs (e.g. https://github.com/barnabytprowe/great3-public/issues/6 ).

What I meant by "doing well on some fields and badly on some others" is more related to the stability of our findings for each field: by "doing well" on a field, we could just as well have been consistently wrong on it. Sorry about this confusing statement.

I see, no worries!

Fully agree with your last sentence. This is why I wanted us to be on the same footing :) !

I hope we've made the situation clear. We really are doing nothing fancy at all; in fact, it's the most basic thing you can imagine: equally weighted least-squares regression. Strong outliers will therefore hurt scores!
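To illustrate that point, a small self-contained toy (all numbers invented): 200 fields measured almost perfectly by a hypothetical method, plus one badly measured field, fitted with the same equally weighted least squares as above.

```python
import numpy as np

rng = np.random.default_rng(42)

# 200 fields with |g| up to ~0.05, measured by a nearly unbiased toy method.
g_true = rng.uniform(-0.05, 0.05, 200)
g_sub = g_true + rng.normal(0.0, 2e-4, 200)

# Corrupt the single field with the largest |g_true|.
bad = np.argmax(np.abs(g_true))
g_sub_outlier = g_sub.copy()
g_sub_outlier[bad] += 0.05

m_clean = np.polyfit(g_true, g_sub, 1)[0] - 1.0
m_outlier = np.polyfit(g_true, g_sub_outlier, 1)[0] - 1.0
print(m_clean, m_outlier)  # the one bad field shifts the fitted m by of order
                           # 0.01, far larger than the statistical fit error
```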

barnabytprowe commented 10 years ago

Please note, all the metric evaluation code is now public on great3-public!