I created a branch issue/21/section3edits to start addressing some of the minor typos.
I am going to rearrange things slightly: I will pull out some common text on the likelihood calculation in the three template codes and put it in a short Section 3.1 before the specific codes, and try to do something similar in Section 3.2 for the ML-based codes. I will also go through and tone down some of the "advertising" language/superlatives in the code descriptions, as per one of the referee comments.
Line numbers refer to the "internal reviewer" version of the PDF attached to the Confluence page: https://confluence.slac.stanford.edu/pages/viewpage.action?pageId=238561496
As this section describes the individual codes, many authors will be editing it. Maybe a further breakdown by subsection/code is necessary? Edit: switching to a breakdown by code:
BPZ and general
[x] 361: typo
[x] 358: How were these codes chosen?
[x] 371: typo in architecture
[x] Perhaps write a two-sentence introduction after each of the headers Template-based Approaches/Training-based Codes (Approaches).
[x] I would also perhaps introduce the Simple ensemble estimator at the beginning of the section.
[x] 410: Is a negative flux here different from a non-detection? How are non-detections treated?
General:
[x] Sec 3:
  1) I would rephrase this section to sound less like an advertisement of the codes' options in some places.
  2) If you're going to cite places where some codes were used, you should do this for all of them.
  3) Some details are included for some codes and not others, e.g., things they all must do functionally but which are only identified in some code subsections. If it's really true that some of them don't do some of these basic things (e.g., redshifting of templates), it's worth pointing that out explicitly instead.
  4) Several options are being turned off in the codes because of effects that aren't included in the Buzzard sim. Is this safe? There could be issues missed by forcing the codes into their simplest configurations instead of how they'd be used for any real data.
  5) Are all of these codes, as used for this analysis, available/archived somewhere so that one-to-one comparisons can be made in future analyses?
  6) In general this section needs to be homogenized: some things are said for some codes and not others, and some things are needlessly repeated. It would also benefit from a subsection describing what is uniform across all codes (like, hopefully, the p(z) binning).
[x] 671: Are all codes using 200 linear bins over 0 < z < 2? (A sketch of a common grid follows this list.)
[x] 479: I would avoid superlatives for individual packages. This paper isn’t judging how many options a code has, but whether it actually works to produce unbiased photo-zs.
[x] 683: Why only introduce this distinction here? The notation in general could be tied together more, which is related to some of my other points about places where things are introduced locally in a subsection for a code, but probably apply more generally.
[x] 858: It seems that no code makes use of the fact that an object may be undetected in some band, and in some cases a less-than-transparent or unrealistic way of assigning a flux to the object is used. This seems like a huge oversight in methodology: at the very least, if you're going to pretend an object was not a non-detection in some band, you could assign it the same flux for every code (one that isn't a significant flux detection). See the sketch after this list.
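To make the two items above concrete, here is a minimal sketch of one homogeneous treatment. The grid spec (200 linear bins on 0 < z < 2) is the one queried at 671; the per-band 1-sigma limiting fluxes and the `homogenize_nondetections` helper are hypothetical illustrations, not what any of the codes currently does.

```python
import numpy as np

# Common output grid queried at 671: 200 linear bins on 0 < z < 2.
Z_EDGES = np.linspace(0.0, 2.0, 201)        # 201 edges -> 200 bins
Z_MID = 0.5 * (Z_EDGES[:-1] + Z_EDGES[1:])  # bin centers for tabulating p(z)

# Hypothetical per-band 1-sigma limiting fluxes (made-up numbers, arbitrary units).
LIMITING_FLUX_1SIGMA = {"u": 0.50, "g": 0.30, "r": 0.25,
                        "i": 0.25, "z": 0.40, "y": 0.60}

def homogenize_nondetections(fluxes, flux_errs, band):
    """Replace non-detections (missing or non-positive fluxes) with the same
    1-sigma limiting flux for every code, so no code silently invents its own
    value.  Purely illustrative; not the paper's actual procedure."""
    fluxes = np.asarray(fluxes, dtype=float)
    flux_errs = np.asarray(flux_errs, dtype=float)
    limit = LIMITING_FLUX_1SIGMA[band]
    undetected = ~np.isfinite(fluxes) | (fluxes <= 0.0)
    new_fluxes = np.where(undetected, limit, fluxes)
    new_errs = np.where(undetected, limit, flux_errs)  # error set to the limit too
    return new_fluxes, new_errs
```

Even if the limiting-flux choice is debatable, feeding every code the same replacement value would at least remove the per-code discontinuity flagged under Delight (670) and GPz (768) below.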
EAZY
[x] L428: “although some argue” → which is right? take a stance, since this does seem to be a question that can be unambiguously settled
[x] 429: Who? Cite or state that you disagree with the process.
LePhare
ANNz2
[x] l487: unclear what "weighted average of their performance" means (an illustrative guess at what this could mean is sketched after this list)
[x] l495: unclear what “uncertainty on the machine learning method” means
[x] l502: I find this paragraph difficult to understand
[x] l518: if non-detections are “looked, but not seen” then this is an obvious mistake
[x] 491: run-on sentence
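On the l487 item: my guess is that "weighted average" means each machine-learning member of the ensemble gets a weight from a validation-set performance metric, and the final p(z) is the weight-normalized stack. The sketch below is a generic reading of that phrase, not ANNz2's documented scheme; `member_scores` and the weighting rule are assumptions.

```python
import numpy as np

def weighted_ensemble_pdf(member_pdfs, member_scores):
    """Stack per-method p(z) estimates (on a shared redshift grid), weighting
    each method by a validation-set performance score (higher = better).
    A generic reading of 'performance-weighted average'; ANNz2's actual
    scheme may differ and should be stated in the text."""
    member_pdfs = np.asarray(member_pdfs, dtype=float)   # (n_methods, n_zbins)
    weights = np.asarray(member_scores, dtype=float)
    weights = weights / weights.sum()                    # normalize weights
    stacked = weights @ member_pdfs                      # weighted sum over methods
    return stacked / stacked.sum()                       # renormalize the p(z)
```

If this is roughly right, the text should state which metric sets the weights and whether the weighting is per galaxy or global.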
CMNN
[x] l523: \citet[][hereafter G18]{Graham}
[x] l541: I find this part difficult to understand.
[x] L556: how do you treat the non-detections when finding nearest neighbors?
[x] L600: accuracy or precision?
[x] 532: Would benefit from an equation here (a candidate is sketched after this list).
[x] 556: Why don't you use the non-detection information, instead of forcing a (wrong?) detection in each band? This ties to the question about 410 above. Do any of these photo-z methods have a way of utilizing a non-detection explicitly?
[x] 600: What is the difference between ‘robust’ and ‘accurate’?
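For the 532 item, the missing equation is presumably the Mahalanobis distance in color space from G18; the notation below is my reconstruction and should be checked against the paper's definitions:

```latex
D_M \;=\; \sum_{j=1}^{N_{\rm colors}}
      \frac{\left(c_{j}^{\rm train} - c_{j}^{\rm test}\right)^{2}}
           {\left(\delta c_{j}^{\rm test}\right)^{2}}
```

with training-set galaxies accepted into the color-matched subset when D_M falls below a threshold percentile of the chi-squared distribution with N_colors degrees of freedom (again, my understanding of G18; please verify).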
Delight
[x] L616: explain "latent" at first use
[x] l667: 50,000 is different from the number given in the introduction of the training sample
[x] l670: if non-detections are “looked, but not seen” then this is an obvious mistake
[x] l675: a flat prior on what?
[x] 667: You said earlier there were fewer than 50k training galaxies at the end - where is 50k coming from here? What training set did it use?
[x] 670: Surely setting non-detections to a clearly non-zero value is just wrong, even if you allow it’s ok to set it to the detection threshold (as earlier used)? This seems like quite a big discontinuity in the analysis between codes. Why isn’t it homogeneous?
[x] 671: typo
FlexZBoost
[x] L688: I don't get that; in particular I don't get how this converts a conditional mean estimator to a conditional PDF (see the sketch after this list)
[x] l708: I understood this as meaning that if you assume the PDF is a Fourier series with 35 coefficients then it's fully described by 35 coefficients, which seems like a truism. Maybe you mean something else?
[x] 692: Is it worth saying this? The data set in this paper is already quite small even for stage II standards…
[x] 700: Use consistent names for the ‘training set’ you’re referring to.
[x] 706: I'm not sure what the point of this statement is. Do you think there are real features that a wide-field photo-z code could reproduce with finer resolution than 0.01 in redshift? I think the tendency to praise a photo-z method by how many statistical bells and whistles it incorporates (which you could employ yourself; everyone can do a Fourier representation of a function…) is detrimental to actual progress in solving the fundamental issues we face with photo-zs.
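On the L688 and l708 items together, here is my gloss of the FlexCode-style construction (worth checking against the draft's description): the conditional density is expanded in an orthonormal basis in z, and orthonormality makes each coefficient a conditional expectation, which is exactly what a conditional-mean regressor such as boosting can estimate:

```latex
\hat{f}(z \mid x) \;=\; \sum_{i=1}^{I} \hat{\beta}_i(x)\, \phi_i(z),
\qquad
\beta_i(x) \;=\; \int f(z \mid x)\, \phi_i(z)\, \mathrm{d}z
          \;=\; \mathbb{E}\!\left[\phi_i(Z) \mid X = x\right].
```

So each beta_i(x) is an ordinary regression target, which answers L688; and l708 is not a truism if the fit uses more than 35 basis functions, since storing only the first 35 coefficients per galaxy is then lossy compression of the estimated PDF. The text should say what I was used in the fit versus in storage.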
GPz
[x] L768: if non-detections are “looked, but not seen” then this is an obvious mistake
[x] L787: this sounds like a mistake that is easy to avoid in practice
[x] 762: What variables are being discussed here?
[x] 788: Why was this done, or why was it not fixed? Did any of the other codes use a different training set (besides 667 above)? This seems to undercut the whole point of writing such a focused and limited paper: to homogenize and idealize the process across codes for one-to-one comparisons.
METAPhoR
[x] 3.2.6: a large part of this description is, while interesting, I think not very relevant to the test; I suggest shortening/removing it and more clearly describing any assumptions that may cause biases because they are not met by the data
[x] l860: does this mean the subset of galaxies with no detection in, say, g has a p(z) estimated from a METAPhoR system trained only on galaxies that also have no detection in g?
[x] 810: Typo
[x] 828: Do you mean ‘law’ here? Can you provide a reference to the law?
SkyNet
TPZ
TrainZ