LSSTDESC / rail_gpz_v1

RAIL-wrapped version of Peter Hatfield's pure Python implementation of GPz-v1
MIT License

PDFs for the GPZ algorithm #16

Closed hdante closed 5 months ago

hdante commented 5 months ago

Hello, when estimating with the GPZ algorithm, the output file doesn't contain a matrix for the probability density functions, only the z mode. Should it include the PDFs?

HDF5 "out_gpz2.hdf5" {
GROUP "/" {
   GROUP "ancil" {
      DATASET "zmode" {
         DATATYPE  H5T_IEEE_F64LE
         DATASPACE  SIMPLE { ( 2437615, 1 ) / ( 2437615, 1 ) }
      }
   }
   GROUP "data" {
      DATASET "loc" {
         DATATYPE  H5T_IEEE_F64LE
         DATASPACE  SIMPLE { ( 2437615, 1 ) / ( 2437615, 1 ) }
      }
      DATASET "scale" {
         DATATYPE  H5T_IEEE_F64LE
         DATASPACE  SIMPLE { ( 2437615, 1 ) / ( 2437615, 1 ) }
      }
   }
   GROUP "meta" {
      DATASET "pdf_name" {
         DATATYPE  H5T_STRING {
            STRSIZE 4;
            STRPAD H5T_STR_NULLPAD;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
      DATASET "pdf_version" {
         DATATYPE  H5T_STD_I64LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
   }
}
}


eacharles commented 5 months ago

The PDFs are in the data group, stored using qp.

-e

sschmidt23 commented 5 months ago

GPz v1 outputs a single Gaussian as the PDF estimate for each galaxy, which can be stored as just a mode and width. It is stored as a qp Ensemble. If you wanted to convert the Gaussian to a grid representation or something else then you can do that as a post-processing step.

hdante commented 5 months ago

Hello, thank you for the explanations. I'll ask the team whether they need the explicit PDF; if they don't, this issue can be closed as invalid. If they do need it, so that the output file can be treated uniformly with those of other estimation algorithms, would you be willing to add an option to generate it in GPZ's estimation?

eacharles commented 5 months ago

Technically the PDFs are being generated. It is just that they are being represented as simple Gaussians. If you read the file with qp, you can get the values on whatever grid you want, for example:

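import numpy as np
import qp

ensemble = qp.read(file)  # file: path to the output HDF5 file
pdf_values = ensemble.pdf(np.linspace(0, 3, 301))  # evaluate every PDF on a 301-point grid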

hdante commented 5 months ago

Sure, but can you add an option for them to be written during the estimation?

eacharles commented 5 months ago

So, I'm not sure I understand what you are asking for. I guess I'm missing some context as to why you want the PDF values evaluated on a grid. Is it for visualization, or to put into a table, or is that just what is expected?

For context as to why I'm asking:

For GPZ the PDFs are being written, just as Gaussians instead of as interpolated grids or histograms. The means and widths are in the data/loc and data/scale DATASETs respectively. Since this is the representation of the PDF that GPz produces, this is really the "best" way to present this information. If someone wants to put this on a grid, e.g., for visualization, they can either use qp to read the data and compute the values on the grid, or compute them pretty easily from the stored values themselves.
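As a rough sketch of the "from the stored values themselves" route, the snippet below evaluates one Gaussian per galaxy on a shared redshift grid using only NumPy. The function name `gaussian_pdf_on_grid` and the two example rows are hypothetical and made up for illustration; in practice `loc` and `scale` would be the arrays read from data/loc and data/scale, and this assumes they are the mean and standard deviation of each Gaussian.

```python
import numpy as np

def gaussian_pdf_on_grid(loc, scale, grid):
    """Evaluate one Gaussian per galaxy on a shared redshift grid.

    loc, scale: arrays of shape (n_galaxies, 1), laid out like data/loc and data/scale.
    grid: 1-D array of redshift values.
    Returns an array of shape (n_galaxies, len(grid)).
    """
    loc = np.asarray(loc, dtype=float).reshape(-1, 1)
    scale = np.asarray(scale, dtype=float).reshape(-1, 1)
    grid = np.asarray(grid, dtype=float).reshape(1, -1)
    z = (grid - loc) / scale  # broadcasts to (n_galaxies, n_grid)
    return np.exp(-0.5 * z**2) / (scale * np.sqrt(2.0 * np.pi))

# Made-up values for illustration only (not taken from any real output file).
loc = np.array([[0.5], [1.2]])
scale = np.array([[0.05], [0.10]])
grid = np.linspace(0.0, 3.0, 301)

pdf_values = gaussian_pdf_on_grid(loc, scale, grid)
print(pdf_values.shape)
```

For a file with millions of rows it would probably make sense to process the rows in chunks, since the full grid for all galaxies is much larger than the two stored columns.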

Yes, we could store the values on the grid instead, but then we would have to know what grid you want the values stored on, and whether you wanted a histogram or an interpolated grid. And I'm not really sure that storing a grid of 300 numbers makes much sense when you have more precise information just from storing 2 numbers.
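To put rough numbers on that trade-off, here is a back-of-the-envelope estimate for the file shown at the top of the thread. It assumes uncompressed float64 storage, ignores HDF5 overhead and the grid array itself, and takes the 301-point grid from the earlier np.linspace(0, 3, 301) example as the hypothetical stored grid.

```python
n_galaxies = 2_437_615   # rows in the HDF5 listing at the top of the thread
bytes_per_float64 = 8
n_grid = 301             # assumed grid size, taken from the np.linspace(0, 3, 301) example

gaussian_bytes = n_galaxies * 2 * bytes_per_float64    # loc + scale
grid_bytes = n_galaxies * n_grid * bytes_per_float64   # one PDF value per grid point

print(f"loc + scale: {gaussian_bytes / 1e6:.1f} MB")       # 39.0 MB
print(f"grid:        {grid_bytes / 1e9:.2f} GB")           # 5.87 GB
print(f"ratio:       {grid_bytes / gaussian_bytes:.1f}x")  # 150.5x
```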

Are you sure that you can't just use what is there?

hdante commented 5 months ago

Hello, Eric, right now I need the response so I can forward the upstream's opinion to the LineA team. I have already understood that the PDFs are Gaussian-shaped and that the 3 Gaussian parameters are saved and nothing else. I imagine there are a few reasons we might need the PDF grids, in particular CPU accounting and quotas, separation of responsibilities between the LineA staff and the end-user scientists, the desire for a standardized output format, and uniform postprocessing code that doesn't use the RAIL libraries. I won't speculate about these, though. Right now I'm just discussing their choices with them and will forward their conclusions back.

Directly answering your question: it's not me who will decide, but we can work either with or without the PDFs on a grid. And, of course, if the upstream is unwilling to add the support, that's not a problem either, the issue can be closed as invalid.

eacharles commented 5 months ago

Ok, that makes sense. Thanks.

I'd be curious to know what the issues are, and how the LineA staff intend to post-process the outputs, as that would influence things like how we might provide a generic tool for converting qp distributions between different representations, or provide generic parameters as part of a rail base class to override the qp representation used for output.

hdante commented 5 months ago

Hello, the discussion was focused around either having a uniform representation for postprocessing or using more efficient representations for reducing the storage demands. We settled on using more efficient representations and requiring the users to use the qp library to postprocess the files. This means this issue can be closed as invalid.

Thanks for the help, everyone.