MPoL-dev / MPoL

A flexible Python platform for Regularized Maximum Likelihood imaging
https://mpol-dev.github.io/MPoL/
MIT License

uv distribution figure #221

Closed · jeffjennings closed this 8 months ago

jeffjennings commented 9 months ago

NOTE: Should be reviewed/merged after #220 (this PR was branched from it). [done]

Adds a plotting function to show the 2D (u,v) distribution, with points colored by Re(V).

Meant to address #219
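
For context, the core of such a routine could look roughly like the following (a minimal sketch, not the exact function added in this PR; the name and signature are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

def uv_scatter_fig(uu, vv, data):
    """Scatter plot of the (u,v) distribution, points colored by Re(V).

    uu, vv : baseline coordinates [klambda]
    data   : complex visibilities [Jy]
    """
    # include the Hermitian conjugate points, since V(-u,-v) = conj(V(u,v))
    # and Re(V) is unchanged under conjugation
    u_all = np.concatenate([uu, -uu])
    v_all = np.concatenate([vv, -vv])
    re_all = np.concatenate([data.real, data.real])

    fig, ax = plt.subplots()
    sc = ax.scatter(u_all, v_all, c=re_all, s=1, cmap="RdBu_r")
    fig.colorbar(sc, ax=ax, label="Re(V) [Jy]")
    ax.set_xlabel(r"$u$ [k$\lambda$]")
    ax.set_ylabel(r"$v$ [k$\lambda$]")
    return fig
```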

jeffjennings commented 9 months ago

Example figure generated: IMLup_uv_dist

iancze commented 9 months ago

Thanks. I'm curious about your choice to plot Re only? I think amplitude would be more generally useful.

I think the functionality I was searching for in #219 is a way to roughly estimate the signal (binned amplitude) and sensitivity (binned weights) from the visibilities, as demonstrated in the figure in #97, which can be generated without any imaging parameters by doing the histogram binning directly on the loose visibilities. In terms of an analysis workflow, I would expect one to first plot the baseline distribution (as in this #221, perhaps even just using u,v as scatter points with no amplitude information), then maybe plot a 1D distribution of amplitudes, examine the 2D sensitivity and number of visibilities using a histogram (this is hard to do from a scatter plot like #221, since so many points overlap), and then move on to gridding, making a dirty image, etc.
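
A rough sketch of the kind of binning I mean, done directly on the loose visibilities with no imaging parameters required (the function name, bin count, and return values are just for illustration):

```python
import numpy as np

def binned_uv_profile(uu, vv, data, weight, bins=50):
    """Radially binned amplitude, summed weight, and visibility counts,
    computed directly from the loose visibilities."""
    qq = np.hypot(uu, vv)                      # baseline length
    edges = np.linspace(0.0, qq.max(), bins + 1)
    idx = np.clip(np.digitize(qq, edges) - 1, 0, bins - 1)
    amp = np.abs(data)
    q_mid = 0.5 * (edges[:-1] + edges[1:])

    # signal: weighted mean amplitude per bin
    mean_amp = np.array([
        np.average(amp[idx == i], weights=weight[idx == i])
        if np.any(idx == i) else np.nan
        for i in range(bins)
    ])
    # sensitivity: total weight per bin; also track the raw number of visibilities
    sum_weight = np.array([weight[idx == i].sum() for i in range(bins)])
    n_vis = np.array([np.count_nonzero(idx == i) for i in range(bins)])
    return q_mid, mean_amp, sum_weight, n_vis
```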

Is the recommended way to get there, then, to choose imaging parameters, create a GriddedDataset, and feed that into mpol.plot.vis_histogram_fig? It seems like this route might hamper one from estimating the number of visibilities and other metrics like scatter relative to expected scatter. There's also the consideration of plotting histograms on a Cartesian grid (more relevant for scatter relative to weights; requires a coords object) vs. a polar grid (easier for visualizing sensitivity; does not require a coords object).
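
For the polar-grid case, a 2D sensitivity histogram can be built directly from the loose visibilities with no coords object; a sketch of what I mean (this is not mpol.plot.vis_histogram_fig, and the names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def polar_weight_hist(uu, vv, weight, q_bins=30, phi_bins=24):
    """2D histogram of summed visibility weight on a polar (q, phi) grid."""
    qq = np.hypot(uu, vv)          # baseline length
    phi = np.arctan2(vv, uu)       # baseline position angle [rad]
    H, q_edges, phi_edges = np.histogram2d(
        qq, phi, bins=[q_bins, phi_bins], weights=weight
    )
    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    # on a polar axis, pcolormesh takes (theta, r) ordering
    pcm = ax.pcolormesh(phi_edges, q_edges, H, shading="auto")
    fig.colorbar(pcm, ax=ax, label="summed weight")
    ax.set_title("sensitivity on a polar (q, phi) grid")
    return fig
```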

jeffjennings commented 9 months ago

Sure, the coloring could be amplitude or Re(V) or Im(V) or none.

I understand what you had in mind, though the reason I wanted to leave vis_histogram_fig as-is is that it shows what the model sees, which is especially useful when using the dartboard to split train/test sets. It's part of the modeling pipeline in that sense.

I'm less clear on where this kind of initial data visualization / pre-processing should go. It's covered a bit in the docs, but also in the visread docs. I'm not sure whether you have in mind using a minimal-dependency visread (#227) to pre-process and visualize data that can then be fed to MPoL, or whether you want MPoL to do it all. If the latter, I think the routines in this kind of 'pre-processing pipeline' should be kept distinct from those used in the modeling pipeline, to avoid user confusion and overcomplicating the functions.

In any case, this PR doesn't have to be merged. It was meant as a standalone pre-processing routine, but maybe it doesn't belong in MPoL. I do think #219 should be rescoped, though; to me at least, it's better if the routines going into these two pipelines are clearly separated (either within MPoL or between MPoL and visread).

iancze commented 9 months ago

Thanks, you make very good points that suggest some organizational thought is needed about the respective purposes of the visread vs. MPoL packages.

My original thinking for the bifurcation of the packages was to have visread with the casatools dependency for reading/averaging/exporting visibilities from a measurement set, and MPoL for everything related to RML imaging (especially those depending on PyTorch). But recent developments complicate this picture a little bit.

I'm currently in the process of reorganizing visread to have two install tiers: one with the casatools dependency to do the reading from an MS, and another without casatools, so that routines to estimate visibility scatter and do channel broadcasting can be easily imported into modern environments that might read from an .asdf file; those routines seem misplaced in an imaging-focused MPoL package.
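
To illustrate the tier split, here is a minimal sketch of how the casatools piece might be kept optional so the rest of the package stays importable in modern environments (the module layout and function names are my own illustration, not the actual visread structure):

```python
# casatools is only present in the "casa" install tier; everything else in the
# package should still import cleanly without it.
try:
    import casatools
    HAS_CASATOOLS = True
except ImportError:
    HAS_CASATOOLS = False

def read_ms_columns(filename):
    """Read raw UVW, DATA, and WEIGHT columns from a measurement set.

    Simplified sketch: no flag handling, channel averaging, or spw selection.
    """
    if not HAS_CASATOOLS:
        raise ImportError(
            "Reading a measurement set requires the optional casatools dependency."
        )
    tb = casatools.table()
    tb.open(filename)
    uvw = tb.getcol("UVW")        # shape (3, nrows)
    data = tb.getcol("DATA")      # shape (npol, nchan, nrows)
    weight = tb.getcol("WEIGHT")  # shape (npol, nrows)
    tb.close()
    return uvw, data, weight
```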

Thinking through how a user might export visibilities from a measurement set for MPoL analysis (e.g., using my current tools available):

  1. In a Python 3.8 environment installed on RHEL (i.e., the best I have available where I can actually install casatools without error). Currently: install visread, read and average visibilities, write to a neutral data format like .npy or .asdf. Anticipated: install mpol[casa], read and average visibilities, write to a neutral data format. Then:
  2. In a modern dev environment (for me, > Python 3.10, CUDA docker image): run mpol after loading the .npy or .asdf file (see the sketch below).
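
As a concrete illustration of the handoff between the two environments, a minimal sketch assuming plain NumPy arrays for the baselines, visibilities, and weights (the file path and helper names are made up):

```python
import numpy as np

# Step 1: in the Python 3.8 + casatools environment, after reading and
# averaging the visibilities (e.g. with visread), write them to a neutral format.
def export_visibilities(path, uu, vv, data, weight):
    np.savez(path, uu=uu, vv=vv, data=data, weight=weight)

# Step 2: in the modern environment (no casatools), load them back for MPoL.
def load_visibilities(path):
    d = np.load(path)
    return d["uu"], d["vv"], d["data"], d["weight"]
```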

Is it confusing to recommend to someone that they install MPoL two different ways on two different Python versions? Is this preferable to the current setup, where the MPoL docs may frequently recommend (if not outright require) that the user also install visread to do some visibility manipulations?

Of course, if a user's modern dev environment is Python 3.8 on RHEL, then they would just use mpol[casa] directly. The perennial problem is that the casatools package significantly lags modern Python releases, so making it a core requirement of mpol would substantially constrain the install environments and block a large user base from using this software, which we do not want.

After writing out the use case, I think it's preferable to push the 'data exploration' capabilities into visread (minus the casatools dep) and keep MPoL focused on RML imaging + data modeling (e.g., Pyro + SVI). It's easier to explain to the user that they can use visread to prepare and examine the data (and that this may require a casatools dependency for reading from an MS) than to explain that they need to install MPoL twice into two different environments. Curious to hear your thoughts @jeffjennings.

jeffjennings commented 9 months ago

> Thanks, you make very good points that suggest some organizational thought is needed about the respective purposes of the visread vs. MPoL packages.

> My original thinking for the bifurcation of the packages was to have visread with the casatools dependency for reading/averaging/exporting visibilities from a measurement set, and MPoL for everything related to RML imaging (especially those depending on PyTorch). But recent developments complicate this picture a little bit.

> I'm currently in the process of reorganizing visread to have two install tiers: one with the casatools dependency to do the reading from an MS, and another without casatools, so that routines to estimate visibility scatter and do channel broadcasting can be easily imported into modern environments that might read from an .asdf file; those routines seem misplaced in an imaging-focused MPoL package.

  • This separation makes visread (w/o CASA deps) a natural fit for routines like the uv distribution figure, 1D histograms, etc., since these don't involve PyTorch or anything specific to RML.
  • On the flip side, though, if we can isolate casatools as an optional dependency, why not just incorporate all of the visread functionality into MPoL itself, MS reading included? Would this be easier for the user to deal with? We would essentially have a suite of 'data exploration' tools within MPoL that can average / plot visibilities, and a (casatools-dependent) module that could read visibilities from an MS.

You could have both -- all of this functionality is in visread, and there's also a pre-processing pipeline in MPoL that just calls several visread routines to e.g. make plots or output metrics.
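
Purely as an illustration of that "thin wrapper" idea, something like the following; none of these names exist in MPoL or visread, and the visread calls are stand-ins for whatever the no-casatools tier ends up exposing:

```python
import numpy as np

def preprocess_summary(uu, vv, data, weight):
    """Hypothetical MPoL pre-processing wrapper: compute a few quick-look
    metrics (and, in practice, call visread plotting routines) before gridding."""
    report = {
        "n_vis": int(data.size),
        "max_baseline": float(np.hypot(uu, vv).max()),
        "mean_amplitude": float(np.abs(data).mean()),
        "total_weight": float(np.sum(weight)),
    }
    # e.g. also call the visread figure-making routines here and return the figures
    return report
```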

> Thinking through how a user might export visibilities from a measurement set for MPoL analysis (e.g., using my current tools available):

> 1. in a Python 3.8 environment installed on RHEL (i.e., the best I have available where I can actually install casatools without error). currently: install visread, read and average visibilities, write to a neutral data format like .npy or .asdf. anticipated: install mpol[casa], read and average visibilities, write to neutral data format. Then:
> 2. in a modern dev environment (for me, > Python 3.10, CUDA docker image): run mpol after loading .npy or .asdf file.

> Is it confusing to recommend to someone that they install MPoL two different ways on two different Python versions? Is this preferable to the current setup, where the MPoL docs may frequently recommend (if not outright require) that the user also install visread to do some visibility manipulations?

Yes, I think it's confusing and messy to have to install MPoL two ways in two Python environments. Requiring visread[non-casa] as a dependency seems perfectly fine - it's light, and its dependencies are already required by MPoL (["numpy", "scipy", "astropy"]).

> Of course, if a user's modern dev environment is Python 3.8 on RHEL, then they would just use mpol[casa] directly. The perennial problem is that the casatools package significantly lags modern Python releases, so making it a core requirement of mpol would substantially constrain the install environments and block a large user base from using this software, which we do not want.

Definitely not. That would add a lot of headaches for devs and users.

> After writing out the use case, I think it's preferable to push the 'data exploration' capabilities into visread (minus the casatools dep) and keep MPoL focused on RML imaging + data modeling (e.g., Pyro + SVI). It's easier to explain to the user that they can use visread to prepare and examine the data (and that this may require a casatools dependency for reading from an MS) than to explain that they need to install MPoL twice into two different environments. Curious to hear your thoughts @jeffjennings.

I agree. Again, the docs can point to using visread for data prep/examination, and these functionalities can be run via a data processing pipeline (wrapper) in MPoL.

jeffjennings commented 8 months ago

Closing as out of scope with the v0.3 redesign.