Idea for speeding-up model transfer

paoloserra commented 4 years ago

Transferring a WSClean continuum sky model to a high-frequency-resolution .MS with Crystalball takes long. One way to speed this up is to limit the transfer to the brightest continuum sources in the field. Once these have been transferred and subtracted, UVLIN might be able to subtract the remaining, faint continuum sources.

Crystalball already allows one to only transfer the N brightest WSClean components in the sky model. However, this is not the same as selecting the brightest sources (where by source I mean a selection of clean components linked together on the sky). Indeed, the brightest components are selected regardless of whether they belong to a same source or not. As a result, one may end up only transferring some portions of a bright source. This may be particularly bad for diffuse continuum sources, which may make up most of the flux in the field but break up in lots of faint components.

The new idea is to use the SoFiA clean mask, inclusive of source IDs, to assign each WSClean component to a source, and then use this information when selecting the subset of WSClean components that should be transferred. The idea is that one either does or does not transfer a source as a whole, not just some of its components.

A first implementation of this idea is available at the branch https://github.com/caracal-pipeline/caracal/tree/selectbright . Compared to CARACal master the only new parameter is selfcal: transfer_model: flux_fraction, which must be a float in the range (0,1]. The code will select only WSClean components belonging to the brightest sources which make up the specified fraction of the sky model's total flux. The default value is 1, i.e., all components are transferred.

(If the SoFiA clean mask does not exist yet, for example because one didn't use SoFiA when imaging the continuum, it could easily be created based on the WSClean model .FITS image. But that needs to be implemented.)

Please give it a go, look at the code and/or comment on the general idea.

KshitijT commented 4 years ago

A ds9 region file made around the brightest sources / sources near the precious HI emission would also do something similar?

paoloserra commented 4 years ago

Not sure @KshitijT . Who's going to draw the region? Using a clean mask (whether existing or created based on the CLEAN model .FITS image) is automated and allows you to select the brightest sources. Can the same be done in DS9?

KshitijT commented 4 years ago

Who's going to draw the region?

Astropy, if you give it the coordinates and a size (like in the ddcal worker).

Sofia clean mask method looks very clean and I would generally just go with it. I was wondering, just in case of fluffy sources with a million components or so, would it be faster to just do a brute-force implementation with coordinates of interest fed in and reject all the other stuff in the field (which might not even be of interest and can be taken care of by uvlin)? This method would avoid: 1) Component vs Source issue 2) Adding sofia mask (I don't know off-hand how long this takes).

This is nothing as sophisticated as what you are implementing (with a flux threshold etc), mostly just idle wondering and comes under the "comment" part of your post. :)

o-smirnov commented 4 years ago

It would be very cool to feed this back into Codex somehow.

Filtering and manipulating a list of sources based on islands in a mask image and/or DS9 regions is pretty basic and useful functionality!

paoloserra commented 4 years ago

@KshitijT I see. Well, in fact, we already have part of that! See the transfer_model: within option.

paoloserra commented 4 years ago

@o-smirnov that's something for @sjperkins . For the time being I'd be keen to experiment with it in CARACal.

I'm also thinking that faint sources can be left out of the transferring and subtracted with UVLIN only as far as they're not too far from the phase centre. For large distances from the phase centre UVLIN won't do a good job. So maybe this selection should only be restricted (optionally) to a certain radius from the phase centre?

sjperkins commented 4 years ago

Some of this may already be in https://github.com/ska-sa/codex-africanus/pull/192 (original author @bennahugo, I've just massaged it)

sjperkins commented 4 years ago

Also naively going to mention BDA

KshitijT commented 4 years ago

@KshitijT I see. Well, in fact, we already have part of that! See the transfer_model: within option.

Yes, the inbuilt functionality of Crystalball to make use of ds9 regions was why I thought about it. The other part, as I mentioned, is in the ddcal worker: https://github.com/caracal-pipeline/caracal/blob/master/caracal/workers/ddcal_worker.py#L247 .

sjperkins commented 4 years ago

Also naively going to mention BDA

By which I mean predict at BDA resolution and then expand "somehow" to full resolution.

sjperkins commented 1 year ago

The new idea is to use the SoFiA clean mask, inclusive of source IDs, to assign each WSClean component to a source, and then use this information when selecting the subset of WSClean components that should be transferred. The idea is that one either does or does not transfer a source as a whole, not just some of its components.

Possibly stupid question: Would all components for a single source share the exact same coordinates? If so, would it be reasonable to simply group components by coordinate and select the brightest total?

paoloserra commented 1 year ago

Generally they'd be near one another, but not exactly at the same RA,Dec.

sjperkins commented 1 year ago

Generally they'd be near one another, but not exactly at the same RA,Dec.

The new idea is to use the SoFiA clean mask, inclusive of source IDs, to assign each WSClean component to a source, and then use this information when selecting the subset of WSClean components that should be transferred. The idea is that one either does or does not transfer a source as a whole, not just some of its components.

I think I see. Would the output of the above process be a wsclean file where a source name could be duplicated multiple times:

Format = Name, Type, Ra, Dec, I, SpectralIndex, LogarithmicSI, ReferenceFrequency='125584411.621094', MajorAxis, MinorAxis, Orientation
s0c0,POINT,08:28:05.152,39.35.08.511,0.000748810650400475,[-0.00695379313004673,-0.0849693907803257],false,125584411.621094,,,
s0c0,POINT,08:28:05.152,39.35.08.511,0.000748810650400475,[-0.000898135869319762,0.0183710297781511],false,125584411.621094,,,
s0c2,POINT,08:18:44.309,39.38.37.773,0.000233552686127518,[-0.000869089801859608,0.0828587947079702],false,125584411.621094,,,
s0c2,POINT,08:18:44.309,39.38.37.773,0.000233552686127518,[0.001264109956439,0.0201438425344451],false,125584411.621094,,,

The components could then be grouped by name (source ID) and the relevant percentage of the flux selected?

paoloserra commented 1 year ago

Might be quicker to chat in person, but my method was simply creating a new list of components, which is a subset of the input list of components. Basically, components not included in the selected sources are excluded from the list.

sjperkins commented 1 year ago

Might be quicker to chat in person, but my method was simply creating a new list of components, which is a subset of the input list of components. Basically, components not included in the selected sources are excluded from the list.

I could chat now if you have time (pinged you on google chat)

sjperkins commented 1 year ago

Hi @paoloserra. In the last line of the sample code below, I think there's an assumption that the id's in the clean mask file are associated one-to-one with the wsclean source id's. Is this always true in general? I'm guessing that an application like SoFia will associate many wsclean source's to one source in the the clean mask file, so I guess this is where the friend of friend algorithm is required?

# Get flux densities
fluxdens = wsclean_sources[:,4].copy().astype(float)
...
# Get X,Y location of WSClean components in the .FITS clean mask
wcsin = wcs.WCS(head, naxis=[wcs.WCSSUB_CELESTIAL])
x,y = wcsin.wcs_world2pix(ra,dec,0)
x,y = np.rint(x).astype(int),np.rint(y).astype(int)
print('# Assigning WSClean components to objects in the .FITS clean mask based on X,Y')

# Associate each WSClean component to an object from the .FITS clean mask
ids = data[y,x]
nr_sources = data.max()
int_fluxes = np.array([fluxdens[ids==ii].sum() for ii in range(1,nr_sources+1)])

paoloserra commented 1 year ago

I don't think so. The line

ids = data[y,x]

creates an array where, for each wsclean component, we list the corresponding source ID from the clean mask.

Multiple wsclean components can belong to the same source and, therefore, have the same value in ids.

That's why, in the last line above, to calculate a source's integrated flux we sum over all wsclean components with that source's ID.

paoloserra commented 1 year ago

The FoF is required if you do not want to use a clean mask.

If you do use a clean mask, the source IDs it contains are already the result of a FoF

caracal-pipeline / caracal

Idea for speeding-up model transfer #1222