New Extragalactic Tutorial (Milky Way Analogues)

sdmork commented 1 year ago

Added a new Extragalactic tutorial detailing the process of finding Milky Way Analogues in the Legacy Survey.

@kadrlica

kadrlica commented 1 year ago

@simonmork This notebook requires some minor updates before it would be ready to request a Data Lab review. In particular, I think that we might want to see if we can make a version that runs a bit more quickly. Let me know if you'd be interested in pursuing these updates over the summer.

kadrlica commented 1 year ago

@simonmork I've made some updates to the notebook (including changing the name to something more descriptive). Take a look and let me know if this looks ok to you.

kadrlica commented 1 year ago

@jacquesalice and @rnikutta: Both Simon and I have taken another pass through this notebook. Would you have a chance to start a formal review? Thanks!

rnikutta commented 1 year ago

Hi @kadrlica and @simonmork , many thanks! We'll review ASAP (might be next week and the week after though, due to deadlines and travels).

sdmork commented 1 year ago

Hi @jacquesalice,

I have implemented the comments you mentioned in your above message. The only change I did not implement was #9 because the magnitudes that I pull I want to be dereddened and I am unsure if the precomputed colors in the tractor catalog are dereddened.

Let me know if there are any other final changes you want me to make!

-Simon

@kadrlica

kadrlica commented 1 year ago

Thanks Alice and Simon. Sorry about the extra notebook, that was probably my fault. Do we want to commit all the cutout images?

rnikutta commented 1 year ago

Thanks @simonmork . Indeed, the pre-computed colors appear to be using the not-dereddened mags, which can be confirmed like this:

df = qc.query("select mag_g, mag_r, dered_mag_g, dered_mag_r, g_r from ls_dr9.tractor limit 5",fmt='pandas')
df['my_g_r'] = df['mag_g'] - df['mag_r']
df['my_dered_g_r'] = df['dered_mag_g'] - df['dered_mag_r']
df

    mag_g       mag_r       dered_mag_g dered_mag_r g_r        my_g_r     my_dered_g_r
0   25.273912   24.131920   25.204136   24.084917   1.141993   1.141992   1.119219
1   25.448313   25.026280   25.378372   24.979166   0.422033   0.422033   0.399206
2   24.322979   23.566902   24.253012   23.519772   0.756077   0.756077   0.733240
3   26.226358   24.410812   26.156416   24.363699   1.815546   1.815546   1.792717
4   24.746489   24.035713   24.676622   23.988651   0.710775   0.710776   0.687971

I'll add my comments to the NB (if any) in the coming 2-3 days if that's ok.

rnikutta commented 1 year ago

@kadrlica No, please don't commit any outputs other that the NB itself. Thanks.

sdmork commented 1 year ago

@rnikutta I have recommitted changes deleting all exterraneous cutouts. Thank you for clearing up my confusion regarding the precomputed colors and reddening.

galaxyumi commented 1 year ago

Hi @simonmork and @kadrlica, I am a new datalab team member. Thanks for your contribution! It is a very nice NB.

Here are only minor suggestions:

In Goals, "SDSS-identified spiral galaxy Milky Way analogues" -> "SDSS-identified Milky Way analogues" or "SDSS-identified Milky Way-type spiral galaxies"
In Summary, "Milky-Way analogues are a unique class of object" -> "Milky Way analogues are a unique class of objects"
Can you fix the reference numbering to match flow? For example, the second reference appeared in Goals section is numbered as [4], not [2].
Missing unit on y-axis label in color-magnitude plots.

sdmork commented 1 year ago

Hi @simonmork and @kadrlica, I am a new datalab team member. Thanks for your contribution! It is a very nice NB.

Here are only minor suggestions:

In Goals, "SDSS-identified spiral galaxy Milky Way analogues" -> "SDSS-identified Milky Way analogues" or "SDSS-identified Milky Way-type spiral galaxies"

In Summary, "Milky-Way analogues are a unique class of object" -> "Milky Way analogues are a unique class of objects"

Can you fix the reference numbering to match flow? For example, the second reference appeared in Goals section is numbered as [4], not [2].

Missing unit on y-axis label in color-magnitude plots.

Hi @galaxyumi, I have implemented the changes you listed in this message and have pushed them to my branch. Let me know if there is anything else you need.

galaxyumi commented 1 year ago

Thanks @simonmork!

@rnikutta, @jacquesalice - the NB looks good to me.

rnikutta commented 1 year ago

Hi @simonmork and @kadrlica ,

sorry for the stretched-out reviews... I finally found some time to take your NB fr a spin, it's very nice indeed! Thank you so much for contributing it to Data Lab. My comments below; they should be the last ones!

Link any papers and external references directly. E.g., Licquia et al. 2016 [1]. You can either link directly (not need for footnotes), or link internally to the footnote, and from the footnote to the external resource.
Replace anywhere in the text "LSDR9" with "LS DR9"
In "Goals": "in order to locate Milky Way analogues to higher redshifts than many SDSS-identified Milky Way analogues." --> "in order to locate Milky Way analogues to higher redshifts than many SDSS-identified ones."
In "Summary": "we query the datalab servers" --> "Data Lab servers"
"sources that are best-fit with a galaxy profile" --> "best fit"
In "Imports & Setup": I think you don't need import warnings; warnings.filterwarnings("ignore"), as I didn't see any warning without it. (Also, ignoring all types of warnings is not a good strategy, one might miss something unexpected yet important.)
If you want the result of a query as, e.g., a Pandas dataframe, you can do this:
```
q = "SELECT foo, bar FROM ..."
response = qc.query(adql=q, fmt='pandas')
```
I.e. you then also don't need to import the 'convert' helper function.
Remove %%time from the one cell that uses it. Since recently we have a built-in cell execution timer installed.
Nice solution you found to split the RA and Dec ranges into parts, run your query for each quadrant independently, and combine the results in the end. I must admit I was surprised that this was faster than running the query once, but on the entire RA and Dec range. Our DB specialist, and myself, have tried for some hours to find a faster way, but any solution required non-standard trickery, so we can't recommend it.

For your looping over the RA and DEC slices, I suggest this simplification (simpler than the current logic):

Removed time.sleep(10), it only slows down the NB.
Changed to JOIN ON syntax (the WHERE id1 = id2 syntax is equivalent, but deprecated)
Used BETWEEN syntax
Simplified loop stepping logic
Requesting pandas DataFrame as output
Raised timeout because when the server is busy a query could time out

Code to replace the entire looping query cell:

width = 20
RAMIN, RAMAX = 90, 130
DECMIN, DECMAX = 40, 80

dfs = []
query = """
SELECT t.ra, t.dec, t.type, t.dered_mag_g, t.dered_mag_r, t.dered_mag_z,
     t.shape_e1, t.shape_e2, t.shape_r, p.z_phot_mean
FROM ls_dr9.tractor as t
JOIN ls_dr9.photo_z as p
ON t.ls_id = p.ls_id
AND p.z_phot_mean BETWEEN 0 AND 0.1
AND t.ra BETWEEN {} AND {}
AND t.dec BETWEEN {} AND {}"""

for ra1 in range(RAMIN,RAMAX,width):
  for dec1 in range(DECMIN,DECMAX,width):
      t1 = time.time()
      q = query.format(ra1, ra1+width, dec1, dec1+width)
      response = qc.query(sql=q, fmt='pandas', timeout=600)
      dfs.append(response)
      print('FINISHED FOR RA %d-%d, DEC %d-%d AFTER %d s' % (ra1,ra1+width,dec1,dec1+width,time.time()-t1))

df=pd.concat(dfs,axis=0)
df

Please adjust the description of the looping procedure (just under the headline "Data Access") accordingly.
Please follow PEP8 recommendations regarding whitespace around operators, assignments and in variables tuples. https://peps.python.org/pep-0008/#whitespace-in-expressions-and-statements

Examples:
- This:
  
  masterfilter=np.ones(len(arrays[0]),dtype='bool')
should be:

masterfilter = np.ones(len(arrays[0]),dtype='bool')
- This:
  
  ra,dec,width=i,j,int(SKIP/2)
should be:

ra, dec, width = i, j, int(SKIP/2)

etc.

Note that there's no whitespace around '=' in keyword arguments to functions.
If you're using a function only once, don't define it. Functions are for re-use. I.e., remove def noni() and have the code directly in the cell.

Also, the mechanics of defining your filters are unnecessarily complicated. The entire cell that contains def noni() can be replaced with this:

ninfilter = df['dered_mag_g'].notnull() & df['dered_mag_z'].notnull() & df['dered_mag_z'].notnull()
typfilter = (df['type'] == 'DEV') | (df['type'] == 'EXP') | (df['type'] == 'SER')
data = df[ninfilter & typfilter]
print(len(data),'galaxies after quality cuts')

Several comments on this cell:

g = data['dered_mag_g']
r = data['dered_mag_r']
z = data['dered_mag_z']
g_r = g - r
r_z = r - z

redshifts = np.asarray(data['z_phot_mean'])
abs_e = (data['shape_e1']**2+data['shape_e2']**2)**0.5
a = (1-abs_e)/(1+abs_e)

half_light_rad = np.pi/180/60/60*data['shape_r']
half_light_kpc = cosmo.angular_diameter_distance(z=redshifts).value*1000*half_light_rad

cosmo_correction = cosmo.distmod(z=redshifts).value + 5*np.log10(cosmo.h)
M_g = g - calc_kcor('g', data['z_phot_mean'], 'g - r', g_r) - cosmo_correction
M_r = r - calc_kcor('r', data['z_phot_mean'], 'g - r', g_r) - cosmo_correction
M_z = z - calc_kcor('z', data['z_phot_mean'], 'r - z', r_z) - cosmo_correction

The code lines computing 'a' need a bit more context. What's happening, and why? Either ad a line of explanation to the markdown cell above the code cell, or add a comment line just before the code:

# Compute XYZ, which is the ABC (and add reference if possible)
abs_e = (data['shape_e1']**2+data['shape_e2']**2)**0.5
a = (1-abs_e)/(1+abs_e)

Same for the two lines computing the halflight radius.
Think if you could use astropy.units; and if not desired, at least state the units for each line

At the end, before "References", add a "File Cleanup" cell.
- Text below the Section headline. E.g.: "To remove the cutout generated images from the directory, un-comment the cell below and run it."
- Followed by a commented-out code cell:
```
# !rm -f `pwd`/*png
```

sdmork commented 1 year ago

Hi @rnikutta,

Thank you for your detailed feedback. I have implemented your requested changes, and so the notebook should be ready for a final review and merge. Let me know if there is anything else I need to do. Thanks again!

rnikutta commented 1 year ago

Thanks @simonmork ! A few items have not been addressed though:

Link to any and all references (e.g. the Licquia et al. 2016 paper, the Dey et al. 2019 paper, etc. (specifically, in the Goals and Summary sections)

I don't think your loop logic


RAMIN, RAMAX, RACHUNKS = 90, 120, 3
DECMIN, DECMAX, DECCHUNKS = 40, 70, 3

ra_width = (RAMAX-RAMIN)/RACHUNKS dec_width = (DECMAX-DECMIN)/DECCHUNKS for ra in np.linspace(RAMIN, RAMIN+ra_width(RACHUNKS-1), RACHUNKS): for dec in np.linspace(DECMIN, DECMIN+dec_width(DECCHUNKS-1), DECCHUNKS):


is simpler than the one I proposed:

width = 20 RAMIN, RAMAX = 90, 130 DECMIN, DECMAX = 40, 80

for ra1 in range(RAMIN,RAMAX,width): for dec1 in range(DECMIN,DECMAX,width):



Why do you prefer it?

- Replace this wording:
" Since we are limited by the amount of data that any one query is allowed to take at any given time"
with
" Since we are limited by the maximum runtime of a query"

- Replace "In order to use the selection cell for arbitrary RA and Dec, change the following values:" with "Change the following values to adjust for your use case:"
- Fix typo in "half-light radiys"

Everything else looks great!

sdmork commented 1 year ago

Hi @rnikutta, sorry about the link mixup and typos, those should be up to snuff now.

As for the loop logic, I've left in place my most recent implementation for three reasons.

The end user can query for rectangles of arbitrary aspect ratio, since the loop automatically computes the chunk width in both the RA and Dec directions, whereas the suggested loop is limited to square chunks.
It allows for decimal values of width, whereas the suggested loop with the python-native range() function only works for integers.
Using the suggested loop is unstable if (((RAMAX-RAMIN)%width == 0) and ((DECMAX-DECMIN)%width == 0)) == False since it will overshoot the specified RAMAX and DECMAX in these instances. See the below image for an example.

Screenshot 2023-09-18 220846

I appreciate your detailed follow-up very much!

kadrlica commented 1 year ago

Congratulations @simonmork! Thank you reviewers for all your detailed feedback!

astro-datalab / notebooks-latest

New Extragalactic Tutorial (Milky Way Analogues) #170