Open kadrlica opened 6 years ago
"It has ended up not being as easy as I hoped."
First contender for Stack Club tagline? :-)
Do you want to submit a PR and use it to point out some sticking points? Might be good for all of us watching to appreciate the challenges. Thanks for taking a crack at this!
The current version of the notebook can be found here, but I'm not sure it's ready for a PR. The notebook is not fully functional, and the GitHub PR interface shows changes to the JSON, which I don't find especially enlightening (though maybe I'm missing something?).
Some of the sticking points:

- `processEimage.py` currently relies on Python 2, but the stack versions on JupyterLab are all Python 3. I've created my own version of `processEimage.py` that updates the Python version, and it appears to work, but it cannot be distributed with the notebook. In the notebook I directly interface with the `ProcessEimageTask` API, but this is clunky.
- Reference catalogs (`astrometry_net_data`), which are not installed in current versions of the stack (UNRESOLVED)
- Memory errors on the `tiny` JupyterLab instance (this can be solved by choosing `medium`)

Thanks Alex.
`processEimage.py` relying on Python 2 sounds like a bad sign: it could be falling behind the stack, perhaps through dis-use. Is `processCcd.py` itself Python 3-compatible? If so, shelving the `processEimage.py` part of the tutorial and using a different (and raw) test image might be the way to go, I guess. Would that solve the reference catalog problem as well?
The memory issue looks serious. Suppose we insist on being able to run `processCcd.py` from a tutorial notebook, @SimonKrughoff - what would the implications be for the NCSA accounts? Is `medium` an allowable option for Stack Club members?
A couple of minutes' runtime is awkwardly long for a live run-through in front of an audience - but for a notebook that you work through at home I think it's fine (as long as the text tells you that you'll need to wait!). The other thing we could do is have on hand the results of a pre-baked run - which could be helpful for comparing results, but would also enable the user to skip the time-consuming part and get straight to the later section where the results of the task are displayed and explained.
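The pre-baked-run idea can be sketched as a simple cache check at the top of the notebook. This is only an illustration: the file name and the `slow_processing` stand-in below are hypothetical, and in the real notebook the slow step would be the actual task run.

```python
import os
import pickle

CACHE = 'prebaked_results.pkl'  # hypothetical pre-baked output shipped with the notebook

def slow_processing():
    # Stand-in for the multi-minute processing run; replace with the real call.
    return {'nSources': 1234}

def get_results():
    # If a pre-baked result file is present, load it and skip the slow step.
    if os.path.exists(CACHE):
        with open(CACHE, 'rb') as f:
            return pickle.load(f)
    results = slow_processing()
    with open(CACHE, 'wb') as f:
        pickle.dump(results, f)  # cache for the next run-through
    return results

print(get_results()['nSources'])  # -> 1234
```

The second time the notebook is executed (or if the pre-baked file ships alongside it), the load branch fires and the reader goes straight to the display-and-explain section.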
Regarding PR threads for notebooks: I agree it's not ideal that we can only make line comments on the raw notebook code, but the PR thread does at least provide a "view" button where you can see the rendered notebook (so it saves you pasting links to the file in the branch). And then of course it enables the usual code review conversation in place, as well as tracking all the contributions made. I'd say that notebook was ready for PR no problem - you should just flag it as not yet ready to merge, in the initial comment. (Personally I like the workflow where you make the PR as soon as the dev branch is made in the base repo - so that that branch has an explanation for its existence.)
As far as I know, the twinkles and HSC data sets are the best curated. My personal thinking was that it was better to use simulated LSST images than real data from HSC. If we use the HSC dataset we should be able to follow the v15.0 tutorial line for line, but I'm not sure we want to do that.
The Python 3 compatibility is a one-line change to the shebang. I've already submitted an issue on this here. We also might be able to get away with something smaller than `medium`, but I haven't checked.
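For the record, a one-line shebang change like that can be applied with `sed`. The file below is a stand-in created just for the demo, since the real `processEimage.py` lives in the obs package:

```shell
# Create a stand-in script with a Python 2-era shebang.
printf '#!/usr/bin/env python\nprint("hello")\n' > processEimage_demo.py

# The one-line fix: point the shebang at python3 (GNU sed; line 1 only).
sed -i '1s|python$|python3|' processEimage_demo.py

head -1 processEimage_demo.py   # -> #!/usr/bin/env python3
```

On systems with BSD `sed` (e.g. macOS) the in-place flag needs an argument (`sed -i ''`).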
1) I think running on medium should be fine. The cluster is fairly large, so I don't think we'll run out of CPUs. I think we will learn a lot about how k8s packs jobs onto nodes and how to configure things correctly.
2) Diffing notebooks is a known issue. There is a package called `nbdime` that looks promising, but it is not integrated into the GitHub UI.
3) The python2 problem should just be fixed. I'll try to get to that tomorrow and then push it to the fork.
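On (2), until something like `nbdime` is wired into the review UI, one common workaround is to strip cell outputs before committing, so the JSON diff only shows real source changes. A minimal sketch (the one-cell notebook dict here is made up for illustration; a real one would come from parsing the `.ipynb` JSON):

```python
import json

def strip_outputs(nb):
    # Clear outputs and execution counts from every code cell in a notebook dict.
    for cell in nb.get('cells', []):
        if cell.get('cell_type') == 'code':
            cell['outputs'] = []
            cell['execution_count'] = None
    return nb

# A made-up one-cell notebook, as if already parsed from its .ipynb JSON.
nb = {'cells': [{'cell_type': 'code', 'source': ['print(1)'],
                 'outputs': [{'output_type': 'stream', 'text': '1\n'}],
                 'execution_count': 3}]}

print(json.dumps(strip_outputs(nb)['cells'][0]['outputs']))  # -> []
```

Run over the file before `git add`, this keeps the committed JSON stable, so the GitHub diff at least reflects only the code and markdown edits.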
I never got to 3 above, but it is on the branch that Heather Kelly is about to merge.
ProcessEImage is now broken with the following error:
```
---------------------------------------------------------------------------
ParentsMismatch Traceback (most recent call last)
/opt/lsst/software/stack/stack/miniconda3-4.3.21-10a4fa6/Linux64/daf_persistence/16.0/python/lsst/daf/persistence/butler.py in _setAndVerifyParentsLists(self, repoDataList)
    925 try:
--> 926     repoData.cfg.extendParents(parents)
    927 except ParentsMismatch as e:
/opt/lsst/software/stack/stack/miniconda3-4.3.21-10a4fa6/Linux64/daf_persistence/16.0/python/lsst/daf/persistence/repositoryCfg.py in extendParents(self, newParents)
    181     "existing parents list in this RepositoryCfg: {}").format(
--> 182     newParents, self._parents))
    183
ParentsMismatch: The beginning of the passed-in parents list: [RepositoryCfg(root='../../../../../DMStackClub/ImageProcessing/DATA', mapper=<class 'lsst.obs.lsstSim.lsstSimMapper.LsstSimMapper'>, mapperArgs={}, parents=[], policy=None), RepositoryCfg(root='../..', mapper=<class 'lsst.obs.lsstSim.lsstSimMapper.LsstSimMapper'>, mapperArgs={}, parents=[], policy=None)] does not match the existing parents list in this RepositoryCfg: [RepositoryCfg(root='/home/kadrlica/notebooks/DMStackClub/ImageProcessing/DATA', mapper=<class 'lsst.obs.lsstSim.lsstSimMapper.LsstSimMapper'>, mapperArgs={}, parents=[], policy=None)]
```
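One plausible reading of the error, given the repo roots quoted above, is that the parents check compares repository roots as strings, so a relative root like `../../../../../DMStackClub/ImageProcessing/DATA` never matches the absolute root recorded in the existing `repositoryCfg`, even when both point at the same directory. The snippet below only illustrates that string-vs-resolved-path mismatch; the comparison being string-based is my assumption about the behavior, not the `daf_persistence` code itself, and the working directory is hypothetical:

```python
import os.path

# Relative root as passed in, and absolute root recorded in the existing cfg
# (both taken from the traceback above).
rel_root = '../../../../../DMStackClub/ImageProcessing/DATA'
abs_root = '/home/kadrlica/notebooks/DMStackClub/ImageProcessing/DATA'

# Plain string comparison fails...
print(rel_root == abs_root)  # -> False

# ...even though, from a working directory five levels below the repo, the
# two roots name the same place once normalized. (Hypothetical cwd.)
cwd = '/home/kadrlica/notebooks/DMStackClub/ImageProcessing/a/b/c'
print(os.path.normpath(os.path.join(cwd, rel_root)) == abs_root)  # -> True
```

If that reading is right, constructing the butler with absolute repo paths (e.g. via `os.path.abspath`) might sidestep the mismatch.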
I've decided to give up on `ProcessEimage` and move to `ProcessCcd` with the HSC data.
Simon suggests that some modifications of configuration parameters could be useful:
He also suggests looking at Jim Bosch's Leon notebook:
He also suggests Jonathan's documentation on configuration:
This is mostly my attempt to run `processEImage.py` on the twinkles data. It has ended up not being as easy as I hoped.