Open kadrlica opened 6 years ago
"It has ended up not being as easy as I hoped."
First contender for Stack Club tagline? :-)
Do you want to submit a PR and use it to point out some sticking points? Might be good for all of us watching to appreciate the challenges. Thanks for taking a crack at this!
The current version of the notebook can be found here, but I'm not sure it's ready for a PR. The notebook is not fully functional, and the GitHub PR interface shows changes to the JSON, which I don't find especially enlightening (though maybe I'm missing something?).
Some of the sticking points:

- `processEimage.py` currently relies on Python 2, but the stack versions on JupyterLab are all Python 3. I've created my own version of `processEimage.py` that updates the Python version, and it appears to work, but it cannot be distributed with the notebook. In the notebook I directly interface with the `ProcessEimageTask` API, but this is clunky.
- Reference catalogs (`astrometry_net_data`), which are not installed in current versions of the stack (UNRESOLVED)
- Memory errors on the `tiny` JupyterLab instance (this can be solved by choosing `medium`)

Thanks Alex.
`processEimage.py` relying on Python 2 sounds like a bad sign: it could be falling behind the stack, perhaps through dis-use. Is `processCcd.py` itself Python 3-compatible? If so, shelving the `processEimage.py` part of the tutorial and using a different (and raw) test image might be the way to go, I guess. Would that solve the reference catalog problem as well?
The memory issue looks serious. Suppose we insist on being able to run `processCcd.py` from a tutorial notebook, @SimonKrughoff - what would the implications be for the NCSA accounts? Is `medium` an allowable option for Stack Club members?
A couple of minutes' runtime is awkwardly long for a live run-through in front of an audience - but for a notebook that you work through at home I think it's fine (as long as the text tells you that you'll need to wait!). The other thing we could do is have on hand the results of a pre-baked run - which could be helpful for comparing results, but would also enable the user to skip the time-consuming part and get straight to the later section where the results of the task are displayed and explained.
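The pre-baked-run idea can be sketched as a simple cache check at the top of the notebook. This is only an illustration: the file name and the `slow_processing` stand-in below are hypothetical, and in the real notebook the slow step would be the actual task run.

```python
import os
import pickle

CACHE = 'prebaked_results.pkl'  # hypothetical pre-baked output shipped with the notebook

def slow_processing():
    # Stand-in for the multi-minute processing run; replace with the real call.
    return {'nSources': 1234}

def get_results():
    # If a pre-baked result file is present, load it and skip the slow step.
    if os.path.exists(CACHE):
        with open(CACHE, 'rb') as f:
            return pickle.load(f)
    results = slow_processing()
    with open(CACHE, 'wb') as f:
        pickle.dump(results, f)  # cache for the next run-through
    return results

print(get_results()['nSources'])  # -> 1234
```

The second time the notebook is executed (or if the pre-baked file ships alongside it), the load branch fires and the reader goes straight to the display-and-explain section.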
Regarding PR threads for notebooks: I agree it's not ideal that we can only make line comments on the raw notebook code, but the PR thread does at least provide a "view" button where you can see the rendered notebook (so it saves you pasting links to the file in the branch). And then of course it enables the usual code review conversation in place, as well as tracking all the contributions made. I'd say that notebook was ready for PR no problem - you should just flag it as not yet ready to merge, in the initial comment. (Personally I like the workflow where you make the PR as soon as the dev branch is made in the base repo - so that that branch has an explanation for its existence.)
As far as I know, the twinkles and HSC data sets are the best curated. My personal thinking was that it was better to use simulated LSST images than real data from HSC. If we use the HSC dataset we should be able to follow the v15.0 tutorial line for line, but I'm not sure we want to do that.
The Python 3 compatibility is a one-line change to the shebang. I've already submitted an issue on this here. We also might be able to get away with something smaller than `medium`, but I haven't checked.
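For the record, a one-line shebang change like that can be applied with `sed`. The file below is a stand-in created just for the demo, since the real `processEimage.py` lives in the obs package:

```shell
# Create a stand-in script with a Python 2-era shebang.
printf '#!/usr/bin/env python\nprint("hello")\n' > processEimage_demo.py

# The one-line fix: point the shebang at python3 (GNU sed; line 1 only).
sed -i '1s|python$|python3|' processEimage_demo.py

head -1 processEimage_demo.py   # -> #!/usr/bin/env python3
```

On systems with BSD `sed` (e.g. macOS) the in-place flag needs an argument (`sed -i ''`).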
1) I think running on medium should be fine. The cluster is fairly large, so I don't think we'll run out of CPUs. I think we will learn a lot about how k8s packs jobs onto nodes and how to configure things correctly.
2) Diffing notebooks is a known issue. There is a package called `nbdime` that looks promising, but it is not integrated into the GitHub UI.
3) The python2 problem should just be fixed. I'll try to get to that tomorrow and then push it to the fork.
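On (2), until something like `nbdime` is wired into the review UI, one common workaround is to strip cell outputs before committing, so the JSON diff only shows real source changes. A minimal sketch (the one-cell notebook dict here is made up for illustration; a real one would come from parsing the `.ipynb` JSON):

```python
import json

def strip_outputs(nb):
    # Clear outputs and execution counts from every code cell in a notebook dict.
    for cell in nb.get('cells', []):
        if cell.get('cell_type') == 'code':
            cell['outputs'] = []
            cell['execution_count'] = None
    return nb

# A made-up one-cell notebook, as if already parsed from its .ipynb JSON.
nb = {'cells': [{'cell_type': 'code', 'source': ['print(1)'],
                 'outputs': [{'output_type': 'stream', 'text': '1\n'}],
                 'execution_count': 3}]}

print(json.dumps(strip_outputs(nb)['cells'][0]['outputs']))  # -> []
```

Run over the file before `git add`, this keeps the committed JSON stable, so the GitHub diff at least reflects only the code and markdown edits.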
I never got to 3 above, but it is on the branch that Heather Kelly is about to merge.
ProcessEImage is now broken with the following error:
```
---------------------------------------------------------------------------
ParentsMismatch Traceback (most recent call last)
/opt/lsst/software/stack/stack/miniconda3-4.3.21-10a4fa6/Linux64/daf_persistence/16.0/python/lsst/daf/persistence/butler.py in _setAndVerifyParentsLists(self, repoDataList)
    925 try:
--> 926     repoData.cfg.extendParents(parents)
    927 except ParentsMismatch as e:
/opt/lsst/software/stack/stack/miniconda3-4.3.21-10a4fa6/Linux64/daf_persistence/16.0/python/lsst/daf/persistence/repositoryCfg.py in extendParents(self, newParents)
    181     "existing parents list in this RepositoryCfg: {}").format(
--> 182     newParents, self._parents))
    183
ParentsMismatch: The beginning of the passed-in parents list: [RepositoryCfg(root='../../../../../DMStackClub/ImageProcessing/DATA', mapper=<class 'lsst.obs.lsstSim.lsstSimMapper.LsstSimMapper'>, mapperArgs={}, parents=[], policy=None), RepositoryCfg(root='../..', mapper=<class 'lsst.obs.lsstSim.lsstSimMapper.LsstSimMapper'>, mapperArgs={}, parents=[], policy=None)] does not match the existing parents list in this RepositoryCfg: [RepositoryCfg(root='/home/kadrlica/notebooks/DMStackClub/ImageProcessing/DATA', mapper=<class 'lsst.obs.lsstSim.lsstSimMapper.LsstSimMapper'>, mapperArgs={}, parents=[], policy=None)]
```
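One plausible reading of the error, given the repo roots quoted above, is that the parents check compares repository roots as strings, so a relative root like `../../../../../DMStackClub/ImageProcessing/DATA` never matches the absolute root recorded in the existing `repositoryCfg`, even when both point at the same directory. The snippet below only illustrates that string-vs-resolved-path mismatch; the comparison being string-based is my assumption about the behavior, not the `daf_persistence` code itself, and the working directory is hypothetical:

```python
import os.path

# Relative root as passed in, and absolute root recorded in the existing cfg
# (both taken from the traceback above).
rel_root = '../../../../../DMStackClub/ImageProcessing/DATA'
abs_root = '/home/kadrlica/notebooks/DMStackClub/ImageProcessing/DATA'

# Plain string comparison fails...
print(rel_root == abs_root)  # -> False

# ...even though, from a working directory five levels below the repo, the
# two roots name the same place once normalized. (Hypothetical cwd.)
cwd = '/home/kadrlica/notebooks/DMStackClub/ImageProcessing/a/b/c'
print(os.path.normpath(os.path.join(cwd, rel_root)) == abs_root)  # -> True
```

If that reading is right, constructing the butler with absolute repo paths (e.g. via `os.path.abspath`) might sidestep the mismatch.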
I've decided to give up on `ProcessEimage` and move to `ProcessCcd` with the HSC data.
Simon suggests that some modifications of configuration parameters could be useful:
He also suggests looking at Jim Bosch's Leon notebook:
He also suggests Jonathan's documentation on configuration:
This is mostly my attempt to run `processEImage.py` on the twinkles data. It has ended up not being as easy as I hoped.