Closed kadrlica closed 4 years ago
I've just merged [DM-21246|https://jira.lsstcorp.org/browse/DM-21246], which allows a butler to be constructed with no `run` or `collection` argument, so after the next weekly goes out it should be safe to update the notebook tutorial.
That should also include better diagnostics related to read-only repository errors, especially if you explicitly pass `writeable=True` or `writeable=False` when constructing the Butler (if you don't, it will create a writeable butler only if you pass a `run`).
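The defaulting rule described above can be written out as a small function. This is only a sketch of the behaviour as stated in this comment, not the actual `Butler` constructor logic; the name `effective_writeable` and the run name are made up for illustration.

```python
from typing import Optional


def effective_writeable(run: Optional[str] = None,
                        writeable: Optional[bool] = None) -> bool:
    """Return whether a butler would be writeable under the rule described above.

    Hypothetical helper, not part of lsst.daf.butler.
    """
    if writeable is not None:
        # An explicit writeable=True / writeable=False always wins.
        return writeable
    # Otherwise the butler is writeable only when a run was passed.
    return run is not None


print(effective_writeable())                      # False: no run, no flag
print(effective_writeable(run="u/someone/test"))  # True: a run was passed
print(effective_writeable(run="u/someone/test", writeable=False))  # False
```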
@bechtol here is a new version of the notebook. We should ask @TallJimbo to take a look, and if the syntax is stable enough, I'd suggest merging this into master.
A few comments:
- `repo` points at the ci_hsc_gen3 git repo, which is not a Gen3 data repo. I think things would be much clearer if you just ignored that directory and defined everything relative to the `DATA` subdirectory within it (which is a Gen3 data repository).
- [...] `Butler` instance with just `butler = Butler(repo)` (assuming `repo == os.path.join(os.environ["CI_HSC_GEN3_DIR"], "DATA")`). Passing the full name of the config file works, but is more advanced usage, and since it's usually a required argument I think establishing a convention of passing it positionally is a good idea.
- [...] `property` to add that checking, but I think it's best to err on the side of caution in general about modifying attributes in Python: if you don't know it's safe, don't do it, because the language has no way of distinguishing for users (even by convention) which attributes are conceptually settable. In this case (and until that property is operational), I would recommend:
  `butler = Butler(butler=butler, collections=collections)`
  (yes, this is the rare case where the first `config` argument is not necessary, which doesn't help my previous argument about passing it positionally, but I think the case where you don't know what collection you're starting with is unlikely to happen often outside examples, and adding a `collections` property would take care of that, too).
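Put together, the suggestions above might look roughly like the following. It is a sketch against the Butler API as discussed in this thread (around w.2020.16), and later releases may differ. The helper names `data_repo_path` and `make_butler` are invented here, and the `lsst.daf.butler` import is deferred so the path helper can be tried without the stack installed.

```python
import os
from typing import Mapping, Optional


def data_repo_path(environ: Mapping[str, str] = os.environ) -> str:
    """Return the Gen3 data repository inside the ci_hsc_gen3 checkout."""
    # The git checkout itself is not a data repo; its DATA subdirectory is.
    return os.path.join(environ["CI_HSC_GEN3_DIR"], "DATA")


def make_butler(repo: str, collections: Optional[str] = None):
    """Construct a butler, optionally re-constructing it with collections."""
    # Imported lazily: this requires the LSST stack to be set up.
    from lsst.daf.butler import Butler

    butler = Butler(repo)  # repo/config passed positionally, as suggested
    if collections is not None:
        # Build a new butler rather than assigning to an attribute.
        butler = Butler(butler=butler, collections=collections)
    return butler


# The path helper works with any mapping, so it can be checked without a stack.
print(data_repo_path({"CI_HSC_GEN3_DIR": "/path/to/ci_hsc_gen3"}))
```

With the stack set up, `make_butler(data_repo_path(), collections="shared/ci_hsc_output")` would follow the pattern recommended in the comment.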
Other than that last issue, I think the syntax here is all pretty stable. I would expect the optional arguments of the various `query` methods to see some minor changes, but probably not the default behavior.
Thanks @TallJimbo I've implemented your suggestions.
Regarding this: "I think the case where you don't know what collection you're starting with is unlikely to happen often outside examples, and adding a collections property would take care of that, too"
Why do you say that it is unlikely to not know what collections you are starting with? It seems like for the immediate future, someone just giving you the path to a repo is fairly common?
Because a Gen2 repo maps more closely to a collection than a Gen3 repo, I think there are going to be very few Gen3 repos in the wild outside of those used in CI or in small-scale development on people's laptops (e.g. perhaps one or two for all of `/datasets` on NCSA GPFS). So when transferring knowledge about processing outputs between people, the repo will generally be implied from context, and the collection will be what they tell you.
I see. So the use case may be more like "I think Keith has been working on something cool, let me see if I can find the collection he was working off of" (the way that I might explore his "work" directory on a shared disk).
@DouglasLeeTucker @bechtol I think this is ready for review.
Yes, good point - that kind of use case is still going to involve butlers without collections. I'd normally expect that use case to be better supported by command-line tools that do that kind of querying, but I suppose that may not be the case in an environment like the LSP where there's a strong weight towards using notebooks over command-line tools.
I thought there was also no guarantee that the repo will be on the local disk. I guess there will be command line tools for accessing the structure of remote repos?
> I guess there will be command line tools for accessing the structure of remote repos?
Yes, exactly.
OK, will look at it soon! --DLT
From: Alex Drlica-Wagner notifications@github.com Reply-To: LSSTScienceCollaborations/StackClub reply@reply.github.com Date: Friday, April 17, 2020 at 12:58 PM To: LSSTScienceCollaborations/StackClub StackClub@noreply.github.com Cc: Douglas L Tucker dtucker@fnal.gov, Mention mention@noreply.github.com Subject: Re: [LSSTScienceCollaborations/StackClub] Gen-3 butler tutorial (#225)
@DouglasLeeTucker @bechtol I think this is ready for review.
Dear Alex,
I ran your update of the Gen-3 butler tutorial, and it worked fine!
While running it, though, I found 2 or 3 very minor typos, all in this one section of text (3 typos if you consider “a butler” instead of “the butler” a typo 😊 , but 2 typos otherwise):
“To create a butler you need to pass it a configuration file and a run name. The run name tells the butler where the place output files. More on Butler configuration can be found here (https://pipelines.lsst.io/modules/lsst.daf.butler/configuring.html). By investigating the directory structue, we find that the 'collection' is…”
• “To create the butler you need to pass it a configuration file and a run name. The run name tells the butler where to place output files. More on Butler configuration can be found here (https://pipelines.lsst.io/modules/lsst.daf.butler/configuring.html). By investigating the directory structure, we find that the 'collection' is…”
I am not sure whether you want to fix those typos now. If not, I can go ahead and OK the pull request.
Thanks!
Best regards, Douglas
Thanks @DouglasLeeTucker! Changed two of those typos, but I think I'm missing the third. Feel free to change the text directly on github if you like, otherwise I think you can approve the pull request.
Looks good! Thanks!
For completeness, it might be useful to show how to get the dataIds with the Gen-3 butler, for example
dataids = [x.dataId for x in registry.queryDatasets('calexp', collections='shared/ci_hsc_output')]
dataId = dataids[0]
calexp = butler.get('calexp', dataId=dataId)
Also:
refs = list(registry.queryDatasets("calexp", collections="shared/ci_hsc_output"))
calexp = butler.getDirect(refs[0])
This avoids repeating some lookups that were already done in the original query, and it ensures that the dataset you found is the dataset you load, which might not be the case otherwise if the collections you pass to `queryDatasets` differ from `butler.collections`.
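To see why the second lookup matters, here is a toy model of the situation described above. It does not use the LSST API at all: the registry dictionary, the collection `u/someone/rerun`, the data ID values, and the function names are all invented, and the real lookup is more involved.

```python
# Toy model: (collection, dataset type, data ID) -> dataset ID.
# Data IDs are (visit, detector) tuples; every value here is invented.
TOY_REGISTRY = {
    ("shared/ci_hsc_output", "calexp", (1, 2)): 101,
    ("u/someone/rerun", "calexp", (1, 2)): 202,
}


def query_datasets(dataset_type, collections):
    """Return (dataset_id, data_id) pairs found in the given collection."""
    return [
        (dataset_id, data_id)
        for (collection, dtype, data_id), dataset_id in TOY_REGISTRY.items()
        if collection == collections and dtype == dataset_type
    ]


def get_by_data_id(dataset_type, data_id, butler_collection):
    """Second lookup: resolve the data ID again in the butler's own collection."""
    return TOY_REGISTRY[(butler_collection, dataset_type, data_id)]


def get_direct(dataset_id):
    """Load exactly the dataset the query returned; no second lookup."""
    return dataset_id


found_id, data_id = query_datasets("calexp", "shared/ci_hsc_output")[0]
via_data_id = get_by_data_id("calexp", data_id, butler_collection="u/someone/rerun")
via_ref = get_direct(found_id)
print(found_id, via_data_id, via_ref)  # 101 202 101
```

The data-ID route silently returns a different dataset when the butler's collection differs from the one queried, while the direct route returns the dataset that was actually found.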
Another variant:
dataIds = list(registry.queryDimensions(["exposure", "detector"], datasets=["calexp"], collections="shared/ci_hsc_output"))
calexp = butler.get('calexp', dataId=dataIds[0])
This one does behave like the original in terms of doing a second lookup with possibly-different collections, but the original query is more flexible in that you could provide multiple datasets (requiring an instance of all datasets to be available for that data ID) or ask for different data ID keys than what is used to identify the dataset, which will invoke various built-in relationships.
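The "multiple datasets" behaviour mentioned above amounts to an intersection: a data ID is returned only if every requested dataset type exists for it. A toy illustration in plain Python follows; the inventory and the function name are invented and this is not how the registry is implemented.

```python
# Toy inventory: dataset type -> set of (exposure, detector) data IDs.
# All values are invented for illustration.
AVAILABLE = {
    "calexp": {(1, 10), (1, 11), (2, 10)},
    "src": {(1, 10), (2, 10), (2, 11)},
}


def data_ids_with_all(dataset_types):
    """Return data IDs for which every requested dataset type exists."""
    sets = [AVAILABLE[name] for name in dataset_types]
    return sorted(set.intersection(*sets))


print(data_ids_with_all(["calexp"]))         # [(1, 10), (1, 11), (2, 10)]
print(data_ids_with_all(["calexp", "src"]))  # [(1, 10), (2, 10)]
```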
Sorry, stupid shift+enter from slack...
Do I need to use `butler.getDirect` or is `get` smart enough to identify that I've given it a dataRef and pass it to `getDirect`?
Also, it would be nice to show an example of syntax to select a subset of dataIds, e.g., the calexps for a particular filter.
> Do I need to use `butler.getDirect` or is `get` smart enough to identify that I've given it a dataRef and pass it to `getDirect`?
`get` can take a `DatasetRef`, and it will also check that the one retrieved from the butler's collection is the same one fully identified by the `DatasetRef` if `DatasetRef.id` is not `None` (`DatasetRefs` can either be in a "resolved" state in which they have an ID that fully identifies them independent of collection, or in an "unresolved" state in which they're no more than a combination of dataset type and data ID).
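The resolved/unresolved distinction and the consistency check can be mimicked with a small stand-in class. This is a sketch of the idea only: `ToyRef` and `toy_get` are invented names, not the real `DatasetRef` or `Butler.get`, and the real check may behave differently in detail.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ToyRef:
    """Stand-in for a dataset reference; not the real DatasetRef class."""
    dataset_type: str
    data_id: Tuple[int, int]
    id: Optional[int] = None  # None means "unresolved"

    @property
    def resolved(self) -> bool:
        return self.id is not None


def toy_get(ref: ToyRef, collection_contents: Dict) -> int:
    """Look the ref up by dataset type and data ID, then check the ID if present."""
    found_id = collection_contents[(ref.dataset_type, ref.data_id)]
    if ref.resolved and ref.id != found_id:
        raise ValueError(
            f"ref has id={ref.id} but the collection holds id={found_id}"
        )
    return found_id


contents = {("calexp", (1, 2)): 101}
print(toy_get(ToyRef("calexp", (1, 2)), contents))          # 101 (unresolved: no check)
print(toy_get(ToyRef("calexp", (1, 2), id=101), contents))  # 101 (resolved, IDs agree)
```

A resolved ref whose ID disagrees with the collection, such as `ToyRef("calexp", (1, 2), id=999)`, raises `ValueError` in this sketch.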
A tip I just learned: when processing the `ci_hsc_gen3` dataset, the butler is evolving quickly enough that you need to take some care to align your version of the Stack with the version of `ci_hsc_gen3`. This can be done by checking out the weekly tag of `ci_hsc_gen3`, e.g.,
git checkout w.2020.16
I suggest that we add this to the description at the top of the notebook, along with the standard line to check what version of the stack you are using, e.g.,
eups list -s | grep lsst_distrib
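One small wrinkle when matching the two is that weekly tags are spelled differently in the two places: eups uses underscores (`w_2020_16`) while the git tag uses dots (`w.2020.16`). A minimal sketch of a helper that converts between them is below; the function names and the sample `eups list` line (including its version string) are invented for illustration.

```python
import re
from typing import List


def eups_to_git_weekly(eups_tag: str) -> str:
    """Convert an eups weekly tag (w_2020_16) to the git tag form (w.2020.16)."""
    match = re.fullmatch(r"w_(\d{4})_(\d{2})", eups_tag)
    if match is None:
        raise ValueError(f"not a weekly eups tag: {eups_tag!r}")
    year, week = match.groups()
    return f"w.{year}.{week}"


def weekly_tags_in(eups_line: str) -> List[str]:
    """Pick weekly tags out of one line of `eups list -s` output."""
    return re.findall(r"\bw_\d{4}_\d{2}\b", eups_line)


# Hypothetical output line; the version string is made up.
sample = "lsst_distrib   19.0.0-example   current w_2020_16 setup"
tags = weekly_tags_in(sample)
print(tags)                         # ['w_2020_16']
print(eups_to_git_weekly(tags[0]))  # w.2020.16
```

The resulting string is what you would pass to `git checkout` in the `ci_hsc_gen3` clone.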
I guess you just want us to emphasize this again?
The correspondence between the release used to generate the repo and the release used to run the notebook has been discussed extensively (in fact, the repo name explicitly includes the release version). All stack club notebooks should include the release version at the top (as this notebook does), and I think we state clearly in some general instructions that notebooks are not guaranteed to work with other versions.
Gen-3 butler development is moving so rapidly that there is no way we are going to keep this notebook at the bleeding edge.
At some point I will get around to merging this PR, but I got stalled because one of the suggestions from Jim did not execute properly (this may again be a version issue).
@bechtol I added an example to query by filter. I am going to merge this PR, but I would not object to someone else continuing to develop this notebook.
@DouglasLeeTucker I've started a branch and notebook for us to collaborate on as we learn more about the Gen-3 butler.
This PR is READY to merge