kadrlica commented 5 years ago

@DouglasLeeTucker I've started a branch and notebook for us to collaborate on as we learn more about the Gen-3 butler.

This PR is READY to merge

TallJimbo commented 4 years ago

I've just merged [DM-21246](https://jira.lsstcorp.org/browse/DM-21246), which allows a butler to be constructed with no run or collection argument, so after the next weekly goes out it should be safe to update the notebook tutorial.

That should also include better diagnostics related to read-only repository errors, especially if you explicitly pass writeable=True or writeable=False when constructing the Butler (if you don't, it will create a writeable butler only if you pass a run).
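For the notebook text, the default described above can be written down as a small decision rule. This is only a sketch of the rule as stated in this comment, not the Butler source; `effective_writeable` is a hypothetical helper name and is not part of `lsst.daf.butler`.

```python
def effective_writeable(run=None, writeable=None):
    """Illustrate the default described above (hypothetical helper).

    An explicit writeable=True/False always wins; otherwise the butler
    is writeable only when a run was passed.
    """
    if writeable is not None:
        return bool(writeable)
    return run is not None


# No run and no explicit flag: read-only.
print(effective_writeable())
# Passing a run (the name here is made up) implies a writeable butler.
print(effective_writeable(run="u/someone/test"))
```

Passing `writeable` explicitly is what should give the clearer read-only diagnostics mentioned above.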

kadrlica commented 4 years ago

@bechtol here is a new version of the notebook. We should ask @TallJimbo to take a look, and if the syntax is stable enough, I'd suggest merging this into master.

TallJimbo commented 4 years ago

A few comments:

The current definition of repo points at the ci_hsc_gen3 git repo, which is not a Gen3 data repo. I think things would be much clearer if you just ignored that directory and defined everything relative to the DATA subdirectory within it (which is a Gen3 data repository).
With that in place, I would recommend initializing the Butler instance with just butler = Butler(repo) (assuming repo == os.path.join(os.environ["CI_HSC_GEN3_DIR"], "DATA")). Passing the full name of the config file works, but is more advanced usage, and since it's usually a required argument I think establishing a convention of passing it positionally is a good idea.
Setting the collections attribute as in cell 19 is not recommended; it skips some checks that happen at Butler construction that would provide better error messages than what you'd otherwise get. It sounds like I might want to go make that a property to add that checking, but I think it's best to err on the side of caution in general about modifying attributes in Python: if you don't know it's safe, don't do it, because the language has no way of distinguishing for users (even by convention) which attributes are conceptually settable. In this case (and until that property is operational), I would recommend:
```
butler = Butler(butler=butler, collections=collections)
```
(yes, this is the rare case where the first config argument is not necessary, which doesn't help my previous argument about passing it positionally, but I think the case where you don't know what collection you're starting with is unlikely to happen often outside examples, and adding a collections property would take care of that, too).
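A minimal sketch of the two construction steps recommended above: point at the `DATA` subdirectory, construct the butler positionally, then derive a second butler bound to the collections. The helper names and the `butler_cls` parameter are hypothetical; `butler_cls` stands in for `lsst.daf.butler.Butler` so the snippet can be read (and exercised with a stand-in class) without the stack set up.

```python
import os


def gen3_repo_path(environ=os.environ):
    """Return the Gen3 data repo: the DATA subdirectory of ci_hsc_gen3."""
    return os.path.join(environ["CI_HSC_GEN3_DIR"], "DATA")


def butler_with_collections(butler_cls, repo, collections):
    """Construct a butler from the repo path, then rebuild it with collections.

    Rebuilding (rather than assigning to butler.collections) keeps the
    checks that run at construction time.
    """
    butler = butler_cls(repo)  # config/repo path passed positionally
    return butler_cls(butler=butler, collections=collections)


# Intended usage inside the notebook (requires the LSST stack):
#   from lsst.daf.butler import Butler
#   butler = butler_with_collections(Butler, gen3_repo_path(), "shared/ci_hsc_output")
```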

Other than that last issue, I think the syntax here is all pretty stable. I would expect the optional arguments of the various query methods to see some minor changes, but probably not the default behavior.

kadrlica commented 4 years ago

Thanks @TallJimbo I've implemented your suggestions.

Regarding this: "I think the case where you don't know what collection you're starting with is unlikely to happen often outside examples, and adding a collections property would take care of that, too"

Why do you say that it is unlikely to not know what collections you are starting with? It seems like for the immediate future, someone just giving you the path to a repo is fairly common?

TallJimbo commented 4 years ago

Because a Gen2 repo maps more closely to a collection than a Gen3 repo, I think there are going to be very few Gen3 repos in the wild outside of those used in CI or in small-scale development on people's laptops (e.g. perhaps one or two for all of /datasets on NCSA GPFS). So when transferring knowledge about processing outputs between people, the repo will generally be implied from context, and the collection will be what they tell you.

kadrlica commented 4 years ago

I see. So the use case may be more like "I think Keith has been working on something cool, let me see if I can find the collection he was working off of" (the way that I might explore his "work" directory on a shared disk).

kadrlica commented 4 years ago

@DouglasLeeTucker @bechtol I think this is ready for review.

TallJimbo commented 4 years ago

Yes, good point - that kind of use case is still going to involve butlers without collections. I'd normally expect that use case to be better supported by command-line tools that do that kind of querying, but I suppose that may not be the case in an environment like the LSP where there's a strong weight towards using notebooks over command-line tools.

kadrlica commented 4 years ago

I thought there was also no guarantee that the repo will be on the local disk. I guess there will be command line tools for accessing the structure of remote repos?

TallJimbo commented 4 years ago

> I guess there will be command line tools for accessing the structure of remote repos?

Yes, exactly.

DouglasLeeTucker commented 4 years ago

OK, will look at it soon! --DLT

--

Douglas L. Tucker Fermilab Tel: +1-630-840-2267 MS 127 FAX: +1-630-840-8274 PO Box 500 E-mail: dtucker@fnal.gov Batavia, IL 60510 USA http://home.fnal.gov/~dtucker/

From: Alex Drlica-Wagner notifications@github.com Reply-To: LSSTScienceCollaborations/StackClub reply@reply.github.com Date: Friday, April 17, 2020 at 12:58 PM To: LSSTScienceCollaborations/StackClub StackClub@noreply.github.com Cc: Douglas L Tucker dtucker@fnal.gov, Mention mention@noreply.github.com Subject: Re: [LSSTScienceCollaborations/StackClub] Gen-3 butler tutorial (#225)


DouglasLeeTucker commented 4 years ago

Dear Alex,

I ran your update of the Gen-3 butler tutorial, and it worked fine!

While running it, though, I found 2 or 3 very minor typos, all in this one section of text (3 typos if you consider “a butler” instead of “the butler” a typo 😊, but 2 typos otherwise):

“To create a butler you need to pass it a configuration file and a run name. The run name tells the butler where the place output files. More on Butler configuration can be found here (https://pipelines.lsst.io/modules/lsst.daf.butler/configuring.html). By investigating the directory structue, we find that the 'collection' is…”

→ “To create the butler you need to pass it a configuration file and a run name. The run name tells the butler where to place output files. More on Butler configuration can be found here (https://pipelines.lsst.io/modules/lsst.daf.butler/configuring.html). By investigating the directory structure, we find that the 'collection' is…”

I am not sure if you want to fix those typos now. If not, I can go ahead and OK the pull request.

Thanks!

Best regards, Douglas


kadrlica commented 4 years ago

Thanks @DouglasLeeTucker! Changed two of those typos, but I think I'm missing the third. Feel free to change the text directly on github if you like, otherwise I think you can approve the pull request.

DouglasLeeTucker commented 4 years ago

Looks good! Thanks!

bechtol commented 4 years ago

For completeness, it might be useful to show how to get the dataIds with the Gen-3 butler, for example

dataids = [x.dataId for x in registry.queryDatasets('calexp', collections='shared/ci_hsc_output')]
dataId = dataids[0]
calexp = butler.get('calexp', dataId=dataId)

TallJimbo commented 4 years ago

Also:

refs = list(registry.queryDatasets("calexp", collections="shared/ci_hsc_output"))
calexp = butler.getDirect(refs[0])

This avoids repeating some lookups that were already done in the original query, and it ensures that the dataset you found is the dataset you load, which might not be the case otherwise if the collections you pass to queryDatasets differ from butler.collections.

Another variant:

dataIds = list(registry.queryDimensions(["exposure", "detector"], datasets=["calexp"], collections="shared/ci_hsc_output"))
calexp = butler.get('calexp', dataId=dataIds[0])

This one does behave like the original in terms of doing a second lookup with possibly-different collections, but the original query is more flexible in that you could provide multiple datasets (requiring an instance of all datasets to be available for that data ID) or ask for different data ID keys than what is used to identify the dataset, which will invoke various built-in relationships.
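For the notebook, the first variant above (query, then `getDirect`) could be wrapped as below. This is a sketch under assumptions: `load_first` is a hypothetical helper, `butler` and `registry` are whatever Gen3 objects the notebook already holds, and only the `queryDatasets` and `getDirect` calls quoted in this thread are used, as they exist in the stack version under discussion.

```python
def load_first(butler, registry, dataset_type, collections):
    """Load the first dataset of the given type found in the collections.

    The resolved reference returned by the query is handed straight to
    getDirect, so the dataset that was found is the dataset that is loaded.
    """
    refs = list(registry.queryDatasets(dataset_type, collections=collections))
    if not refs:
        raise LookupError(
            f"no {dataset_type!r} datasets found in collections {collections!r}"
        )
    return butler.getDirect(refs[0])


# Intended usage (requires a Gen3 butler and its registry):
#   calexp = load_first(butler, registry, "calexp", "shared/ci_hsc_output")
```

Raising on an empty result is a design choice for the tutorial; it gives a clearer message than an `IndexError` from `refs[0]`.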

kadrlica commented 4 years ago

Sorry, stupid shift+enter from slack...

Do I need to use butler.getDirect or is get smart enough to identify that I've given it a dataRef and pass it to getDirect?

bechtol commented 4 years ago

It would also be nice to show an example of the syntax for selecting a subset of dataIds, e.g., the calexps for a particular filter.
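One way to sketch such a selection is to filter the query results on the Python side by data ID keys. This assumes each reference exposes a mapping-like `dataId`, as in the examples above. `select_refs` is a hypothetical helper, and the key naming the filter depends on the stack version and on whether the data IDs carry it, so `abstract_filter` below is an assumption to check against your registry.

```python
def select_refs(refs, **criteria):
    """Keep only the refs whose dataId matches every key=value pair given.

    A ref whose dataId lacks one of the keys is treated as not matching.
    """
    return [
        ref
        for ref in refs
        if all(ref.dataId.get(key) == value for key, value in criteria.items())
    ]


# Intended usage (the filter key name is an assumption; adjust to your stack):
#   refs = registry.queryDatasets("calexp", collections="shared/ci_hsc_output")
#   r_band_refs = select_refs(refs, abstract_filter="r")
```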

TallJimbo commented 4 years ago

> Do I need to use butler.getDirect or is get smart enough to identify that I've given it a dataRef and pass it to getDirect?

get can take a DatasetRef, and it will also check that the one retrieved from the butler's collection is the same one fully identified by the DatasetRef if DatasetRef.id is not None (DatasetRefs can either be in a "resolved" state, in which they have an ID that fully identifies them independent of collection, or in an "unresolved" state, in which they're no more than a combination of dataset type and data ID).
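To make the resolved/unresolved distinction concrete, here is a toy model of the check described above. `ToyDatasetRef` and `check_found` are illustrative stand-ins written for this comment, not the real `lsst.daf.butler` classes or logic; the only thing taken from the description is that a reference with a non-`None` `id` is resolved and constrains what may be returned.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDatasetRef:
    """Toy stand-in for a DatasetRef; NOT the real lsst.daf.butler class."""

    dataset_type: str
    data_id: dict
    id: Optional[int] = None  # None means the ref is "unresolved"

    @property
    def resolved(self) -> bool:
        return self.id is not None


def check_found(requested: ToyDatasetRef, found: ToyDatasetRef) -> ToyDatasetRef:
    """Return `found`, unless a resolved request names a different dataset."""
    if requested.resolved and requested.id != found.id:
        raise ValueError(
            f"requested dataset id {requested.id} but the collection holds {found.id}"
        )
    return found


in_collection = ToyDatasetRef("calexp", {"detector": 1}, id=42)
unresolved = ToyDatasetRef("calexp", {"detector": 1})
# An unresolved ref is only (dataset type, data ID), so any match is accepted.
assert check_found(unresolved, in_collection) is in_collection
```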

bechtol commented 4 years ago

A tip I just learned: when processing the ci_hsc_gen3 dataset, the butler is evolving quickly enough that you need to take some care to align your version of the Stack with the version of ci_hsc_gen3. This can be done by checking out the matching weekly tag of ci_hsc_gen3, e.g.,

git checkout w.2020.16

I suggest that we add this to the description at the top of the notebook, along with the standard line to check what version of the stack you are using, e.g.,

eups list -s | grep lsst_distrib
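If the notebook wants to derive the tag to check out from the stack version, a small conversion helper is one option. This sketch assumes the weekly appears as an eups tag of the form `w_2020_16` in the `eups list` output, while the git tag uses dots as in the `git checkout` line above; `git_tag_from_eups_tag` is a hypothetical name.

```python
import re


def git_tag_from_eups_tag(eups_tag):
    """Convert a weekly eups tag (assumed form w_YYYY_WW) to a git tag (w.YYYY.WW)."""
    match = re.fullmatch(r"w_(\d{4})_(\d{2})", eups_tag)
    if match is None:
        raise ValueError(f"not a weekly eups tag: {eups_tag!r}")
    year, week = match.groups()
    return f"w.{year}.{week}"


print(git_tag_from_eups_tag("w_2020_16"))  # the tag to pass to `git checkout`
```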

kadrlica commented 4 years ago

I guess you just want us to emphasize this again?

The correspondence between the release used to generate the repo and the release used to run the notebook has been discussed extensively (in fact, the repo name explicitly includes the release version). All stack club notebooks should include the release version at the top (as this notebook does), and I think we state clearly in some general instructions that notebooks are not guaranteed to work with other versions.

Gen-3 butler development is moving so rapidly that there is no way we are going to keep this notebook at the bleeding edge.

At some point I will get around to merging this PR, but I got stalled because one of the suggestions from Jim did not execute properly (this may again be a version issue).

kadrlica commented 4 years ago

@bechtol I added an example to query by filter. I am going to merge this PR, but I would not object to someone else continuing to develop this notebook.

LSSTScienceCollaborations / StackClub

Gen-3 butler tutorial #225
