Control which SED goes in the catalog

cdeil commented 6 years ago

We have to implement a way to control which SED goes in the catalog.

Yesterday I made this change: https://github.com/gammapy/gamma-cat/blob/ba11b084a1342d64ca80c35a8b46da2f59f20889/gammacat/catalog.py#L604

As you can see in these updated examples, there were cases before and after where the SED doesn't match the spectral model we have in the catalog: https://github.com/cdeil/gamma-cat-status/commit/218dfa996b08f468333f27f129d7ead81b5e568b

This is a reminder issue for myself to implement a good solution for this and to document it properly.

pdeiml commented 6 years ago

I read through the code which makes the collection and the catalog and I have a few suggestions.

We have https://github.com/gammapy/gamma-cat/blob/master/input/gammacat/gamma_cat_dataset.yaml which is intended to control which datasets (tev-*.yaml) go into the collection. But - as you can see in https://github.com/gammapy/gamma-cat/blob/master/gammacat/collection.py#L275 - it is not used to control it. The datasets which are being written to the collection are searched by going through the info.yaml files. I think this is not the idea of https://github.com/gammapy/gamma-cat/blob/master/input/gammacat/gamma_cat_dataset.yaml Moreover, https://github.com/gammapy/gamma-cat/blob/master/input/gammacat/gamma_cat_dataset.yaml is not used for the seds and lightcurves, too.

Therefore, my suggestion is the following: We keep the CollectionMaker as it is, meaning that the list of leds, seds and datasets which go into the collection are produced by scanning through the info.yaml files and the folders, respectively (I don't see a reason why a sed, lc, dataset which exists in the input should not go to the collection). To control which sed, lc, dataset goes to the catalog, we use https://github.com/gammapy/gamma-cat/blob/master/input/gammacat/gamma_cat_dataset.yaml for the datasets and we write two new yaml-files gamma_cat_lightcurve.yaml and gamma_cat_sed.yaml. Another way to implement it whould be to only have one file for the control of the catalog which could be formatted like this (the references are arbitrary, I only want to show the format):

source_id: 1 status: source sed: 2013ApJ...764...38A lc: 2015ApJ...456..40H dataset: 2017A&A..56...40A

cdeil commented 6 years ago

@pdeiml - gamma_cat_dataset.yaml is the configuration file that controls the content of the catalog.

You can see that it's used from catalog.py here: https://github.com/gammapy/gamma-cat/search?utf8=%E2%9C%93&q=gamma_cat_dataset.yaml+&type=

Of course, a fix for this issue and generally improvements to the catalog making / configuration are very welcome! If you take on this task, please make small pull requests that improve code / docs / content of the catalog that are easy and quick to review for me. (and not a large restructuring / rewrite, together with a fix for this issue)

Since it seems it wasn't clear to you how the catalog is built, starting by improving the contributor docs (and maybe renaming gamma_cat_dataset.yaml to gamma_cat_catalog_config.yaml or something) on that point would probably be the best first step.

For users, the most useful improvement (apart from fixing this issue which SED is filled in the catalog) would be to explain what's in the catalog here: https://gamma-cat.readthedocs.io/data/catalog.html

Again, I would suggest to make small improvements, e.g. just to link to https://github.com/gammapy/gamma-cat/blob/master/input/gammacat/gamma_cat_columns.yaml and say that those columns are available, or add a link to https://github.com/gammapy/gamma-cat/blob/master/input/gammacat/gamma_cat_dataset.yaml and explain that this controls the content of the catalog.

pdeiml commented 6 years ago

Well, gamma_cat_dataset.yaml is written down in catalog.py but if you take a closer look to the CatalogMaker, you see in https://github.com/gammapy/gamma-cat/blob/master/gammacat/catalog.py#L576 that not the gamma_cat_dataset.yaml in the input folder is used but the one in the output folder which is basically a sum of all output resources. This means that in fact input/gamma_cat_dataset.yaml is not used by the catalog (I am sure that make.py will break as soon as I delete the file but it is not really used).

But what about my suggestion above with one single file for all types of data and that we copy the datasets by scanning through the folders?

cdeil commented 6 years ago

That sounds like a regression bug then. Initially when I wrote and used this, gamma_cat_dataset.yaml from the input folder was used. Can you start by making a PR fixing this? In this case, please also update this output file (or all catalog output files) so that we can see what changes in the catalog: https://github.com/gammapy/gamma-cat/blob/master/output/gammacat.yaml

It would also be good to have some checks, e.g. you could add a test_catalog.py here: https://github.com/gammapy/gamma-cat/tree/master/gammacat/tests which reads the catalog and puts some asserts to make sure it's filled OK. Especially "regression tests", i.e. some information / number that is currently not filled correctly, is a good candidate to add an assert line.

But what about my suggestion above with one single file for all types of data and that we copy the datasets by scanning through the folders?

I don't understand. IMO the way it should work is that only input folder is scanned once to make an input index file. Then the output collection is made, starting with the input index. Then the catalog is made, from the output collection (via the output index), plus the config file what to put in the catalog. I thought it already worked that way; if not, please fix.

@pdeiml - I'm very busy this week, but if there's still questions, I'd prefer a 10 min call over Github back and forth.

pdeiml commented 6 years ago

I will take a closer look to this but I think that is a regression bug, yes.

I totally agree with you about the procedure at the end of your comment but I think that is not working like that. I will look at this problem as well the next days.

cdeil commented 6 years ago

@pdeiml - Thank you!

gammapy / gamma-cat

Control which SED goes in the catalog #219