Open cdeil opened 6 years ago
I read through the code which makes the collection and the catalog and I have a few suggestions.
We have https://github.com/gammapy/gamma-cat/blob/master/input/gammacat/gamma_cat_dataset.yaml which is intended to control which datasets (tev-*.yaml) go into the collection. But - as you can see in https://github.com/gammapy/gamma-cat/blob/master/gammacat/collection.py#L275 - it is not used to control it. The datasets which are being written to the collection are searched by going through the info.yaml files. I think this is not the idea of https://github.com/gammapy/gamma-cat/blob/master/input/gammacat/gamma_cat_dataset.yaml Moreover, https://github.com/gammapy/gamma-cat/blob/master/input/gammacat/gamma_cat_dataset.yaml is not used for the seds and lightcurves, too.
Therefore, my suggestion is the following: We keep the CollectionMaker as it is, meaning that the list of leds, seds and datasets which go into the collection are produced by scanning through the info.yaml files and the folders, respectively (I don't see a reason why a sed, lc, dataset which exists in the input should not go to the collection). To control which sed, lc, dataset goes to the catalog, we use https://github.com/gammapy/gamma-cat/blob/master/input/gammacat/gamma_cat_dataset.yaml for the datasets and we write two new yaml-files gamma_cat_lightcurve.yaml and gamma_cat_sed.yaml. Another way to implement it whould be to only have one file for the control of the catalog which could be formatted like this (the references are arbitrary, I only want to show the format):
@pdeiml - gamma_cat_dataset.yaml
is the configuration file that controls the content of the catalog.
You can see that it's used from catalog.py
here:
https://github.com/gammapy/gamma-cat/search?utf8=%E2%9C%93&q=gamma_cat_dataset.yaml+&type=
Of course, a fix for this issue and generally improvements to the catalog making / configuration are very welcome! If you take on this task, please make small pull requests that improve code / docs / content of the catalog that are easy and quick to review for me. (and not a large restructuring / rewrite, together with a fix for this issue)
Since it seems it wasn't clear to you how the catalog is built, starting by improving the contributor docs (and maybe renaming gamma_cat_dataset.yaml
to gamma_cat_catalog_config.yaml
or something) on that point would probably be the best first step.
For users, the most useful improvement (apart from fixing this issue which SED is filled in the catalog) would be to explain what's in the catalog here: https://gamma-cat.readthedocs.io/data/catalog.html
Again, I would suggest to make small improvements, e.g. just to link to https://github.com/gammapy/gamma-cat/blob/master/input/gammacat/gamma_cat_columns.yaml and say that those columns are available, or add a link to https://github.com/gammapy/gamma-cat/blob/master/input/gammacat/gamma_cat_dataset.yaml and explain that this controls the content of the catalog.
Well, gamma_cat_dataset.yaml
is written down in catalog.py
but if you take a closer look to the CatalogMaker, you see in https://github.com/gammapy/gamma-cat/blob/master/gammacat/catalog.py#L576 that not the gamma_cat_dataset.yaml
in the input folder is used but the one in the output folder which is basically a sum of all output resources.
This means that in fact input/gamma_cat_dataset.yaml
is not used by the catalog (I am sure that make.py
will break as soon as I delete the file but it is not really used).
But what about my suggestion above with one single file for all types of data and that we copy the datasets by scanning through the folders?
That sounds like a regression bug then. Initially when I wrote and used this, gamma_cat_dataset.yaml
from the input folder was used. Can you start by making a PR fixing this?
In this case, please also update this output file (or all catalog output files) so that we can see what changes in the catalog:
https://github.com/gammapy/gamma-cat/blob/master/output/gammacat.yaml
It would also be good to have some checks, e.g. you could add a test_catalog.py
here:
https://github.com/gammapy/gamma-cat/tree/master/gammacat/tests
which reads the catalog and puts some asserts to make sure it's filled OK.
Especially "regression tests", i.e. some information / number that is currently not filled correctly, is a good candidate to add an assert line.
But what about my suggestion above with one single file for all types of data and that we copy the datasets by scanning through the folders?
I don't understand. IMO the way it should work is that only input folder is scanned once to make an input index file. Then the output collection is made, starting with the input index. Then the catalog is made, from the output collection (via the output index), plus the config file what to put in the catalog. I thought it already worked that way; if not, please fix.
@pdeiml - I'm very busy this week, but if there's still questions, I'd prefer a 10 min call over Github back and forth.
I will take a closer look to this but I think that is a regression bug, yes.
I totally agree with you about the procedure at the end of your comment but I think that is not working like that. I will look at this problem as well the next days.
@pdeiml - Thank you!
We have to implement a way to control which SED goes in the catalog.
Yesterday I made this change: https://github.com/gammapy/gamma-cat/blob/ba11b084a1342d64ca80c35a8b46da2f59f20889/gammacat/catalog.py#L604
As you can see in these updated examples, there were cases before and after where the SED doesn't match the spectral model we have in the catalog: https://github.com/cdeil/gamma-cat-status/commit/218dfa996b08f468333f27f129d7ead81b5e568b
This is a reminder issue for myself to implement a good solution for this and to document it properly.