gammapy / gamma-cat

An open data collection and source catalog for gamma-ray astronomy
https://gamma-cat.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
15 stars 17 forks source link

Issue 126 #127

Closed pdeiml closed 7 years ago

pdeiml commented 7 years ago

I added two files which check for the following consistencies:

Between the references given in ./input/sources/*.yaml and the files in ./input/data Between the name of the folder in ./input/data and the reference_id given in the files in that folder.

The two files are - I think - well commented and you should see what they do fastly.

We have to discuss/decide whether to put these checks in ./gammacat/checks.py or ./gammacat/input.py or in some other file. As soon as I know where to put them, I will do that :-)

cdeil commented 7 years ago

@pdeiml - Thanks!

To be honest, I'm not sure if reviewing / improving this code is the best use of your / my time.

My suggestion would be that you finish the Biteau PR first and only if you have a few days of time to give to gammacat to start learning it's code and contributing to it.

The problem is that at the moment, the code in gammacat/input.py is pretty messy. I would like to refactor it to work by first creating an index of available resources (use the existing gammapy.catalog.gammacat.GammaCatResourceIndex), and to then do all data processing / validation with that. I did start to think about how gammacat should work and to write up some notes at https://gamma-cat.readthedocs.io/contribute/code.html, but as I said, at the moment it's quite a mess.

Your current scripts are spagetti code (i.e. no functions or classes, just code exectuting top to bottom) that don't re-use any existing code in gammacat (e.g. we use pathlib.Path everywhere, not os.path, and there are already glob patterns in gammacat/input.py to collect all files and that code shouldn't be duplicated but called).

That said, if you want to spend the time to get this PR in, then let me know and I'll send comments. Roughly, for now, the consistency check on input file names and content should go into the existing validate methods in gammacat/input.py, and the consistency check that all reference_id appear in ./input/sources/*.yaml should go in gammacat/checks.py, like mentioned here: https://github.com/gammapy/gamma-cat/issues/126#issuecomment-305508149

Out of curiosity: did you find any issues with these checkers? If yes -- could you please make a separate PR with the fixes?

cdeil commented 7 years ago

Just one more suggestion, whether you continue with this PR or not: get in the habit of writing functions or classes: in this case you have several top-level for something in some_list: check(something) blocks: always put everything in small functions, especially the code in the for block that processes one thing makes for a nice function.

pdeiml commented 7 years ago

Hi,

my intention about this PR is that will be very useful for the Biteau PR. Assume that in the script which adds the Biteau SEDs to /input/data there is a small mistake about e.g. the source_id in the file and the name of the file. With this PR here such a mistake will be found very very easy.

Indeed, I found one issue with this checkers: The reference_id in https://github.com/gammapy/gamma-cat/blob/master/input/data/2011/2011ApJ...738....3A/tev-000014-lc.ecsv is not the same as the name of the folder. I think the reference in the file is the correct one, isn't it?

For adding this code in gammacat/input.py and gammacat/check.py I need some information: Should the reference_id check be a additional argument of "make.py check", in other words should it be an additional elif in class checker? Other possibility is to add it to one of the other arguments "input, collection, catalog, global".

The other check, I think, should be part of the class InputData in input.py, shouldn't it? My idea is to put it as a additional line into InputData::validate(self) or to add it to "schemas.validate(), sources.validat(), datasets.validate()".

But for that I don't know enough about the intention of all these functions and validation functions.

cdeil commented 7 years ago

I think the reference in the file is the correct one, isn't it?

I'm not sure. I've split this out into #128, asking Gernot who originally added the data.

Should the reference_id check be a additional argument of "make.py check", in other words should it be an additional elif in class checker? Other possibility is to add it to one of the other arguments "input, collection, catalog, global".

I would suggest to do it as part of the "global" check. The idea there was to do "global" checks, i.e. such checks that involve multiple files / folders for consistency, just like what you're doing here.

The other check, I think, should be part of the class InputData in input.py, shouldn't it? My idea is to put it as a additional line into InputData::validate(self) or to add it to "schemas.validate(), sources.validat(), datasets.validate()".

I don't have a strong opinion here, put it where you like. As I said, I plan to do a big restructuring of input.py to change it to create an index of all files first and then to do the processing based on that index.

cdeil commented 7 years ago

I would suggest to close this PR, and then when I have time, I implement these checks into the scripts in gammacat that run via make.py. We still have #126 open as a reminder.