gammapy / gamma-cat

An open data collection and source catalog for gamma-ray astronomy
https://gamma-cat.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Easiest way to access all publications of a gamma-cat object #194

Open GernotMaier opened 6 years ago

GernotMaier commented 6 years ago

I am trying to access all published data of a certain object, but I am not entirely sure how to do this. This is what I've started with:

import numpy as np
import gammapy
from gammapy.catalog import SourceCatalogGammaCat
gammacat = SourceCatalogGammaCat()
gammacat_source = gammacat['HESS J0632+057']
for k, v in gammacat_source.data.items():
    print(k,v)

This prints me only the values from the most recent publication in gamma-cat (in this case from 2017arXiv170804045M).

What is the easiest way to loop over all publications on an object (e.g. HESS J0632+057)?

pdeiml commented 6 years ago

I think this is not easy right now, and not possible via Python and gammapy alone.

You can search here https://gammapy.github.io/gamma-cat/sources.html and search the id of the source you are interested in. After that you can go here https://github.com/gammapy/gamma-cat/blob/master/docs/data/gammacat-datasets.json where all datasets of all sources are listed.
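Once you have the source id and the JSON index, filtering it by hand only takes a few lines. A sketch (the `source_id` / `reference_id` field names are assumptions based on this thread, not the exact schema; the sample records are made up):

```python
import json

# A small in-memory stand-in for gammacat-datasets.json:
# a flat JSON array with one record per dataset.
index_json = """
[
  {"source_id": 79, "reference_id": "2017arXiv170804045M"},
  {"source_id": 79, "reference_id": "paper-B"},
  {"source_id": 49, "reference_id": "paper-C"}
]
"""
index = json.loads(index_json)

# Collect all publications (reference_ids) for one source
refs = sorted({d['reference_id'] for d in index if d['source_id'] == 79})
print(refs)
```

With the real file you would replace `index_json` by reading `gammacat-datasets.json` from disk.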

@cdeil Maybe we should change the layout of the index file. It is currently an array of ordered dicts, each one representing a dataset. If we changed the layout so that every dict corresponds to one source, then one could do something like this:

from gammacat.utils import load_json

# Load the dataset index and look at one entry
index = load_json('./docs/data/gammacat-datasets.json')
index[1]['reference_id']

Is that a good idea?

cdeil commented 6 years ago

I think what @GernotMaier wants is already easily possible: http://docs.gammapy.org/dev/api/gammapy.catalog.GammaCatResourceIndex.html#gammapy.catalog.GammaCatResourceIndex.query

Although I'm sure this needs to be extended. I don't think anyone has used it yet; we only introduced the index files recently. I'm not even sure whether all data is already in the output folder and listed in the index file.

Like I said in the call, I'll make a tutorial notebook how to use it soon. Assigning this issue to myself. I was hoping to get to it today, Monday at the latest.

cdeil commented 6 years ago

@pdeiml - I don't think your suggestion to change the index file format is useful. People want to query and select subsets in different ways. Hard-coding a format where the first key is the source name, or the reference, or something else, is very convenient for one use case but not for the others. So I think what we have now, a flat list of datasets, plus either pre-coded helpers or showing people how to do things with 1-2 lines of Python for the different use cases, is the best we can do.
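To illustrate why the flat layout is flexible enough, here are two of those use cases as one-liners over a flat index. This is a sketch with made-up records and assumed field names, not the real schema:

```python
from collections import defaultdict

# Hypothetical flat index: one dict per dataset (field names assumed)
index = [
    {'source_id': 79, 'reference_id': '2017arXiv170804045M'},
    {'source_id': 79, 'reference_id': 'paper-B'},
    {'source_id': 49, 'reference_id': 'paper-C'},
]

# Use case 1: all datasets for one source
per_source = [d for d in index if d['source_id'] == 79]

# Use case 2: group datasets by publication, without changing the file format
by_reference = defaultdict(list)
for d in index:
    by_reference[d['reference_id']].append(d)

print(len(per_source), len(by_reference))
```

Grouping by source instead of by reference is the same pattern with a different key, which is the point: no single hard-coded layout is needed.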

cdeil commented 6 years ago

Here's an example: https://github.com/cdeil/gamma-cat-status/blob/cc3084fc9b43933a93203275eed7385d322efb82/notebooks/data_collection_example.ipynb

It's very rough, but might be of help for now while we work on gamma-cat. Concretely it shows that

cdeil commented 6 years ago

@GernotMaier - We now also have http://gamma-cat.readthedocs.io/use/source_list.html and on the source detail pages like e.g. http://gamma-cat.readthedocs.io/use/sources/79.html a list of available resources in gamma-cat.

Unfortunately it's still buggy, i.e. not reliable at the moment. But easy to fix: https://github.com/gammapy/gamma-cat/issues/198#issuecomment-361191490

cdeil commented 6 years ago

PS: I think the issue with string "scale" entries mentioned above is resolved. I searched the input YAML files and couldn't find any.

cdeil commented 6 years ago

@micheledoro - this is how to get a list of MAGIC data in gamma-cat for now: https://gist.github.com/cdeil/90c282bee5d5644630085afad0bac313

I'm posting it here because it might be of interest to others. For now, the way to do it is always to take the index of available files and then filter for the ones you're interested in (using a pandas DataFrame is convenient for that).
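That index-then-filter pattern looks roughly like this with pandas. The records and column names below (e.g. a `telescope` column) are made up for illustration; they are not the real gamma-cat schema:

```python
import pandas as pd

# Build a DataFrame from a hypothetical flat resource index
index = pd.DataFrame([
    {'source_id': 79, 'reference_id': 'paper-1', 'telescope': 'magic'},
    {'source_id': 79, 'reference_id': 'paper-2', 'telescope': 'hess'},
    {'source_id': 49, 'reference_id': 'paper-3', 'telescope': 'magic'},
])

# Filter the index down to MAGIC datasets
magic = index[index['telescope'] == 'magic']
print(magic['reference_id'].tolist())
```

From there, `magic.to_dict('records')` gives back plain dicts for further processing.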

micheledoro commented 6 years ago

Hi @cdeil, I was trying to use the notebook you generated, but after

for dataset in datasets:
    dataset.update(get_info(dataset))

I have the following error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-15-d794a293c914> in <module>()
      1 for dataset in datasets:
----> 2     dataset.update(get_info(dataset))

<ipython-input-4-4b8dd313a8bf> in get_info(dataset)
      5         meta = Table.read(filename, format='ascii.ecsv').meta
      6     else:
----> 7         meta = yaml.load(open(filename))
      8 
      9     return dict(

/anaconda3/envs/gammapy-tutorial/lib/python3.6/site-packages/yaml/__init__.py in load(stream, Loader)
     68     and produce the corresponding Python object.
     69     """
---> 70     loader = Loader(stream)
     71     try:
     72         return loader.get_single_data()

/anaconda3/envs/gammapy-tutorial/lib/python3.6/site-packages/yaml/loader.py in __init__(self, stream)
     32 
     33     def __init__(self, stream):
---> 34         Reader.__init__(self, stream)
     35         Scanner.__init__(self)
     36         Parser.__init__(self)

/anaconda3/envs/gammapy-tutorial/lib/python3.6/site-packages/yaml/reader.py in __init__(self, stream)
     83             self.eof = False
     84             self.raw_buffer = None
---> 85             self.determine_encoding()
     86 
     87     def peek(self, index=0):

/anaconda3/envs/gammapy-tutorial/lib/python3.6/site-packages/yaml/reader.py in determine_encoding(self)
    122     def determine_encoding(self):
    123         while not self.eof and (self.raw_buffer is None or len(self.raw_buffer) < 2):
--> 124             self.update_raw()
    125         if isinstance(self.raw_buffer, bytes):
    126             if self.raw_buffer.startswith(codecs.BOM_UTF16_LE):

/anaconda3/envs/gammapy-tutorial/lib/python3.6/site-packages/yaml/reader.py in update_raw(self, size)
    176 
    177     def update_raw(self, size=4096):
--> 178         data = self.stream.read(size)
    179         if self.raw_buffer is None:
    180             self.raw_buffer = data

/anaconda3/envs/gammapy-tutorial/lib/python3.6/encodings/ascii.py in decode(self, input, final)
     24 class IncrementalDecoder(codecs.IncrementalDecoder):
     25     def decode(self, input, final=False):
---> 26         return codecs.ascii_decode(input, self.errors)[0]
     27 
     28 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 274: ordinal not in range(128)

cdeil commented 6 years ago

@micheledoro - I don't see that error. Maybe you added a dataset locally and it contains a non-ascii character?

Can you change to

for dataset in datasets:
    print(dataset['location'])
    dataset.update(get_info(dataset))

and see which file has the problematic character?

If it's not obvious which character it is, you can paste it in https://gist.github.com/ and I'll have a look.

There are also tools like https://pteo.paranoiaworks.mobi/diacriticsremover/ that remove non-ASCII characters for you. Usually it's a long dash in source names copied from PDFs or something like that, e.g. "MAGIC BLA–BLA".
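For reference, the traceback above comes from `open(filename)` falling back to the environment's preferred encoding, which in that environment is ASCII, so any UTF-8 dash in a file breaks the read. A self-contained sketch of the cause and workaround (the file content here is made up):

```python
import os
import tempfile

# Reproduce the failure mode: a YAML file containing a UTF-8 en dash
path = os.path.join(tempfile.mkdtemp(), 'dataset.yaml')
with open(path, 'w', encoding='utf-8') as f:
    f.write('source_name: MAGIC BLA\u2013BLA\n')

# open(path) with no encoding uses the locale's preferred encoding;
# if that is ASCII, reading raises UnicodeDecodeError.
# Forcing UTF-8 avoids it:
with open(path, encoding='utf-8') as f:
    text = f.read()

# Locate any non-ASCII characters, e.g. to clean them up by hand
bad = [(i, c) for i, c in enumerate(text) if ord(c) > 127]
print(bad)
```

In the notebook's `get_info`, the corresponding one-line fix would be `yaml.load(open(filename, encoding='utf-8'))` (with `yaml.safe_load` generally preferred over `yaml.load` for untrusted files).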

micheledoro commented 6 years ago

Hi,

I tried with print(dataset['location']), but I cannot see any issue in the output, and by the way, I had not added any folder... see https://gist.github.com/micheledoro/fcf339860c312f323772f32ab438d40c

cdeil commented 6 years ago

@micheledoro - I don't quite understand. Is the issue gone, or is it still there for you?

micheledoro commented 6 years ago

Hi. What I meant is that I still have the same problem, even though the output looks fine; see the link.

cdeil commented 6 years ago

@micheledoro - From the traceback you showed above, the last line in your code that is executed is this one:

<ipython-input-4-4b8dd313a8bf> in get_info(dataset)
      5         meta = Table.read(filename, format='ascii.ecsv').meta
      6     else:
----> 7         meta = yaml.load(open(filename))

Thus my suggestion to print the filename.

This should print the filename of the problematic file, and then the error should appear and the traceback be printed. Is this not the case?

What you pasted here, the last line is an ECSV file: https://gist.github.com/micheledoro/fcf339860c312f323772f32ab438d40c#file-print-dataset-location-L585

I don't understand what's going on ...