desihub / desisim

DESI simulations
BSD 3-Clause "New" or "Revised" License
16 stars 22 forks source link

TEMPLATETYPE #360

Closed martinjameswhite closed 6 years ago

martinjameswhite commented 6 years ago

The current format of TEMPLATETYPE is a bit of a pain, requiring code like

dat["TEMPLATETYPE"]==bytes("ELG ",encoding="ASCII")

. Would it be possible to change this to an integer, or something easier to deal with? Some of our tools are reading this through several layers (e.g. dask) and the more complicated queries hit edge-cases.

moustakas commented 6 years ago

Thanks for the comment, @martinjameswhite.

This choice was discussed in https://github.com/desihub/desitarget/issues/149. I'm ambivalent about what we should do, although I agree that variable-length strings (e.g., STAR vs ELG) are a pain to deal with.

For the data challenge catalogs, TEMPLATETYPE gets set in desitarget.mock.mockmaker, so this issue should probably be posted in desitarget.

Finally note that if we changed TEMPLATETYPE we would also have to change TEMPLATESUBTYPE (which distinguishes DA vs DB white dwarfs, Lya quasars, and the spectral types of stars), which probably complicates matters.

@sbailey should chime in on how he'd like to proceed.

sbailey commented 6 years ago

I agree that the trailing spaces are a pain. They are a side effect of using fitsio for writing the files at some point in the chain. See esheldon/fitsio#42 and esheldon/fitsio#99 for background on why it pads with spaces instead of nulls (sample quotes below). If we can stomach the performance hit, at minimum I'd like to use astropy.io.fits to write these tables without the trailing spaces.

I'm less enthusiastic about going all the way to integers. Whether that is "easier" is in the eye of the beholder. e.g. this seems pretty easy and obvious to me:

dat["TEMPLATETYPE"] == b"ELG"

vs. most obscure

dat["TEMPLATETYPE"] == 4

If the arguement is that we wouldn't use a hardcoded number anyway but use something like

dat["TEMPLATETYPE"] == templatetypes.ELG

Then one could just define templatetypes.ELG = b"ELG" anyway.

Countering my own argument: There is the pain that in python 3 fitsio doesn't upcast from bytes to strings but astropy.io.fits does, such that you have to know whether your input data came from astropy or fitsio to know whether to compare against b"ELG" vs. "ELG" even if we solve the extra spaces issue. But managing integers for stellar subtypes seems an alternate form of pain.

Other opinions / insights?


sample quotes about c/fitsio padding strings with spaces when writing:

Definitely not going to get help from cfitsio here. I'm not sure how we could possibly make this work, since cfitsio internally removes the '\0' and pads with spaces

cfitsio does not follow the standard here. I've tried to fix it up by hacking cfitsio but then lots of things break.

rainwoodman commented 6 years ago

What about fixed 4/8 bytes integer that looks like 'ELG' in bigendian or little-endian.

The number is then not exactly difficult to remember. e.g.

In [14]: numpy.frombuffer(b'LRG ', dtype='>i4') Out[14]: array([1280460576], dtype=int32)

In [15]: numpy.frombuffer(b'ELG ', dtype='>i4') Out[15]: array([1162626848], dtype=int32)

In [16]: numpy.frombuffer(b'QSO ', dtype='>i4') Out[16]: array([1364414240], dtype=int32)

or better, one can get rid of numpy dependency here:

In [22]: int.from_bytes(b'QSO ', 'big') Out[22]: 1364414240

rainwoodman commented 6 years ago

BTW, I think astropy.fits is incorrect if it decodes bytes into text. I suppose the fits standard is pre-utf8, so any automatic decoding will break if say someone stored GBK-encoded text in their fits file (not sure if people do that). It's more practical to leave the mistakes to the downstream application in this case.

martinjameswhite commented 6 years ago

I guess I'd be perfectly happy being able to say dat["TEMPLATETYPE"] == templatetypes.ELG however you prefer to implement that. Since a string never seems to be a string these days I was suggesting integers as something that we could all agree on. I'm happy with numpy.frombuffer(b'QSO',dtype='>i4') too.

martinjameswhite commented 6 years ago

And just to contradict the patently absurd notion of me ever being "happy", I am wondering why templatetypes.ELG and not templatetypes['ELG']? I can imagine it being a little easier to batch process things with the latter rather than the former.

rainwoodman commented 6 years ago

tab completion?

martinjameswhite commented 6 years ago

I was imagining a script which looped over a number of target types, or spread the computation of different target types to different ranks...

rainwoodman commented 6 years ago

[ELG, LRG] vs ['ELG', 'LRG']?

Perhaps one just need a variable or a member on the enumeration type that yields all possible target types.

Also notice int.from_bytes(b'QSO ', 'big') works the same without pulling in numpy. So the magic number is not as ugly.

moustakas commented 6 years ago

I'd like to stick with using strings to encode TEMPLATETYPE and TEMPLATESUBTYPE for the sake of clarity and continuity of the current data model. Hopefully the workarounds have not been too painful.