dkirkby / bossdata

Tools for accessing SDSS BOSS data
MIT License

Feeding bossquery results back to bossquery #61

Closed by dcunning11235 9 years ago

dcunning11235 commented 9 years ago

It would sometimes be convenient to pass results generated by bossquery back to bossquery as --where values. For example:

bossquery --what "PLATE,MJD,FIBER,OBJTYPE,CLASS,ZWARNING" --where "criteria!" --save things_wanted.dat

and later come back and run

bossquery --what "PLATE,MJD,FIBER,THING_ID,EBOSS0" --where edited_things_wanted.dat

This could be implemented by reading the column headers to build the where clause, and then by querying either (a) by each set of values or (b) using the IN operator in SQL if SQLite supports it.
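As a rough illustration of option (a), the sketch below builds a parameterized where clause from a list of column headers and value rows, then runs it against a small in-memory SQLite table. The table name `meta`, the helper `build_where`, and the sample values are all made up for the example and are not part of bossquery. For reference, SQLite does support the IN operator, but matching on several columns at once is shown here as an OR of AND groups, which works on any SQLite version.

```python
import sqlite3


def build_where(columns, rows):
    """Return (sql_fragment, params) that matches any of the given rows exactly.

    Column names are interpolated into the SQL text, so they should come
    from a trusted header; the values are passed as bound parameters.
    """
    if not rows:
        return "0", []  # matches nothing
    one_row = "(" + " AND ".join(f"{c} = ?" for c in columns) + ")"
    clause = " OR ".join([one_row] * len(rows))
    params = [value for row in rows for value in row]
    return clause, params


# Hypothetical stand-in for the real database (table and values are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (PLATE INT, MJD INT, FIBER INT, THING_ID INT)")
conn.executemany(
    "INSERT INTO meta VALUES (?, ?, ?, ?)",
    [(3586, 55181, 1, 101), (3586, 55181, 2, 102), (3587, 55182, 1, 103)],
)

columns = ["PLATE", "MJD", "FIBER"]           # read from the saved file's header
rows = [(3586, 55181, 2), (3587, 55182, 1)]   # read from the saved file's body

where, params = build_where(columns, rows)
result = conn.execute(
    f"SELECT PLATE, MJD, FIBER, THING_ID FROM meta WHERE {where} ORDER BY THING_ID",
    params,
).fetchall()
print(result)
```

SQLite caps the number of bound parameters per statement, so a file with many rows would need to be queried in chunks or handled some other way.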

dkirkby commented 9 years ago

The predecessor to bossquery was a C++ program called bossfilter that supported this type of progressive filtering. You can see some usage examples here. However, that code used ROOT trees instead of SQL, so this would require a new approach in bossquery.

I briefly thought about adding this functionality when writing the initial version of bossquery but didn't implement it because I imagined that the role of bossquery is primarily for doing a first pass on the 2.5 million rows in spAll in a reasonable amount of time. Once you have one or more bossquery output files with <100K rows, then it is much more flexible to write a small python script to implement complex selection and query logic than to cover all possible scenarios in bossquery. I think this applies to your use case, but could be convinced otherwise.

I think the first action item here is to try and write a simple python script that reads your things_wanted.dat and implements the equivalent of your second command.
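One possible shape for such a script is sketched below. It assumes the saved file is whitespace-delimited with a header line of column names and no spaces inside values, and that the data lives in an SQLite table, here called `meta`; the actual bossquery output format and database layout may differ. The helpers `read_keys` and `requery` are hypothetical names. The script loads the PLATE, MJD, FIBER keys into a temporary table and joins against it, which avoids building a very long where clause.

```python
import io
import sqlite3

KEYS = ("PLATE", "MJD", "FIBER")  # columns assumed to identify a row uniquely


def read_keys(fileobj, keys=KEYS):
    """Read the key columns from a whitespace-delimited table with a header line."""
    header = None
    out = []
    for line in fileobj:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if header is None:
            header = fields
            idx = [header.index(k) for k in keys]
            continue
        out.append(tuple(int(fields[i]) for i in idx))
    return out


def requery(conn, table, keys, key_rows, what):
    """Select the `what` columns for rows of `table` whose keys appear in key_rows."""
    conn.execute("DROP TABLE IF EXISTS temp.wanted")
    conn.execute(f"CREATE TEMP TABLE wanted ({', '.join(keys)})")
    placeholders = ", ".join("?" * len(keys))
    conn.executemany(f"INSERT INTO wanted VALUES ({placeholders})", key_rows)
    select = ", ".join(f"t.{c}" for c in what)
    on = " AND ".join(f"t.{k} = w.{k}" for k in keys)
    order = ", ".join(f"t.{k}" for k in keys)
    sql = f"SELECT {select} FROM {table} AS t JOIN wanted AS w ON {on} ORDER BY {order}"
    return conn.execute(sql).fetchall()


# Demo with made-up data standing in for the real database and saved file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (PLATE INT, MJD INT, FIBER INT, THING_ID INT, EBOSS0 INT)")
conn.executemany(
    "INSERT INTO meta VALUES (?, ?, ?, ?, ?)",
    [(3586, 55181, 1, 101, 0), (3586, 55181, 2, 102, 4), (3587, 55182, 1, 103, 8)],
)
saved = io.StringIO("PLATE MJD FIBER OBJTYPE\n3586 55181 2 QSO\n3587 55182 1 GALAXY\n")

wanted = read_keys(saved)
rows = requery(conn, "meta", KEYS, wanted, ["PLATE", "MJD", "FIBER", "THING_ID", "EBOSS0"])
print(rows)
```

With a real file, `io.StringIO(...)` would be replaced by `open("things_wanted.dat")`, and the in-memory connection by a connection to the actual database file.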

dcunning11235 commented 9 years ago

I wasn't thinking of additional selection criteria so much as having the ability to re-run a query with different columns selected for output (I may be quibbling over the use of 'criteria'). I actually threw in the changed filename edited_things_wanted.dat because it occurred to me at the last second that someone might have processed the file in some way; that wasn't central to my original thought.

So, ignoring that, this amounts to being able to run a query once and then run a new query with, effectively, the same where clause, but based only on the output of the first query. I'm not sure this would be useful for very general tasks ("What are the criteria for this subset of data?" "It's... this data." Probably not a good conversation.) But it's good for quick-and-dirty-and-lazy exploration ("I have blah-blah-blah, let me just add ZWARNING and OBJTYPE to that.")