Major Updates to DataIO Class

ckarwin commented 8 months ago

This PR has significant updates to the DataIO class, as summarized below.

update 1: I changed the way in which output files are written. Previously, they were written by default. Now, the user has the option to write the output to file, and they must also specify the name of the output. Generally, this is done with the "output_name=my_file" option. This addresses the following open issues: https://github.com/cositools/cosipy/issues/63 https://github.com/cositools/cosipy/issues/49

update 2: Chunking of the UnBinnedData.py class. This class reads the photon data from a tra file and writes it to either a fits or h5 file. This will not be need for most users, as they will start with the unbinned fits file. But nevertheless, reading large tra files can be problematic, especially for users with basic laptops. I've added an option to do this in chunks. Specifically, the user can now specify 'event_min' and 'event_max', which are the min and max event numbers that will be read. This allows a user to chunk the raw tra file as needed. Note that the UnBinnedData.py class also has a method for combining multiple fits files: combine_unbinned_data.

update 3: Connected to the spacecraft file. The dataIO has been getting its pointing information directly from the tra file. I've now given the option to instead get the pointing information from the spacecraftfile class, i.e. based on an input orientation file. This option is available by specifying 'use_ori=True' in read_tra. Note that for now I have left use_ori=False as the default, in which case the pointing information comes directly from the tra simulation file. I've done some testing to verify that the ori file gives the same results, however, I think a bit more testing could be warranted, which is why I've left it as an option for now.

In order to make this work I needed to make a small update to SpaceCraftFile.py: In the parse_from_file method I added 'comments=("#","EN")' in np.loadtxt. This interprets "EN" as a comment, in addition to "#". This is needed b/c most ori files end with "EN".

update 4: Added units to binned histogram, and also made psichi axis a healpix object. For now I have also left the old method, just in case it's needed for some sanity checks. By default the new method is used, but the old method can be used by specifying psichi_old=True. Again, I will eventually remove this once we now the updated method is working ok. I directly compared the new method to the old method, for both Galactic and local coordinates, and verified that they are identical. This addresses the following open issues: https://github.com/cositools/cosipy/issues/76 https://github.com/cositools/cosipy/issues/73

update 5: I updated the dataIO test in cosipy/tests/dataIO. Note: For testing the unbinned updates of this PR one can simply run test_unbinned_data_with_MEGAlib.py.

The following open issues are also taken care of with this PR: https://github.com/cositools/cosipy/issues/66 https://github.com/cositools/cosipy/issues/90

I still have the following updates to make: Implement the yaml configurator Initialize binning with response Add more unit tests Change print to logging info Add kinematic tests to the UnBinnedDataClass (similar to the MEGAlib reader). These will be done before the release of DC2.

With the above updates, it's a good place to make a PR. Please let me know if there are any questions, issues, etc.

Yong2Sheng commented 8 months ago

Hi Chris, thanks for the pull request, it looks great! I only have a minor comment regarding the feature:

Specifically, the user can now specify 'event_min' and 'event_max', which are the min and max event numbers that will be read.

This is a good feature to improve the event reading. But when a user gets a tra file, how does the user get the knowledge of the number of events he/she will need? It looks like a similar feature to the time cut, which is more intuitive for the user to make the decision of the events to extract.

Also there might be a confusion raises by the term "event_min", "event_max" and "min and max event numbers" Here the "numbers" are actually the number ID to each event starting from the first event. It could be misunderstood as the total number instead of number ID. (The number ID is equivalent to N_events at each iteration in the for loop.)

So changing the description and the name of variables might help: event_min --> min_event_num_id event_max --> max_event_num_id min and max event numbers --> min and max event number ID

ckarwin commented 8 months ago

Hi Yong,

Thank you for your review.

This feature is just a way to allow for the data to be read in chunks, which is only needed to deal with memory issues, depending on the user's system. It is not really a similar feature as the time cut, which is a selection on the data that effects the data analysis.

"The number ID is equivalent to N_events at each iteration in the for loop" This is not necessarily true. In the simulation file this is true, because the number ID starts at 0 and increases linearly. However, the tra files are usually extracted with mimrec, which applies some cut on the simulated data, so the number IDs do not increase linearly. But anyways, the number IDs don't actually matter here.

"Here the 'numbers' are actually the number ID to each event starting from the first event. It could be misunderstood as the total number instead of number ID." I think you have this mixed up. The 'numbers' are indeed the total number of events. Again, the whole point here is to read in the data in smaller chunks.

"But when a user gets a tra file, how does the user get the knowledge of the number of events he/she will need?" In general the user should extract all the events. If they want to make some cuts on the data based on some specified selections, that is done in another function. To choose how many events to read in at once depends on what the user's system can handle, i.e. you only need this if you're trying to convert a large file and don't have enough RAM. In general though, most common users won't use this function, because they will be starting with the data in fits file formate.

I see that my docstring might be a bit confusing. What if I add a note like this:

Note: event_min and event_max correspond to the total number of events in the file, which is not necessarily the same as the event ID number. The purpose of this is to allow the data to be read in chunks, in order to overcome memory limitations, depending on the user's system.

Yong2Sheng commented 8 months ago

Hi Chris,

Thank you for the explanation! The new docstring makes more sense. Thank you!

ckarwin commented 8 months ago

Ok, thanks. It's been updated.

hiyoneda commented 7 months ago

In UnBinnedData.py, at line 555, this_dict = get_dict_from_fits(each) should be this_dict = self.get_dict_from_fits(each)

ckarwin commented 7 months ago

@hiyoneda Ok, fixed now.

hiyoneda commented 7 months ago

I've checked the added functions, and it is ready to merge. One thing I noticed is that the pointing value changes a bit when using the orientation file. Using GalacticScan.inc1.id1.crab2hr.extracted.tra.gz and 20280301_first_2hrs.ori, I checked 'Zpointings (glon,glat)' for the first ten events with/without "use_ori", like

analysis = UnBinnedData(yaml)
analysis.data_file = os.path.join(test_data.path,analysis.data_file)
analysis.ori_file = os.path.join(test_data.path,analysis.ori_file)
analysis.read_tra(event_max = 10)
cosi_dataset = analysis.cosi_dataset
analysis.read_tra(event_max = 10, use_ori = True)
cosi_dataset_use_ori = analysis.cosi_dataset
cosi_dataset['Zpointings (glon,glat)'] - cosi_dataset_use_ori['Zpointings (glon,glat)']

The output of the last line is

array([[0.00041339, 0.00093654],
       [0.00041168, 0.0009327 ],
       [0.00040577, 0.00091946],
       [0.00040558, 0.00091903],
       [0.00040556, 0.00091898],
       [0.00040525, 0.0009183 ],
       [0.00040472, 0.00091711],
       [0.00040274, 0.00091266],
       [0.0004002 , 0.00090697]])

The difference is less than 0.1 degrees, much smaller than the COSI angular resolution, so I consider it does not matter, but want to share it.

ckarwin commented 7 months ago

Thanks @hirokiyoneda For future reference, the dataIO is using a linear interpolation between between time stamps in the ori file (which uses a 1s cadence). I'm not sure exactly what MEGAlib is doing, but based on your test, it looks like it's using the same pointing for each event between timestamps. The difference are small, but we should test this further in the future.

cositools / cosipy

Major Updates to DataIO Class #94