Tree file thing - Githubissues

dneise commented 7 years ago

WIP: Variant of the read_callisto question

There is an example of how to use this at the top of tree_file.py:

In [1]: from read_mars import tree_file
In [2]: tree_file?

Have a look at it

dneise commented 7 years ago

I am working on these problems right now.

mblnk commented 7 years ago

I like the idea of reading everything at once, but there are several things I don't like about this solution:

There are a lot of useless entries in the returned dict -- all leaves ending with a "." -- Many of the "MGeomCam" things are just filled with a useless random number -- Probably more
The returned dict contains arrays of different shapes: for each tree: [n_entries], and possibly [n_entries, 1440]
The information, which leaf is from which tree is lost
You can now only read everything at once -- If you have several files and only need a few leaves of some tree, this is extremely slow and requires a lot of memory

Now you get back a dict with a seemingly endless list of keys with many useless entries, of which you have to make sense in some way or another.

Because pandas dataframes are way easier to handle and provide a lot more functionality, I would propose to return for each tree in the file:

If the tree only has leaves of shape [n_entries]: return just a pandas dataframe.
If the tree has leaves of shape [n_entries] and [n_entries, 1440]: return a pandas dataframe plus a dict for the leaves of shape [n_entries, 1440].

This makes more sense to me, because for each tree in the file you get back two categories: Information per entry [n_events] or if available, information per pixel [n_events, 1440]
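
A minimal sketch of this proposal, assuming the leaves of one tree are already available as a flat dict of numpy arrays. The helper name split_by_shape and the sample data are made up for illustration and are not part of read_mars. It only works per tree, because all per-entry columns of a dataframe need the same length.

```python
import numpy as np
import pandas as pd


def split_by_shape(leaves):
    """Split the leaves of ONE tree into per-entry and per-pixel parts.

    1-D arrays (one value per entry) become dataframe columns;
    2-D arrays (one row per entry, one column per pixel) stay in a dict.
    """
    per_entry = {name: arr for name, arr in leaves.items() if arr.ndim == 1}
    per_pixel = {name: arr for name, arr in leaves.items() if arr.ndim == 2}
    return pd.DataFrame(per_entry), per_pixel


# Made-up stand-in for an "Events" tree with 3 entries and 4 pixels.
events = {
    "MTime.fMjd": np.array([58049, 58049, 58049], dtype=np.uint32),
    "MRawEvtHeader.fDAQEvtNumber": np.array([1, 2, 3], dtype=np.uint32),
    "MSignalCam.fPixels.fPhot": np.zeros((3, 4), dtype=np.float32),
}

df, pixels = split_by_shape(events)
print(df)
print({name: arr.shape for name, arr in pixels.items()})
```

The per-pixel leaves are kept as plain arrays because a [n_entries, 1440] array does not fit into a single dataframe column without extra work.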

dneise commented 7 years ago

The returned dict contains arrays of different shapes: for each tree: [n_entries], and possibly [n_entries, 1440]

Even worse, for the example file we provide for tests, we get shapes like (1,), (912,) or (912, 1440), where 912 is n_events.

example:

In [17]: for name, value in m.tree_file_to_dict('20171022_215_C.root').items():
    ...:     print(name, value.shape, value.dtype)
Error in <TTreeFormula::DefinedVariable>: fPixels_ is not a datamember of MSignalCam
Error in <TTreeFormula::Compile>:  Bad numerical expression : "MSignalCam.fPixels_"
Info in <TSelectorDraw::AbortProcess>: Variable compilation failed: {MSignalCam.fPixels_,}
MRawRunHeader. (1,) float64
MRawRunHeader.fMagicNumber (1,) uint16
MRawRunHeader.fHeaderSizeRun (1,) uint32
MRawRunHeader.fHeaderSizeEvt (1,) uint32
MRawRunHeader.fHeaderSizeCrate (1,) uint32
MRawRunHeader.fFormatVersion (1,) uint16
MRawRunHeader.fSoftVersion (1,) uint16
MRawRunHeader.fFadcType (1,) uint16
MRawRunHeader.fCameraVersion (1,) uint16
MRawRunHeader.fTelescopeNumber (1,) uint16
MRawRunHeader.fRunType (1,) uint16
MRawRunHeader.fRunNumber (1,) uint32
MRawRunHeader.fFileNumber (1,) uint32
MRawRunHeader.fProjectName (1,) int8
MRawRunHeader.fSourceName (1,) int8
MRawRunHeader.fObservationMode (1,) int8
MRawRunHeader.fSourceEpochChar (1,) int8
MRawRunHeader.fSourceEpochDate (1,) uint16
MRawRunHeader.fNumCrates (1,) uint16
MRawRunHeader.fNumPixInCrate (1,) uint16
MRawRunHeader.fNumSamplesLoGain (1,) uint16
MRawRunHeader.fNumSamplesHiGain (1,) uint16
MRawRunHeader.fNumBytesPerSample (1,) uint16
MRawRunHeader.fIsSigned (1,) bool
MRawRunHeader.fNumEvents (1,) uint32
MRawRunHeader.fNumEventsRead (1,) uint32
MRawRunHeader.fSamplingFrequency (1,) uint16
MRawRunHeader.fFadcResolution (1,) uint8
MRawRunHeader.fRunStart.fMjd (1,) uint32
MRawRunHeader.fRunStart.fTime.fMilliSec (1,) float64
MRawRunHeader.fRunStart.fNanoSec (1,) uint32
MRawRunHeader.fRunStop.fMjd (1,) uint32
MRawRunHeader.fRunStop.fTime.fMilliSec (1,) float64
MRawRunHeader.fRunStop.fNanoSec (1,) uint32
MRawRunHeader.fPixAssignment (1,) float64
MGeomCam. (1,) float64
MGeomCam.MGeomCam.fNumPixels (1,) uint32
MGeomCam.MGeomCam.fCamDist (1,) float32
MGeomCam.MGeomCam.fConvMm2Deg (1,) float32
MGeomCam.MGeomCam.fPixels (1,) float64
MGeomCam.MGeomCam.fMaxRadius (1,) float64
MGeomCam.MGeomCam.fMinRadius (1,) float64
MGeomCam.MGeomCam.fPixRatio (1,) float64
MGeomCam.MGeomCam.fPixRatioSqrt (1,) float64
MGeomCam.MGeomCam.fNumPixInSector (1,) float64
MGeomCam.MGeomCam.fNumPixWithAidx (1,) float64
MSignalCam. (912, 1440) float64
MSignalCam.fNumPixelsSaturatedHiGain (912, 1440) int32
MSignalCam.fNumPixelsSaturatedLoGain (912, 1440) int32
MSignalCam.fPixels.fRing (912, 1440) int16
MSignalCam.fPixels.fPhot (912, 1440) float32
MSignalCam.fPixels.fErrPhot (912, 1440) float32
MSignalCam.fPixels.fArrivalTime (912, 1440) float32
MSignalCam.fPixels.fTimeSlope (912, 1440) float32
MTime. (912,) float64
MTime.fMjd (912,) uint32
MTime.fTime.fMilliSec (912,) float64
MTime.fNanoSec (912,) uint32
MRawEvtHeader. (912,) float64
MRawEvtHeader.fDAQEvtNumber (912,) uint32
MRawEvtHeader.fNumTrigLvl1 (912,) uint32
MRawEvtHeader.fNumTrigLvl2 (912,) uint32
MRawEvtHeader.fTrigPattern (912,) uint32
MRawEvtHeader.fNumLoGainOn (912,) uint16
MSoftwareTrigger. (912,) float64
MSoftwareTrigger.fPatch (912,) int16
MSoftwareTrigger.fBaseline (912,) float64
MSoftwareTrigger.fPosition (912,) uint16
MSoftwareTrigger.fAmplitude (912,) float64

Now, the fact that the arrays we read out are inhomogeneous in shape reflects that they are inhomogeneous in shape in the file we read in.

I think this is how it should be.

mblnk commented 7 years ago

But in this case, there are two trees in the root file:

RunHeaders: 1 entry, so leaves with (1,)
Events: 912 entries, so leaves with (912,) and some leaves with (912, 1440)

So each tree has its own number of entries and may or may not contain pixel information. There are root files with even more than two trees. That is why I want to keep the tree information.
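
One way to keep the tree information is to nest the result by tree name instead of returning one flat dict. The sketch below is only an illustration: the helpers tree_of and flatten are hypothetical, and the shapes stand in for the real arrays (the assignment of leaves to trees follows the entry counts listed above).

```python
# Nested layout: {tree_name: {leaf_name: shape}}.
# Shapes are used here instead of real arrays to keep the example small.
shapes_by_tree = {
    "RunHeaders": {
        "MRawRunHeader.fRunNumber": (1,),
    },
    "Events": {
        "MTime.fMjd": (912,),
        "MSignalCam.fPixels.fPhot": (912, 1440),
    },
}


def tree_of(nested, leaf_name):
    """Return the name of the tree that contains leaf_name, or None."""
    for tree_name, leaves in nested.items():
        if leaf_name in leaves:
            return tree_name
    return None


def flatten(nested, sep="/"):
    """Flatten to one dict whose keys still carry the tree name."""
    return {
        f"{tree_name}{sep}{leaf_name}": value
        for tree_name, leaves in nested.items()
        for leaf_name, value in leaves.items()
    }


print(tree_of(shapes_by_tree, "MTime.fMjd"))
print(sorted(flatten(shapes_by_tree)))
```

With this layout a flat view is still one call away, but the origin of each leaf is never lost.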

dneise commented 7 years ago

There are a lot of useless entries in the returned dict -- all leaves ending with a "."

I must say, I was surprised to see that this does not seem to be entirely true. If one looks, for example, at all the leaves with "MTime" in their name (I removed the ugly ROOT error messages):

In [20]: d = m.tree_file_to_dict('20171022_215_C.root')
    ...: for name, value in d.items():
    ...:     if "MTime" in name:
    ...:         print(name, value[:3])
    ...:         
MTime. [ 58049.26274459  58049.26274647  58049.26274671]
MTime.fMjd [58049 58049 58049]
MTime.fTime.fMilliSec [ 22701132.  22701295.  22701315.]
MTime.fNanoSec [983000  45000 464000]

I have the impression that only the leaf called "MTime." (with a dot at the end) contains a useful timestamp ...
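
The three rows printed above look consistent with "MTime." being a fractional MJD built from the other three leaves. The sketch below checks that arithmetic; the formula is an assumption inferred from these numbers only, not something taken from the MARS or ROOT sources, and mjd_from_parts is a made-up helper name.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def mjd_from_parts(mjd_day, millisec, nanosec):
    """Assumed combination: whole day + (ms + ns converted to ms) / day."""
    return mjd_day + (millisec + nanosec * 1e-6) / MS_PER_DAY


# (fMjd, fTime.fMilliSec, fNanoSec, value printed for "MTime.")
rows = [
    (58049, 22701132.0, 983000, 58049.26274459),
    (58049, 22701295.0, 45000, 58049.26274647),
    (58049, 22701315.0, 464000, 58049.26274671),
]

for mjd_day, millisec, nanosec, printed in rows:
    combined = mjd_from_parts(mjd_day, millisec, nanosec)
    # Compare at the 8 decimals that the console output shows.
    print(f"{combined:.8f}", f"{printed:.8f}")
```

If this reading is right, the leaf with the trailing dot is a convenience value derived from the other leaves and not an independent measurement.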

dneise commented 7 years ago

That is why I want to keep the tree information.

Sorry, your remark contained a lot of points. I did not yet answer the one about keeping the information about tree names; I answered the one about different shapes.

dneise commented 7 years ago

The information, which leaf is from which tree is lost

Good point, we should keep that information.

dneise commented 7 years ago

You can now only read everything at once -- If you have several files and only need a few leaves of some tree, this is extremely slow and requires a lot of memory

The point about memory is only partly true. One can (and will) do this if memory is an issue:

import read_mars

# Only the selected leaf is kept; each full dict is discarded after indexing.
my_stuff = [
    read_mars.tree_file_to_dict(path)['leaf_name']
    for path in list_of_paths
]

But the point is that one will only do this after having read all the leaf names from a file and explored its contents to find out what is interesting and what is not. And only if memory is an issue.

About the time ... this is true. Is it an issue, or only a theoretical one? How many files are people typically reading? Does the action they perform on the file's contents take less time than reading it?

Of course, one can address this issue by splitting the work into two steps:

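# get_file_keys / get_file_values are only a proposed API, not existing functions;
# passing a path to get_file_keys and getting back a list are assumptions
keys = read_mars.get_file_keys(list_of_paths[0])
# look at keys and find out what one is interested in
interesting_keys = [keys[i] for i in (1, 5, 17)]
my_stuff = [
    read_mars.get_file_values(path, interesting_keys)
    for path in list_of_paths
]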

This also has its advantages.

dneise commented 7 years ago

Now you get back a dict with a seemingly endless list of keys with many useless entries, of which you have to make sense in some way or another.

Is this point about the "dict" or about the "many keys"? Further down you write:

Because pandas dataframes are way easier to handle and provide a lot more functionality [...]

dneise commented 7 years ago

This makes more sense to me, because for each tree in the file you get back two categories: Information per entry [n_events] or if available, information per pixel [n_events, 1440]

I see ... I hoped to be able to read any leaf from any tree, so that somebody who happens to have a MARS file, maybe not even FACT-related, has a chance to get as much out of the file as possible.

Based on this "be-able-to-read-everything"-approach I hoped anybody could brew their own solution.

Well .. it was a proposal. ... People don't like it ... also fine.

fact-project / read_mars

Tree file thing #11