Open dneise opened 7 years ago
Related to issue #8
Thanks for implementing the read function so fast.
The function worked on several callisto files, but it really is painfully slow so I just applied the V1-trick again.
Now it takes only 123 ms instead of 106 s to read 499 Events.
That is awesome! I tried to play around with the Draw command but could not get it to work.
How did you manage to guess(?) these Draw strings?
I would vote to merge, but these things still might be added.
But if we can streamline an interface, we might wanna do it now.
Concerning the "Draw-strings": I just had a look into a callisto file with the root TBrowser() and found those leafs. As it is possible to show the histogram of both leafs in the TBrowser, it follows, that the Draw function can handle them. As the number of entries is exactly 1440 times the number of events, I guessed, that V1 contains an ordered vector with all pixels of all events. Reshaping the returned array into slices of length 1440 and comparing the values to the ones of the implementation without V1 confirmed the guess. I also had a short look into "MSignalPix.cc" of MARS - but that was not too helpful.
About other things in a callisto file:
There is another single pixel information: a " time variable fTimeSlope describing the width of the rise time". To get that, one would need to add it to the dictionary of the fields with "MSignalCam.fPixels.fTimeSlope".
As one can also see in the screenshot attached to the previous post, there are also some "typical" branches like MPointingPos and MTime. One might need those, but they can already be read with the read_mars() function.
It might be useful to add a check to read_mars, that automatically skips leafs containing MSignalCam and prints a notification. The notification might say something like: "This root file contains branches with MSignalCam. To read the pixel/image information, try read_callisto()."
Concerning the data type:
I don't know where it happens, that the data type always ends up as float64, but I think that GetV1 is always double.
Aside from that, the datatype of a leaf can apparently be checked:
file = ROOT.TFile(filename)
tree = file.Get('Events')
leaf = tree.GetLeaf("MSignalCam.fPixels.fArrivalTime")
leaf.GetTypeName()
Do we need/want to do something with that information?
We could convert the float64 back to the type of the leaf.
This V1 stuff casts all data to float64, as the Histogram's Fill functions takes a double.
I just added the casting to the original datatypes. Can the changed datatypes of the pandas dataframe returned by read_mars lead to problems? If no problems are expected, I would vote to merge.
No need to duplicate the default values in the function doc string and in the function definition. In fact, this duplication can quickly become stale, when somebody changes the default but forgets to update the doc-string.
ordering pixels along CHID, is the only ordering which makes sense in our python based eco system. We expect this ordering to be the case and want to only talk about the ordering, in case it is not sane.
So we remove the remark about the CHID ordering.
What happens, if somebody makes a mistake ... say they want to specify more fields, but make a typo:
fields={
'charge': 'MSignalCam.fPixels.fPhot',
'arrival_time': 'MFoo.fBar.fBaz'
}
does the function crash in a nice way, so they can try again? or do we get a segFault?
At some point N
was renamed to n_events
... but one instance was forgotten...
There are two unit tests, but they did not find this bug. So there are clearly tests missing
Oh, I just see the tests did not find the error because they seg fault ... at least on my machine ... that is not so good...
The "V1" trick was already implemented in the function names leaves_to_numpy
.
basically read_callisto is just a simple call to leaves_to_numpy
, when redifining the fields
to be a list and not a dict of {name: getter, name2: getter2}, but just list [getter, getter2]
see commit right above
when looking at read_mars
and read_callisto
, they are actually just providing some leaf_names.
the former by iterating over all leaf_names and filtering the invalids away, the latter simply has a default list of leaf_names that are difficult to remember ...
there must be a better way!
Both functions do now throw errors.
read_callisto throws
`---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-a7fba0ecd8c5> in <module>()
----> 1 read_callisto("/media/michi/523E69793E69574F/gamma/callisto/gamma_gustav_12/10917/00010917.917_D_MonteCarlo010_Events.fits.gz_callisto.root")
~/anaconda3/lib/python3.6/site-packages/read_mars-0.0.5-py3.6.egg/read_mars/__init__.py in read_callisto(filename, tree, fields)
126 'MSignalCam.fPixels.fTimeSlope'
127 """
--> 128 return leaves_to_numpy(ROOT.TFile(filename).Get(tree), fields, N=1440)
~/anaconda3/lib/python3.6/site-packages/read_mars-0.0.5-py3.6.egg/read_mars/__init__.py in leaves_to_numpy(tree, leaf_names, N)
59 # See eg. https://root.cern.ch/root/roottalk/roottalk03/0638.html
60
---> 61 n_events = tree.GetEntries()
62 tree.SetEstimate(n_events + 1) # necessary for files with more than 1 M events
63 out = {}
AttributeError: 'PyROOT_NoneType' object has no attribute 'GetEntries'
and read_mars throws:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-587957623746> in <module>()
----> 1 read_mars("/media/michi/523E69793E69574F/gamma/callisto/gamma_gustav_12/10917/00010917.917_D_MonteCarlo010_Events.fits.gz_callisto.root")
~/anaconda3/lib/python3.6/site-packages/read_mars-0.0.5-py3.6.egg/read_mars/__init__.py in read_mars(filename, tree, leaf_names)
106 ]
107
--> 108 df = pd.DataFrame(leaves_to_numpy(tree, leaf_names))
109 file.Close()
110
~/anaconda3/lib/python3.6/site-packages/read_mars-0.0.5-py3.6.egg/read_mars/__init__.py in leaves_to_numpy(tree, leaf_names, N)
69 tree.Draw(leaf_name, "", "goff")
70 v1 = tree.GetV1()
---> 71 v1.SetSize(n_events * tree_v1_dtype.itemize) # a double has 8 Bytes
72 out[leaf_name] = np.frombuffer(v1.tobytes(), dtype=tree_v1_dtype)
73
AttributeError: 'numpy.dtype' object has no attribute 'itemize'
`
do they work for you?
for read_callisto: the multiplication times 1440 in the SetEstimate and SetSize functions is important:
tree.SetEstimate(n_events 1440) and v1.SetSize(n_events 8 * 1440)
otherwise, some values are omitted or the function just fails
Now it works again on my machine. There was a typo, itemize should be itemsize.
When supplying a typo in the fields list of read_callisto, one gets no segfault but a rather cryptic error: ValueError: PyMemoryView_FromBuffer(): info->buf must not be NULL
Nope it was also not working on my machine ... I developed without testing .. and you saw the result. then I had to go ... and you took over ...
I was still developing .. but you didn't know that ...
In the end you spent a lot of time reparing something, which I do not like at all anymore. well ... maybe you like it ... I don't know.
Anyway, please have a look at #11
There is a single test, showing the interface ...
Work in progress
this function reads charge and arrival time from a
_C.root
file coming out of callisto. It is extremely slow ~900 events in 140 sec.I don't know how to speed the stuff up. But maybe you wanna have a look.