RNO-G / mattak

RNO-G dataformats
1 stars 1 forks source link

Dataset interface to specify data to read-in is inflexible #7

Closed fschlueter closed 5 months ago

fschlueter commented 1 year ago

Hi @cozzyd,

the title of the issue is more my feeling/opinion. The interface should reflect what "use cases" we will have and they probably change/extend when mattak is integrated into NuRadioMC. So, some brainstorming might be useful. What could be the use cases and how do we want to serve those.

Let me summarise what I think the current status is (so you can correct me if necessary) and make some suggestion what to extend.

Right now you can specify a station (id), run (id) and directory (optional). If station and run are not zero the directory is considered a root directory and the files with in directory/station/run are read-in. If station and run zero, mattak looks for waveforms.root or combined.root (in this order) in the specified directory (not in any subdirectory).

So, what is missing is the possibility to, at once, read in several stations/runs. Do you think that can be easily implemented in mattak or should that be implemented in the NuRadioMC module?

cozzyd commented 1 year ago

If you feel this is an important feature (as opposed to the consumer looping over datasets), then it can be implemented straightforwardly. Do you have a suggestion for a preferred interface for this? One option would be to define a MultiDataset( ) that accepts either tuples of run/event numbers or directories to look in, or should we further overload Dataset?

The intention eventually is to have a stable API surface here.

fschlueter commented 1 year ago

I think it would be an important feature! If we just use mattak in one or two NuRadioMC modules we could loop there over datasets, however, if we want to enable any user to quickly process some data via mattak we should make it as convenient as possible.

What comes to my mind right now would be the following: You specify a directory and Mattak parses through all subdirectories and gets all events in those. Additional what would be nice to have is to filter for a specific station id (and maybe even specify a range of runs although that might be unnecessary).

fschlueter commented 1 year ago

And I actually would overload Dataset if possible.

fschlueter commented 1 year ago

Another idea: we could allow to directly pass a list of file names or/and specify which file to take in a given directory.

cozzyd commented 1 year ago

The problem with this is that full datasets require multiples files passed per run (since generating a combined file for full datasets would take twice the space for no reason).

On 2/27/23 11:30, Felix Schlüter wrote:

Another idea: we could allow to directly pass a list of file names or/and specify which file to take in a given directory.

— Reply to this email directly, view it on GitHub https://github.com/RNO-G/mattak/issues/7#issuecomment-1446747817, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGHWKJA4SN5ZENG3SXSSPDWZTQDZANCNFSM6AAAAAAVGAYTQU. You are receiving this because you were mentioned.Message ID: @.***>

fschlueter commented 1 year ago

I see, but we have these combined files which are "standalone", correct? And I guess that would also allow to read files like the one created by zack. But I guess if it is to inversive it is not worth it.

cozzyd commented 1 year ago

So one thing that can clean this up which I've already implemented but not merged is that if you pass a file instead of a directory to data_dir, it could treat it as a combined.root file (since it's easy to check if a path is a file or a directory). Would that help?

shallmann commented 1 year ago

I think if a .root file is passed explicitly it is fine to treat it as combined.root ... if they are not combined one might anyway want to store individual files in a directory (or have access to a cluster where the data is stored in the usual directory structure) I would go for that option

fschlueter commented 1 year ago

Alright, let me see if I can brew something together

cozzyd commented 1 year ago

This branch attempted to do this, among a few other things : https://github.com/rno-g/mattak/tree/add-dataset-opts

It allows treating directory as a file and also setting a preferred file name (like if you has a Felix.root in each directory you wanted to load as a combined fire). Looks like will require remerging. I'm going to be on a plane most of the day today so probably won't get to it today.

On April 28, 2023 7:16:13 AM CDT, "Felix Schlüter" @.***> wrote:

Alright, let me see if I can brew something together

-- Reply to this email directly or view it on GitHub: https://github.com/RNO-G/mattak/issues/7#issuecomment-1527479152 You are receiving this because you were mentioned.

Message ID: @.***>

fschlueter commented 1 year ago

Alright, should I have a look than?

PS: I why not every station should write their own files :)

fschlueter commented 10 months ago

As a reminder for us. Once the interface is improved to take file paths we can remove the dirty hack in NuRadioReco implemented with this PR https://github.com/nu-radio/NuRadioMC/pull/567

fschlueter commented 5 months ago

This issue can probably be closed now after #25 was merged. I will reopen it if it turns out we want an even more flexible interface