JeffersonLab / hallc_replay

Replay directory for Hall C hcana

[WIP] Switch the submodule dependencies #471

Closed whit2333 closed 6 years ago

whit2333 commented 6 years ago

Git submodules are best used for dependencies (if at all). The UTIL submodules of hallc_replay have this relationship backwards: hallc_replay should instead be a submodule of each UTIL repository. If things continue as they are, there will be an ever-increasing number of "UTIL" submodules, which is certainly not a good thing.

I propose that the UTILs be shed as submodules and the UTIL adopt hallc_replay as a submodule.

This way the experiments can pull the latest hallc_replay into their own "UTIL" (which should probably be called something else at this point). If a "UTIL" is not experiment-specific, its contents should be pushed into hallc_replay; otherwise it becomes a separate repo with hallc_replay added as a submodule.
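A minimal sketch of the proposed inversion, using local stand-in repos in place of the real GitHub URLs ("MYEXP_replay" is a hypothetical experiment repo, not one that exists): the experiment repo owns hallc_replay as a submodule pinned to one known-good commit.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d); cd "$tmp"

# Stand-in for the upstream hallc_replay repo.
git init -q hallc_replay_upstream
(cd hallc_replay_upstream && git -c user.email=ci@ci -c user.name=ci \
    commit -q --allow-empty -m "standard replay files")

# The experiment-owned repo, with hallc_replay as a pinned submodule.
git init -q MYEXP_replay
cd MYEXP_replay
git -c protocol.file.allow=always \
    submodule add "$tmp/hallc_replay_upstream" hallc_replay
git -c user.email=ci@ci -c user.name=ci \
    commit -q -m "Pin hallc_replay to a known-good commit"

# The superproject records the exact commit (mode 160000 = gitlink).
git ls-tree HEAD hallc_replay
```

The gitlink entry is what makes the pin reproducible: cloning MYEXP_replay with `--recurse-submodules` always yields exactly that hallc_replay commit.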

What are your thoughts on this?

MarkKJones commented 6 years ago

Nothing in the analysis code, hcana, depends on having a specific directory structure for the parameter files, database files, etc. I think everything can be defined in the script or in a parameter.

We probably should put the CALIBRATION directory into its own github repo.

Right now there are two models. The first is to clone hallc_replay, make it a new git repo called hallc_replay_experiment, and then in the future just cherry-pick any commits from other repos. The other is to make submodules that work inside the hallc_replay structure. Either way we have a hallc_replay for each experiment or a UTIL submodule for each experiment. I do not really see the difference.

The hallc_replay would be a record of the detector, crate and scaler maps along with the reconstruction files. It would also be a record of parameters that do not change much like geometry. It would be a record of recommended starting parameters, script and report templates.

whit2333 commented 6 years ago

Hi Mark,

I am suggesting something like your first option.

Check this out. https://www.atlassian.com/blog/git/git-submodules-workflows-tips

When a component or subproject is changing too fast or upcoming changes will break the API, you can lock the code to a specific commit for your own safety

So in my view it is better to have hallc_replay as a submodule fixed to a commit. It is a far less restrictive mode of development.

If it is the other way around, any development to hallc_replay is bogged down by having to not break the UTILs.
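The workflow being argued for here can be sketched end to end with local stand-in repos in place of the real GitHub URLs (names are illustrative): with hallc_replay pinned as a submodule, upstream develops freely and the experiment advances its pin only when it chooses to.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q upstream                       # stand-in for hallc_replay
(cd upstream && git -c user.email=ci@ci -c user.name=ci \
    commit -q --allow-empty -m "v1")

git init -q MYEXP_replay                   # the experiment's repo
cd MYEXP_replay
git -c protocol.file.allow=always submodule add "$tmp/upstream" hallc_replay
git -c user.email=ci@ci -c user.name=ci commit -q -m "pin hallc_replay at v1"
old=$(git rev-parse HEAD:hallc_replay)

# Upstream moves on; the experiment is unaffected until it updates the pin.
(cd "$tmp/upstream" && git -c user.email=ci@ci -c user.name=ci \
    commit -q --allow-empty -m "v2")
(cd hallc_replay && git fetch -q origin && git checkout -q FETCH_HEAD)
git add hallc_replay
git -c user.email=ci@ci -c user.name=ci commit -q -m "advance pin to v2"
new=$(git rev-parse HEAD:hallc_replay)
[ "$old" != "$new" ] && echo "pin advanced"
```

Nothing breaks in MYEXP_replay while upstream churns; the "advance pin" commit is an explicit, reviewable event.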

EDIT:

Here is a test repo forked from UTIL_SIDIS and renamed EXPERIMENT_replay. https://github.com/whit2333/EXPERIMENT_replay It uses a commit on this branch https://github.com/whit2333/hallc_replay/tree/no_submodules

We probably should put the CALIBRATION directory into its own github repo.

Yes, I agree. This should also be a submodule of EXPERIMENT_replay. Essentially EXPERIMENT_replay will be an example of how an experiment should set up its replay and how to use submodules to pin to specific commits of upstream repos. This also isolates components which can be updated and pushed upstream, such as CALIBRATION and the DBASE settings.

whit2333 commented 6 years ago

@MarkKJones Here is an example of what I had in mind: https://github.com/whit2333/EXPERIMENT_replay

It could serve as a "new experiment replay template" (which should always be synchronized with the current running experiment).

pooser commented 6 years ago

The original idea behind configuring experiment-specific submodules was to provide total autonomy to the experiments on the floor, in both the online and offline environments, without having to interact with hallc-replay in any significant way. Utilizing this workflow also prevents the gatekeepers of hallc-replay from having to constantly track changes in the online environment.

The 'EXPERIMENT_replay' repo concept is indeed a viable workflow. This is precisely what I configured for the F2/XEM offline analysis. Any time significant changes are made in hallc-replay, I simply check out those files or cherry-pick an entire commit. This workflow works quite well in my opinion, and I think it is ideal since it also reduces the workload on the hallc-replay gatekeepers.
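The fork-and-cherry-pick model described above can be sketched with local stand-in repos (repo and file names are illustrative, not the actual F2/XEM setup): the experiment's fork picks up individual upstream hallc_replay commits as needed.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q hallc_replay                   # stand-in for upstream
(cd hallc_replay && git -c user.email=ci@ci -c user.name=ci \
    commit -q --allow-empty -m "base")

git clone -q hallc_replay hallc_replay_experiment   # the experiment's fork

# A fix lands upstream...
cd hallc_replay
echo "new cut" > standard.param
git add standard.param
git -c user.email=ci@ci -c user.name=ci commit -q -m "update standard.param"
fix=$(git rev-parse HEAD)

# ...and the fork takes just that one commit.
cd "$tmp/hallc_replay_experiment"
git fetch -q origin
git -c user.email=ci@ci -c user.name=ci cherry-pick "$fix"
cat standard.param
```

Checking out single files instead of whole commits would be `git checkout FETCH_HEAD -- standard.param`, which skips the history but grabs the content.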

If various run groups do not wish to maintain their own 'EXPERIMENT_replay' repo, I would argue that we maintain the 'UTIL_EXPERIMENT' workflow, which proved to work well during the fall and spring commissioning runs for exactly the aforementioned reasons. In short, the current philosophy is that how each run group interacts with both the online and offline environments via hallc-replay is entirely up to them; however, any changes that are not generally applicable to everyone must be configured and maintained in their own privately maintained repository. Moreover, it is the run group's responsibility to have a workflow prepared prior to hitting the floor, and to actively maintain the repo in both the online and offline environments.

whit2333 commented 6 years ago

Utilizing this workflow, also prevents the gate keepers of hallc-replay from having to constantly track changes in hallc-replay in the online environment.

Neither approach suffers from this. But the current method does suffer from not being maintainable N years from now.

Any time significant changes are made in hallc-replay I simply checkout those files or cherry pick an entire commit. This workflow works quite well in my opinion. I think this particular workflow is ideal since this also reduces the workload on the hallc-replay gate keepers.

Having an EXPERIMENT_replay for each experiment allows the (now appropriately inverted) submodules to be maintained while pinned to a commit. See the README here: https://github.com/whit2333/EXPERIMENT_replay.

The problem I see is that having submodules for each experiment is backwards. Each experiment should have the standard files as a submodule. In 5 years, how many UTIL_X submodules will have accumulated? When you pull the replay and want to rerun it, you have to go back and figure out which tag was used and make sure you're using the correct UTIL. Again, all of this is because the use of submodules is backwards.

If various run groups do not wish to maintain their own 'EXPERIMENT_replay' repo, I would argue that we maintain the 'UTIL_EXPERIMENT' workflow. This workflow proved to work well during the fall and spring commissioning runs for exactly the aforementioned reasons.

The EXPERIMENT_replay should be maintained for the current experiment. The experiment should then tag and fork when completed. I say fork because there will be further offline replay development (some of which can be pushed upstream). There is no maintenance to be done by the Hall C staff after the experiment is run. And it allows, say, if the Cherenkov mirrors break 10 years from now :) , the tag or fork to be checked out and the replay to be exactly as it was at the time!
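The tag-at-completion step might look like the following sketch (repo and tag names are illustrative): tagging freezes the replay state, including any pinned submodule commits, so it can be checked out unchanged years later.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q MYEXP_replay
cd MYEXP_replay
git -c user.email=ci@ci -c user.name=ci \
    commit -q --allow-empty -m "final replay state"

# Freeze the state at the end of the run with an annotated tag.
git tag -a -m "replay state at end of experiment" MYEXP-final

# Ten years later: check out the tag and the replay is exactly as it was.
git checkout -q MYEXP-final
git describe --tags
```

On a real repo with submodules, recovering the full tree would be `git clone --recurse-submodules -b MYEXP-final <url>`, which also restores every pinned submodule commit.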

In-short, the current philosophy is that how each run group interacts with the both the online and offline environment via hallc-replay is entirely up to them however, any changes that are not generally applicable to everyone must be configured and maintained in their own privately maintained repository.

Of course.

Moreover, it is the run groups responsibility to have a workflow prepared prior to hitting the floor, and that they actively maintain the repo both in the online and offline environment.

This is up for debate, but there is no reason not to have a standard online workflow set up for everyone to start with. Furthermore, it is very important that this standard workflow be a good example, and using submodules the wrong way, I argue, is not a good starting point.

Cheers, Whit

whit2333 commented 6 years ago

@sawatzky, @gaskelld suggested I bring this issue to your attention. I think the sooner this is fixed the better. I added a script that makes the symbolic links needed for a hallc_replay-like directory: https://github.com/whit2333/EXPERIMENT_replay
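A sketch of what such a link-making script might do (the directory names are taken from the hallc_replay layout discussed in this thread; the exact list in the real script may differ): expose the submodule's standard directories at the top level of EXPERIMENT_replay so existing scripts see a hallc_replay-like tree.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d); cd "$tmp"

# Pretend layout: the hallc_replay submodule checked out inside the repo.
mkdir -p hallc_replay/PARAM hallc_replay/DBASE hallc_replay/CALIBRATION

for d in PARAM DBASE CALIBRATION; do
    # -s symbolic, -f replace a stale link, -n do not follow an existing link
    ln -sfn "hallc_replay/$d" "$d"
done
ls -l PARAM DBASE CALIBRATION
```

Experiment-specific directories (DBASE overrides, scripts) then live beside the links as ordinary tracked files.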

MarkKJones commented 6 years ago

I have to think about it. I do not really understand making hallc_replay a submodule. I have to think about what needs to be tracked for all experiments, like online_GUI, geometry parameters, crate/scaler/detector maps, optics files, and calibration code, versus what is experiment-specific, like kinematics, databases, calibration parameters, flag parameters, scripts, test/histogram definition files, and report files. Maybe make separate submodules for each: online_GUI; calibrations; geometry parameters; crate/scaler/detector maps; optics files; and calibration codes.

brash99 commented 6 years ago

Perhaps we could have a Thursday meeting (on an off week from the regular analysis meeting) to discuss this (although I certainly am willing to take Mark’s opinions as the guiding principle).

Cheers, E.

Dr. Edward J. Brash Department of Physics, Computer Science & Engineering Christopher Newport University work: (757) 594-7451 cell: (757) 753-2831 www.cnu.edu/pcs


whit2333 commented 6 years ago

@MarkKJones

Maybe make separate submodules for each: online_GUI; calibrations; geometry parameters; crate/scaler/detector maps; optics files and calibration codes separately.

Yes! This segmentation is sort of the logical end of flipping the submodule relationship. I called this test repo EXPERIMENT_replay to differentiate it from hallc_replay, but conceptually they are the same thing.

Essentially what you are suggesting is to fork hallc_replay N times, once for each independent directory (e.g. removing everything but the CALIBRATION directory, like I did here):

               |------> TEMPLATES
               |------> DBASE
hallc_replay   |------> CALIBRATION
               |------> PARAM
               |------> DATFILES
               |------> ...

I think it would be premature to do this for every directory. It would be better to keep them in a larger "hallc_replay" submodule repo until, for example, PARAM no longer needs to track along with the rest of hallc_replay. At that point you can simply fork hallc_replay into a new hallc_PARAM submodule repo.

This is how git submodules were designed to be used. It might not be obvious, but because hallc_CALIBRATION shares the same pre-fork commit history as hallc_replay, any new calibrations developed by a pre-fork experiment can still be pushed upstream to the hallc_CALIBRATION fork. Thus it is a very natural process.
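The fork-and-prune split described above can be sketched with a local stand-in repo (file names are illustrative): the fork keeps the full pre-fork history, which is what lets later calibration commits flow back upstream.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d); cd "$tmp"

# Stand-in hallc_replay with two of its directories.
git init -q hallc_replay
cd hallc_replay
mkdir CALIBRATION PARAM
echo "gain constants" > CALIBRATION/hodo.txt
echo "cuts" > PARAM/cuts.txt
git add .
git -c user.email=ci@ci -c user.name=ci commit -q -m "pre-fork state"

# Fork with full history, then prune everything but CALIBRATION.
cd "$tmp"
git clone -q hallc_replay hallc_CALIBRATION
cd hallc_CALIBRATION
git rm -r -q PARAM
git -c user.email=ci@ci -c user.name=ci commit -q -m "keep only CALIBRATION"

# The shared pre-fork commit is still in the log.
git log --oneline
```

Because "pre-fork state" exists in both repos, a commit made on top of it in one repo can be cherry-picked or merged into the other without history surgery.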

MarkKJones commented 6 years ago

Sorry for my slow response. I talked with a few people. My feeling is that we should keep the present hallc_replay directory with submodules for each experiment. The hallc_replay is only for online replay, and the submodules are relatively small directories which contain scripts and utilities that the experiment wants the online shift crew to use. The offline replay directory can be whatever the experiment wants. It could be a spin-off (fork?) of hallc_replay.

whit2333 commented 6 years ago

Hi @MarkKJones,

The hallc_replay is only for online replay and the submodules are relatively small directories with contain script and utilities that the experiment wants the online shift crew to perform

Fair enough. But why bother with submodules at all?

IMO it is generally better not to use submodules at all. In this case, as time goes on it will make life worse, because the submodule dependency chain is inverted (backwards).

I'll close this issue now.