cms-sw / cms-bot

A few scripts to automate approval / testing process
27 stars 244 forks source link

Add new label for DataFormat breaking PRs #2245

Open francescobrivio opened 3 months ago

francescobrivio commented 3 months ago

From the operational point of view it would be very useful to have a breaks-dataformats (or something similar) label that can be assigned to PRs which involve the change of the RAW data-format or of any of the derived data-formats (AOD, ALCARECO...). This label can be used to easily spot when we a processing-version or an Era update are needed in Tier0 when a new release is being deployed.

I'm not sure if the label creation can/should be automatized somehow by checking if changes are done to the DataFormat package, since I can imagine having PRs that touch this package without necessarily modifying the data-formats (or if there are other packages that might lead to similar changes in data formats).

Alternatively it can be left to L2s to assign the label to a PR when they review it.

cmsbuild commented 3 months ago

cms-bot internal usage

cmsbuild commented 3 months ago

A new Issue was created by @francescobrivio.

@Dr15Jones, @sextonkennedy, @smuzaffar, @makortel, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 3 months ago

assign operations, core

cmsbuild commented 3 months ago

New categories assigned: operations,core

@Dr15Jones,@davidlange6,@makortel,@rappoccio,@antoniovilela,@smuzaffar,@fabiocos you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 3 months ago

On the automation side, would it be more important to try to cover all such PRs at the expense of some PRs being misclassified (i.e. some amount of false positives would be ok), or to avoid labeling PRs that do not break streamer file reading at the expense of missing some PRs that break (i.e. avoid false positives)?

Assuming the first option, I'd imagine we could easily label automatically all PRs touching DataFormats and IOPool/Streamer, with the ability of removing such label from those PRs that do not change the data layout.

smuzaffar commented 3 months ago

@makortel , is there any way we can check this during PR tests e.g. comparing class versions /checksum ?

makortel commented 3 months ago

is there any way we can check this during PR tests e.g. comparing class versions /checksum ?

I guess theoretically the checksums of all classes defined in classes_def.xml files could be compared. This comparison would have to be different from the edmCheckClassVersion, because that script checks the checksums only for those classes for which a version and checksum are defined in the classes_def.xml, whereas in this check we'd want to compare everything.

francescobrivio commented 3 months ago

On the automation side, would it be more important to try to cover all such PRs at the expense of some PRs being misclassified (i.e. some amount of false positives would be ok), or to avoid labeling PRs that do not break streamer file reading at the expense of missing some PRs that break (i.e. avoid false positives)?

From the operations point of view, it's definitely more important to cover all such PRs (i.e. some amount of false positives is ok), especially if we give the reviewers the option of removing such label in case it's not needed.

We should avaoid at all cost missing some of the PRs that could potentially break the data layout!

makortel commented 3 months ago

On the automation side, would it be more important to try to cover all such PRs at the expense of some PRs being misclassified (i.e. some amount of false positives would be ok), or to avoid labeling PRs that do not break streamer file reading at the expense of missing some PRs that break (i.e. avoid false positives)?

From the operations point of view, it's definitely more important to cover all such PRs (i.e. some amount of false positives is ok), especially if we give the reviewers the option of removing such label in case it's not needed.

Thanks. Then I'd look into some relatively simple way of labeling all PRs that can potentially break streamer file compatibility, and asking the PR reviewer(s) to remove the label if it is clear the data format itself remains unchanged. Maybe the bot could issue a specific message on that to remind the reviewers?


About the label name, I have mixed feelings about breaks-dataformats. We have other data format packages as well (SimDataFormats etc), whereas the use case seems to be heavily connected to data taking. I'm worried using a general name "data formats" could be easy to misunderstand to cover all data formats, rather than only those related to data taking.

Reading the issue description again I became a bit confused. I recall the discussion in ORP was about streamer file compatibility (which made me think breaks-streamer), but the issue description talks about RAW or AOD etc, i.e. the output of Tier0 (in ROOT files, in which case the "break" can be too strong because in many cases the old files can be read thanks to schema evolution).

What about event content definition changes? Do those need similar special attention in Tier0?

francescobrivio commented 3 months ago

About the label name, I have mixed feelings about breaks-dataformats. We have other data format packages as well (SimDataFormats etc), whereas the use case seems to be heavily connected to data taking. I'm worried using a general name "data formats" could be easy to misunderstand to cover all data formats, rather than only those related to data taking.

Reading the issue description again I became a bit confused. I recall the discussion in ORP was about streamer file compatibility (which made me think breaks-streamer), but the issue description talks about RAW or AOD etc, i.e. the output of Tier0 (in ROOT files, in which case the "break" can be too strong because in many cases the old files can be read thanks to schema evolution).

What about event content definition changes? Do those need similar special attention in Tier0?

Hi Matti, indeed the request I made is more connected to data-taking: I believe we are interested in marking also cases were the RAW or derived data formats are changed because this requires a change of Era or processing version in Tier0. Quoting directly from the ORM Twiki (revision=106):

General guidelines for requesting an era change as opposed to a processing version change :
 - Change of ERA:
      - any change of RAW data-format;
      - [...other conditions not relevant to this discussion...]
 - Change of processing version:
      - update of derived data-formats (AOD, ALCARECO);
      - [...other conditions not relevant to this discussion...]

Additionally, we are definitely interested in marking the PRs that involve a change of the streamers format.

I have no strong opinion on the label name I proposed in the original issue description, any other name that can help identify quickly and easily such PRs is perfectly fine for me, I don't know if other have a different option. (Maybe we could go with a "less strong" changes-dataformats?)

mmusich commented 2 months ago

Hello, what's prognosis for this issue?

smuzaffar commented 1 month ago

Hi, As discussed in ORP and Core SW meeting yesterday, for now I will implement the following

Later we can improve it by actually checking the class versions/checksums in the PR and compare it with IB version/checksums and add this label if a PR class checksum is different than IB class version. I think if there are only newly added persistent classes then there is no need to add this label .... right?

smuzaffar commented 1 month ago

https://github.com/cms-sw/cms-bot/pull/2289 should allow to add the changes-dataformats label by an explicit comment by any L2/ORP

smuzaffar commented 1 month ago

https://github.com/cms-sw/cms-bot/pull/2290 allows to automatically add changes-dataformats label if PR touches any src/classes_def*.xml file. One can remove the automatic label by adding type -dataformat or type -changes-dataformats comment

makortel commented 1 month ago

2290 allows to automatically add changes-dataformats label if PR touches any src/classes_def*.xml file. One can remove the automatic label by adding type -dataformat or type -changes-dataformats comment

After a bit more thought I think the "classes_def.xml is changed" is too eager (i.e. leads to changes-dataformats being set automatically where there are no changes in reality).

First, the check should be restricted to the specific packages that contain persistable data formats: AnalysisDataFormats, CalibFormats, CondFormats, DataFormats, FastSimDataFormats, SimDataFormats, TBDataFormats. Other packages can have classes_def.xml files to define dictionaries e.g. for transient-only data formats, or classes to be used interactively in ROOT.

Second, a classes_def.xml file can be changed in a way that does not correspond to an actual data format change. For example, one could add a newline, reorder entries, add a transient-only data format there, add a transient-only member to a persistable data format class, etc.


Later we can improve it by actually checking the class versions/checksums in the PR and compare it with IB version/checksums and add this label if a PR class checksum is different than IB class version.

I have a work-in-progress prototype for a script to dump the class versions and checksums of all non-transient classes defined in given classes_def.xml file in https://github.com/cms-sw/cmssw/pull/45423. I have some questions there on the further development that depend on how exactly we want to implement this check in the bot (I suggest we continue that discussion in the PR).

makortel commented 1 month ago

I think if there are only newly added persistent classes then there is no need to add this label .... right?

Based on my understanding of the use cases (possible streamer file incompatibility, possible need to change ERA, possible need to change processing version) I'd say also the case where a new persistent classes are added the label should be added (in general).

makortel commented 1 month ago

First, the check should be restricted to the specific packages that contain persistable data formats: AnalysisDataFormats, CalibFormats, CondFormats, DataFormats, FastSimDataFormats, SimDataFormats, TBDataFormats.

I could not I'm not really sure if CalibFormats and CondFormats should be in this list, but I found dictionaries defined for edm::Wrapper<T> instantiations that were not marked as transient, so from the framework point of view these are "persistable" https://github.com/cms-sw/cmssw/blob/e232a9b23f2f9125c636fbd3b70f3920f1ca6970/CondFormats/Common/src/classes_def.xml#L10 https://github.com/cms-sw/cmssw/blob/e232a9b23f2f9125c636fbd3b70f3920f1ca6970/CondFormats/OptAlignObjects/src/classes_def.xml#L33-L34 https://github.com/cms-sw/cmssw/blob/e232a9b23f2f9125c636fbd3b70f3920f1ca6970/CalibFormats/CaloObjects/src/classes_def.xml#L2-L7