XENONnT / straxen

Streaming analysis for XENON
BSD 3-Clause "New" or "Revised" License
20 stars 32 forks source link

Try to make hashing more coinsistent #1201

Closed LuisSanchez25 closed 1 year ago

LuisSanchez25 commented 1 year ago

Before you submit this PR: make sure to put all operations-related information in a wiki-note, a PR should be about code and is publicly accessible

What does the code in this PR do / what does it improve?

It aims to ensure URLConfigs with the same strings but in different order gives you the same hash by alphabetizing the arguments as a preprocess for the URLConfig class.

Can you briefly describe how it works?

Before the URL is hashed, a preprocessing function for the URLConfig class is run to alphabetize the order of the queries in the URL

Can you give a minimal working example (or illustrate with a figure)?

This is also included on the tests: st1 = xenonnt_online() st2 = xenonnt_online()

Same URLs but the queries are in a different order

st1.apply_xedocs_configs(version = 'v_test1', db='development_db') st2.apply_xedocs_configs(version = 'v_test2', db='development_db') assert st1.key_for(25000, 'corrected_areas').lineage_hash == st2.key_for(25000, 'corrected_areas').lineage_hash

Please include the following if applicable:

Notes on testing

All italic comments can be removed from this template.

LuisSanchez25 commented 1 year ago

Hey Yossi, I made the code with the preprocessor decorator (altough I dont think this was actually necessary since there was already inbuilt code to the URLConfig class to handle this). I tried following he code on how context configs are set and at no point that I could find was the class URLConfig ever called. I dealt with the issue in the apply_xedocs_config by including a function that reorders the URLs before calling the set_config function, at which point I believe the hash is made.

coveralls commented 1 year ago

Coverage Status

coverage: 93.522% (+0.006%) from 93.516% when pulling a82a20749f0b463f475a9609fe3d3a00b26a6ab0 on coinsistent_hash into 8da4b6fe28c3bcbbee6a68b11fd411cb98e95ace on master.

WenzDaniel commented 1 year ago

@LuisSanchez25 short question how does then the URL config in the lineage look like?

st.key_for(run_id, data_type).lineage

Will it be alphabetically sported or as provided as config to the plugin?

jmosbacher commented 1 year ago

@WenzDaniel The preprocessors are run by URLConfig at the plugin initialization stage before setting plugin.config, so the URL in the lineage will always be sorted.

WenzDaniel commented 1 year ago

I am afraid I have not followed the issue enough, but I am worried a bit that this could lead to confusion. What would happen if I pass the sorted config string again as a config in straxen? Would this work? Will we confuse analysts with having a different config in the lineage with respect to the plugin code?

jmosbacher commented 1 year ago

Right, but on the other hand this is nothing new right? The context already performs modifications on the config of the plugin prior to setting it, such as using the default value if it's not set or taking the value from a dict based on the run I'd if the defaults by run I'd is enabled. This just adds to that logic... The change to the url is rather minor since it just changes the order of the parameters at the end of the url. That being said I was not a strong advocate of making all equivalent urls resolve to the same hash, I just mentioned that is easy to implement using the preprocessor framework. But it did seem to be the preferred option by most in the team a meeting we had about it

WenzDaniel commented 1 year ago

Right, but on the other hand this is nothing new right? The context already performs modifications on the config of the plugin prior to setting it, such as using the default value if it's not set or taking the value from a dict based on the run I'd if the defaults by run I'd is enabled. This just adds to that logic...

I see, my impression is just that this is a bit more transparent, but also might be just personal bias.

That being said I was not a strong advocate of making all equivalent urls resolve to the same hash, I just mentioned that is easy to implement using the preprocessor framework. But it did seem to be the preferred option by most in the team a meeting we had about it

I also think that it makes a lot of sense and that the hash should depend on the data and not the order of the config. I am just wondering if this needs more communication with the users. E.g. force all default settings to be sorted, and warn the user in case a non-sorted setting is used, that it is going to be sorted in the hash. But might also be over engineering.

If people are happy as it is I am also fine, I have no strong opinions either way.

jmosbacher commented 1 year ago

A warning would be nice but it would probably have to be a single global warning otherwise we will have tens of warnings every time someone processes any data. This makes the implementation of the warning a bit more complex.

LuisSanchez25 commented 1 year ago

I think we can put a warning for maybe this version informing people about this change and to use older version of straxen if the want to access data they already processed with their URLs. We can then remove the warning in a version or two from now when people become familiar with this. Can it be a warning that just shows up when you first import straxen?

LuisSanchez25 commented 1 year ago

@jmosbacher @WenzDaniel I added a warning to the init file which should only be displayed when you first import the straxen file. Let me know if I should change the message in any way or if there are any concerns with the way I implemented it.

WenzDaniel commented 1 year ago

Thanks @LuisSanchez25. Two little things I doubt that the next release will be 2.0.8, but more like 2.1.0. I think there were quite some changes to reason a major bump, but maybe @dachengx can comment. Maybe add a final sentence like, that nothing changes in the usage of URL configs or how one overwrites a default setting.

dachengx commented 1 year ago

Thanks @LuisSanchez25. Two little things I doubt that the next release will be 2.0.8, but more like 2.1.0. I think there were quite some changes to reason a major bump, but maybe @dachengx can comment. Maybe add a final sentence like, that nothing changes in the usage of URL configs or how one overwrites a default setting.

Thanks, @LuisSanchez25 . I agree that the next release should be at least 2.1.0

jmosbacher commented 1 year ago

Can you only raise a warning if a URL was affected by the preprocessor? So can we move the warning logic to the preprocessor function and raise the error if both the URL was changed and warning has not yet been raised

LuisSanchez25 commented 1 year ago

From my understanding the warning function will only appear once when it is called, I have checked this by trying to invoke it multiple times so this setup should work. I believe this is ready to merge.

jmosbacher commented 1 year ago

Can we change this to:

From straxen version 2.1.0 onward, URLConfig parameters will be sorted alphabetically before being passed to the plugins, this will change the lineage hash for non-sorted URLs. To load data processed with non-sorted URLs, you will need to use an older version.
LuisSanchez25 commented 1 year ago

Ah yes you are correct, I will change it to using the testing context

dachengx commented 1 year ago

Hey, @jmosbacher. Is this good enough? It looks good to me, especially the URLConfig.preprocessor and warning.

jmosbacher commented 1 year ago

@dachengx there are still two open changes i requested:

dachengx commented 1 year ago

@LuisSanchez25 For the 2nd request, I guess you can try https://github.com/XENONnT/straxen/blob/8da4b6fe28c3bcbbee6a68b11fd411cb98e95ace/tests/plugins/_core.py#L68.

jmosbacher commented 1 year ago

@LuisSanchez25 please try to run the tests locally before pushing to github. We only have 3000 cpu-minutes every month of github actions until it starts costing money and these tests can take a lot of resources

dachengx commented 1 year ago

@LuisSanchez25 @jmosbacher Thanks!