Alliance-for-Tropical-Forest-Science / DataHarmonization

Code to run the data harmonization app and support cross-site analysis
https://alliance-for-tropical-forest-science.github.io/DataHarmonization/

reproducible corrections #43

Open gabrielareto opened 1 year ago

gabrielareto commented 1 year ago

Corrections have many options and it will be difficult (impossible) for two teams to run the same corrections.

Communication between different teams that want to aggregate data is, and will be, extremely limited. There is basically no mutual understanding of the datasets. People running the app share the output data but not the profile, etc.

Anything that is not in the profile is a vulnerability in data federation, and for data corrections in particular, because stacking and merging have, in theory, objective rules, while the parameters in these corrections are subjective.

The good practice is to merge the datasets first and then use the app for corrections. But many of our warning texts go unnoticed by users. We need to enforce good practices beyond warning messages.

It may make sense to split the app in two, and add more detail in the corrections, because it is difficult to understand all those parameters on the fly.

gabrielareto commented 1 year ago

I guess the take-home message is that everything the user does in the app should be stored in the profile, so the whole thing is fully reproducible using the profile and no other input.

ValentineHerr commented 1 year ago

This is where saving two profiles in the downloaded zip gets tricky...

I think that the correction info should be saved in the input profile. This is where everything the user selected is stored, and this is what they will give a colleague to use as an output profile.

The problem is that we are now saving both the input profile and the output profile (mostly to accommodate the situation where the output profile is the app's standard, even though that is technically unnecessary since the user can use the app's profile as input profile). Now the user may be confused about what profile to share, and now I am confused about where to store the info about the corrections.

Referring to issue #46.

ValentineHerr commented 1 year ago

Also, another issue is that pre-populating may not work well if, for example, the user selected pioneer species that don't exist in the other user's data.

How about we simply save a .csv file with what the parameters of the selected function were?
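As a rough illustration of this suggestion, the selected correction parameters could be written in a long format with one row per function and parameter. This is only a sketch: the function names, parameter names, and values below are hypothetical and are not taken from the app.

```python
import csv
import io


def write_correction_parameters(corrections, handle):
    """Write one row per (function, parameter) pair in long format."""
    writer = csv.writer(handle)
    writer.writerow(["function", "parameter", "value"])
    for function_name, parameters in corrections.items():
        for parameter, value in parameters.items():
            # Multi-valued selections (e.g. several species) are joined
            # with ';' so that each parameter stays in a single cell.
            if isinstance(value, (list, tuple)):
                value = ";".join(str(v) for v in value)
            writer.writerow([function_name, parameter, value])


# Hypothetical selections: these names are illustrative assumptions,
# not the app's actual correction functions or arguments.
corrections = {
    "diameter_correction": {
        "max_growth_cm_per_year": 5,
        "pioneer_species": ["Cecropia obtusa", "Pourouma bicolor"],
    },
    "status_correction": {"death_confirmation_censuses": 2},
}

buffer = io.StringIO()  # stands in for a file inside the downloaded zip
write_correction_parameters(corrections, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

A long format like this stays readable in a spreadsheet and does not need a new column each time a correction gains a parameter.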

ValentineHerr commented 1 year ago

> How about we simply save a .csv file with what the parameters of the selected function were?

The problem with that is that we won't be able to pre-populate the app with what the output profile says in terms of corrections... But the csv file should be plenty to know what to select.
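If pre-population were attempted anyway, one possible safeguard is to split the saved values into those that exist in the new dataset and those that do not, and to report the latter to the user. The sketch below assumes a saved list of pioneer species; the helper name and the species lists are hypothetical.

```python
def split_by_availability(saved_values, available_values):
    """Separate saved selections into usable and missing ones."""
    available = set(available_values)
    usable = [v for v in saved_values if v in available]
    missing = [v for v in saved_values if v not in available]
    return usable, missing


# Hypothetical example: species saved by one team vs. species present
# in another team's dataset.
saved_pioneers = ["Cecropia obtusa", "Pourouma bicolor"]
species_in_new_data = ["Cecropia obtusa", "Dicorynia guianensis"]

usable, missing = split_by_availability(saved_pioneers, species_in_new_data)
print("pre-populate with:", usable)
print("not in this dataset:", missing)
```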

ValentineHerr commented 1 year ago

> Now the user may be confused about what profile to share

I think that is a more important concern

gabrielareto commented 1 year ago

Yes, some parts of this problem are confusing... I think it is because corrections refer to the process of [going from A to B] without the use of a stepping stone.

For the moment, let's respect and protect the idea that the profile of a given dataset X is a form of metadata: it does the work of [describing the dataset X], which is equivalent to [describing the dataset X in terms of the app's language and conventions] or [how to go from X to the app's central standard]. Whether these are inputs or outputs should not matter. Are there inconsistencies now between this idea and the current functioning of the app?

Correction parameters do not fit in this type of profile. For the moment, store parameters for corrections in a csv file. This is the minimum that we can do to help with the communication needed.

We will need to talk with @cpiponiot, at least, about how corrections fit in this app:

This could be a broader Zoom call with TmFO and RBA, who are working on within-network federation.

gabrielareto commented 1 year ago

Conceptually, we can imagine an app's output like this:

(I am using from/to instead of raw/processed to avoid confusion, but I do not think these are the best words. I hope you understand what I mean.)

I do not claim this would be the best output for our users. It may or may not be too complicated. But it seems that corrected vs. not corrected and format A vs. format B are two independent axes of data transformation.
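One way to read the "two independent axes" point is that every output can be labelled by a format and a correction state, giving four combinations. The sketch below simply enumerates them. It is an illustration of the idea under that assumption, not a description of what the app currently produces, and the labels are placeholders.

```python
from itertools import product

# Two independent axes of data transformation (labels are placeholders).
FORMATS = ("from", "to")
CORRECTION_STATES = ("not corrected", "corrected")


def describe_outputs():
    """List every combination of format and correction state."""
    return [
        {"format": fmt, "corrections": state}
        for fmt, state in product(FORMATS, CORRECTION_STATES)
    ]


outputs = describe_outputs()
for output in outputs:
    print(f"format: {output['format']:<5} corrections: {output['corrections']}")
```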

I think we need to discuss these points before actually trying to incorporate the reproducibility of corrections.