Alliance-for-Tropical-Forest-Science / DataHarmonization

Code to run the data harmonization app and support cross-site analysis
https://alliance-for-tropical-forest-science.github.io/DataHarmonization/

is stacking and merging and tidying in the profile? #19

Closed gabrielareto closed 1 year ago

gabrielareto commented 1 year ago

These three steps should be included in the profile if we want to ensure reproducibility of the whole flow. This is important: I see users behaving a little erratically during these three steps as well.

I assume these steps are not included in the profile, because the profile is uploaded later.

Why don't we include these steps in the profile?

ValentineHerr commented 1 year ago

I think this will be tricky and may cause a lot of glitches. This would require the user to always have the same set of tables and name them exactly the same each time they use the app. We could try to help with that (based on the profile, if it is uploaded first), but I think that restricts the use of the profile by the user if, for example, they want to add a table, or if they want to use it on another dataset which may not be composed of the exact same set of tables.

I don't think it is worth getting into this and all the unexpected consequences that it may cause. I believe most users know (or at least learn by using the app) what they need to stack, merge and tidy, and once they've done it a couple times it goes really fast.

gabrielareto commented 1 year ago

thank you for the input. This requires further conversation / thought.

there is one scenario in which a user interacts with the app for some hours or days until they complete the whole process. In that scenario, it is not a big deal if the whole process is not encoded in the profile. I am thinking of a different scenario: Team A works continuously on species identification or whatever, and passes an updated version of their database to Team B or larger network every year or every two years or every time someone wants to do a collaborative paper. Our users do not do the same from day to day, we cannot expect them to do the same from year to year.

selecting keys is surprisingly difficult for many users -- but many users are not the owners of the data. The datasets were given to them. Selecting repeated column names can help them (closed issue #11), but storing keys in the profile should help even more.

table names should not be a critical limitation, we can enforce "Table1", "Table2", etc. Would order matter, if keys are included in the profile? It will do in stacking -- the id assigned to each chunk of data will change if the order in which tables are uploaded changes (this may or may not be a problem). Order should not matter in tidying or merging, I think.
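To make the stacking point concrete, here is a minimal Python sketch. It is not the app's code: `stack_tables`, the `chunk_id` column, the table names, and the sample values are all hypothetical. It assumes the id given to each chunk is simply the position of the table in the upload order, and shows that the same two tables get different ids when uploaded in a different order.

```python
def stack_tables(tables):
    """Stack (name, rows) pairs, tagging every row with a positional chunk id.

    Assumption: the chunk id is the 1-based upload position of the table.
    """
    stacked = []
    for position, (name, rows) in enumerate(tables, start=1):
        for row in rows:
            stacked.append({**row, "chunk_id": position, "source": name})
    return stacked


# Hypothetical tables and values, for illustration only.
census_a = ("census_a", [{"stem ID": "A1", "dbh-corrected": 10.2}])
census_b = ("census_b", [{"stem ID": "A1", "dbh-corrected": 11.0}])

first_run = stack_tables([census_a, census_b])
second_run = stack_tables([census_b, census_a])  # same tables, swapped order

# Map each source table to the chunk id it received in each run.
ids_first = {row["source"]: row["chunk_id"] for row in first_run}
ids_second = {row["source"]: row["chunk_id"] for row in second_run}
```

If the profile stored the upload order alongside the keys, the ids would stay stable across runs; otherwise they depend on what the user does at upload time.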

a profile-that-does-everything is different from a profile-that-maps-columns. Yes, it would require the same number of tables. Why is that a problem? The set of tables is a fundamental aspect of a database, as fundamental as the column names and units within the individual tables. If one uploads the profile upfront, it can inform the user about the tables that were uploaded at the time the profile was created.

I don't see a reason why the user cannot over-write the number of tables or anything in that input profile, the same way they can over-write the mapping of columns into the standard columns. For example, the user uploads the profile upfront, right at the start, and the app could read it and prompt:

This profile contains instructions to stack and merge 3 tables. At the time of the creation of the profile, these tables were:

  • a table with 12093 rows with these column names: "stem ID", "date", "dbh-corrected", ...
  • a table with 293 rows with these column names: "plot ID", "responsible-person", ...
  • a table with 44 rows with these column names: [etc]

or, if we do not enforce the table names, it could be something like:

  • a table with 12093 rows that was named "table_stem_info" with these column names: "stem ID", "date", "dbh-corrected", ... etc.

but the user could still declare "I have 4 tables" instead of 3, and then upload the extra table, over-write the keys if for some reason they updated their columns, or change table names, etc.
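The idea above could be sketched roughly as follows. This is only an illustration in Python under stated assumptions, not how the app stores profiles: the profile layout (a `tables` list with `name`, `n_rows`, `columns`), the functions `describe_profile` and `prepopulate`, and the table name "table_plot_info" are hypothetical. The row counts and column names are the ones from the example prompt.

```python
def describe_profile(profile):
    """Build the reminder text shown when a profile is uploaded first."""
    tables = profile["tables"]
    lines = [
        f"This profile contains instructions to stack and merge {len(tables)} tables. "
        "At the time of the creation of the profile, these tables were:"
    ]
    for table in tables:
        columns = ", ".join(f'"{name}"' for name in table["columns"])
        lines.append(
            f'  • a table with {table["n_rows"]} rows that was named '
            f'"{table["name"]}" with these column names: {columns}, ...'
        )
    return "\n".join(lines)


def prepopulate(profile, overrides=None):
    """Start from the values stored in the profile; anything the user sets wins."""
    settings = {
        "n_tables": len(profile["tables"]),
        "table_names": [table["name"] for table in profile["tables"]],
    }
    settings.update(overrides or {})
    return settings


# Hypothetical profile content (only two tables, to keep the sketch short).
profile = {
    "tables": [
        {"name": "table_stem_info", "n_rows": 12093,
         "columns": ["stem ID", "date", "dbh-corrected"]},
        {"name": "table_plot_info", "n_rows": 293,
         "columns": ["plot ID", "responsible-person"]},
    ]
}

reminder = describe_profile(profile)
# The user declares one more table than the profile recorded.
settings = prepopulate(profile, {"n_tables": 3})
```

The profile only supplies defaults here; the user's declaration overrides them, in the same way the column mapping can be over-written today.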

table names, keys, etc. will be in the same steps as now, but already pre-populated from the profile. It would be the same approach as we have now, and everything could be over-written if needed. There would be multiple steps where the user has to review and say "do this", then "do this", etc., it would not be just one button "do everything".

I do not see how this could imply fundamental changes in the way the app works. It would be mostly a matter of how information is stored in the profile, and where in the process that info is used to pre-populate things that we are asking the user to declare.

sorry for the long message, I thought more while I was writing it.

please share your thoughts or let me know if this could be better discussed in a call.

ValentineHerr commented 1 year ago

> Order should not matter in tidying or merging, I think.

The order does matter in merging, if the species and stem tables are swapped you won't have the same output.
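A small Python sketch of why swapping the two tables changes the result. It assumes the merge behaves like a left join, which is an assumption for illustration and not a statement about the app's code; `left_join`, the column names "species code" and "latin name", and all values are hypothetical. For brevity the helper keeps only the first match per key, whereas a full left join would repeat rows for multiple matches.

```python
def left_join(left, right, key):
    """Keep every row of `left`; add columns from the first matching `right` row."""
    lookup = {}
    for row in right:
        lookup.setdefault(row[key], row)  # remember only the first match per key
    return [{**lookup.get(row[key], {}), **row} for row in left]


# Hypothetical stem and species tables.
stems = [
    {"stem ID": "A1", "species code": "sp1"},
    {"stem ID": "A2", "species code": "sp1"},
    {"stem ID": "A3", "species code": "sp9"},  # no entry in the species table
]
species = [
    {"species code": "sp1", "latin name": "Hypothetica prima"},
    {"species code": "sp2", "latin name": "Hypothetica secunda"},
]

stems_then_species = left_join(stems, species, "species code")  # one row per stem
species_then_stems = left_join(species, stems, "species code")  # one row per species
```

With stems on the left the output has one row per stem; with species on the left it has one row per species, so stems A2 and A3 no longer appear.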

> I do not see how this could imply fundamental changes in the way the app works.

My main concern is being able to anticipate every situation, e.g. if the user does not want to upload the same number of tables, or if the column names in the tables are slightly different from the names they had when the profile was created, causing issues when pre-populating the merging and tidying sections, etc.
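One way to picture the column-name situation is a check that runs before anything is pre-populated. The Python sketch below is hypothetical and not part of the app: `check_stored_columns`, the report layout, the 0.8 similarity cutoff, and the uploaded column names are assumptions. It uses the standard-library `difflib.get_close_matches` to separate stored names that still exist, names that look renamed, and names that are gone.

```python
import difflib


def check_stored_columns(stored_columns, uploaded_columns, cutoff=0.8):
    """Compare column names saved in a profile with those of an uploaded table."""
    report = {"ok": [], "possibly_renamed": {}, "missing": []}
    for column in stored_columns:
        if column in uploaded_columns:
            report["ok"].append(column)
            continue
        # Closest uploaded name, if any is similar enough to suggest a rename.
        close = difflib.get_close_matches(column, uploaded_columns, n=1, cutoff=cutoff)
        if close:
            report["possibly_renamed"][column] = close[0]
        else:
            report["missing"].append(column)
    return report


# Stored names come from the example prompt; uploaded names are hypothetical.
stored = ["stem ID", "date", "dbh-corrected"]
uploaded = ["stem_ID", "date", "height"]

report = check_stored_columns(stored, uploaded)
```

Anything outside `ok` would have to be left blank or flagged for the user instead of silently pre-populated, which is the kind of extra handling the concern above is about.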

I am reluctant to introduce more places for bugs. It would help to have existing examples of people needing that feature.

> Team A works continuously on species identification or whatever, and passes an updated version of their database to Team B or larger network every year or every two years or every time someone wants to do a collaborative paper. Our users do not do the same from day to day, we cannot expect them to do the same from year to year.

Even in that hypothetical scenario, I do not see a problem if the user does not do exactly the same thing year to year. Each collaborative paper is different and works on a different set of data. The users may even have to use the app differently to meet the requirements of the new paper (e.g. maybe they didn't deal with tree codes before but now they have to). Plus, the data associated with a paper will be different (new data + new species identification) and should be provided with the paper.

My overall opinion is: the app is not a data management system. It is a tool for two+ teams to bring their data together for a particular project, at one point in time. The interactions with the app are not static and a user may need to select different things for each new project. The profiles simply help with the most tedious part, naming the columns.

gabrielareto commented 1 year ago

I agree in general. The app is not a data management system, it should be used for specific projects, everything can change at each new data federation effort. Whether this fits or not in a culture of data centralization (and power centralization) is still to be seen.

I think pre-populating as I described above is a general way to remind the user what they did the last time. This should help with manual reproducibility. And I agree, reproducibility is best if manual, in this particular application, because things can change for every project or collaboration. The same way the variable names are pre-populated and can be changed.

But, if you think this could cause bugs or more work than benefit, let's leave it as it is and wait for more specific input from the users.