Alliance-for-Tropical-Forest-Science / DataHarmonization

Code to run the data harmonization app and support cross-site analysis
https://alliance-for-tropical-forest-science.github.io/DataHarmonization/

robust processing of dates #37

Open gabrielareto opened 1 year ago

gabrielareto commented 1 year ago

Dates seem to be a recurrent problem. There are mixes of formats within tables, and especially between different tables that users want to stack (different censuses). A sub-routine that handles dates in different formats seems a very useful addition.

ValentineHerr commented 1 year ago

> between different tables that users want to stack (different censuses).

In this case, the user should use the app to bring their different censuses to the same format and then stack them. That would be easy once they have their profile: they would just need to change the date format for each census.

> There are mixes of formats within tables

I feel that this is for the user to deal with ahead of time. Resolving it on our end would require risky assumptions, e.g. for dates with days between 1 and 12 (where day and month can be swapped) or when the year has only 2 digits.
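To make the risk concrete, here is a minimal Python sketch (the app itself is not written in Python, and `interpretations` is a hypothetical helper, not part of DataHarmonization). It shows one string parsing cleanly under three different formats, each giving a different date.

```python
from datetime import datetime


def interpretations(raw, formats):
    """Return {format: date} for every format under which `raw` parses."""
    results = {}
    for fmt in formats:
        try:
            results[fmt] = datetime.strptime(raw, fmt).date()
        except ValueError:
            pass  # this format does not fit the string
    return results


# Day, month and two-digit year are all <= 12, so nothing rules a format out.
for fmt, parsed in interpretations(
    "03-04-05", ["%d-%m-%y", "%m-%d-%y", "%y-%m-%d"]
).items():
    print(fmt, "->", parsed)
```

The two-digit year adds a second, silent assumption: Python's `%y` maps 00-68 to 2000-2068 and 69-99 to 1969-1999, which may not match how an old census recorded its years.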

gabrielareto commented 1 year ago

We should avoid making assumptions about date formats, I agree.

But we are wrong in our assumption that users have one, and just one, date format. Dates are very inconsistent in real datasets, between tables but also within tables. Many of our users have aggregated datasets compiled over decades by different teams, field crews, etc. I think this is something our users won't easily solve by themselves, so we should try to offer something more robust.

Is it too difficult to allow the user to pick multiple date formats from a dropdown menu? If we offer options, we should add examples, like "yyyy-mm-dd as in 1999-12-31 for the last day of 1999". If "none", then ask the user to write one or more, e.g. "dd-mm-yy, dd-mm-yyyy".
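A rough sketch of how the multiple-format idea could work, in Python for illustration only. `FORMAT_MENU` and `parse_with_formats` are hypothetical names, and the mapping from user-facing labels to `strptime` patterns is an assumption about how a dropdown might be wired up.

```python
from datetime import datetime

# Hypothetical dropdown: user-facing label -> strptime pattern.
FORMAT_MENU = {
    "yyyy-mm-dd": "%Y-%m-%d",  # as in 1999-12-31
    "dd-mm-yyyy": "%d-%m-%Y",  # as in 31-12-1999
    "dd-mm-yy": "%d-%m-%y",    # as in 31-12-99
    "dd/mm/yyyy": "%d/%m/%Y",  # as in 31/12/1999
}


def parse_with_formats(value, labels):
    """Try each user-selected format in order; return a date or None."""
    for label in labels:
        try:
            return datetime.strptime(value.strip(), FORMAT_MENU[label]).date()
        except ValueError:
            continue  # this format does not fit, try the next one
    return None  # nothing matched: leave it for the user to review


chosen = ["dd-mm-yy", "dd-mm-yyyy", "yyyy-mm-dd"]
for raw in ["31-12-99", "31-12-1999", "1999-12-31", "unknown"]:
    print(raw, "->", parse_with_formats(raw, chosen))
```

Because the first matching format wins, the order the user picks matters whenever two selected formats can both fit the same string. Returning `None` instead of guessing keeps the decision with the user.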

We should also think about how to accommodate things like "01Nov1999".
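For strings like "01Nov1999", one possible approach is an explicit month-name table. This is a Python sketch with hypothetical names, assuming English three-letter abbreviations. `strptime` with `%d%b%Y` would also parse this string, but `%b` depends on the active locale, so a fixed table behaves more predictably across machines.

```python
import re
from datetime import date

# Assumption: English three-letter month abbreviations only.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Day, month abbreviation, four-digit year, with an optional separator.
PATTERN = re.compile(r"^\s*(\d{1,2})[\s\-/]?([A-Za-z]{3})[\s\-/]?(\d{4})\s*$")


def parse_day_monthname_year(value):
    """Parse strings such as '01Nov1999' or '1-nov-1999'; None if unusable."""
    match = PATTERN.match(value)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = MONTHS.get(month_name.lower())
    if month is None:
        return None  # unknown abbreviation
    try:
        return date(int(year), month, int(day))
    except ValueError:
        return None  # e.g. 31 February


print(parse_day_monthname_year("01Nov1999"))
```

Abbreviations in other languages (for example "Dez" or "Ene") would need their own entries in the table.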

I do not think it would be too difficult for us to process dates that could have 2 or 3 formats, but I have not looked into how that would work in practice. Ambiguous cases would have to be resolved by looking at the surrounding values.
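As a sketch of what processing 2 or 3 formats might look like in practice (Python for illustration; `classify_dates` is a hypothetical helper), each value can be tried against every candidate format and sorted into three groups: resolved, ambiguous, or unparsed. Nothing is guessed here; ambiguous values are only flagged, and resolving them from neighbouring rows is left out.

```python
from datetime import datetime


def classify_dates(values, formats):
    """Split values into (resolved, ambiguous, unparsed) given candidate formats."""
    resolved, ambiguous, unparsed = {}, {}, []
    for value in values:
        hits = set()
        for fmt in formats:
            try:
                hits.add(datetime.strptime(value, fmt).date())
            except ValueError:
                pass  # this format does not fit the value
        if len(hits) == 1:
            resolved[value] = hits.pop()      # every fitting format agrees
        elif hits:
            ambiguous[value] = sorted(hits)   # formats disagree: flag for review
        else:
            unparsed.append(value)            # no candidate format fits
    return resolved, ambiguous, unparsed


column = ["25/12/1999", "03/04/2005", "05/05/2005", "1999-12-31", "n/a"]
formats = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d"]
resolved, ambiguous, unparsed = classify_dates(column, formats)
print("resolved:", resolved)
print("ambiguous:", ambiguous)
print("unparsed:", unparsed)
```

In this example "25/12/1999" resolves because 25 cannot be a month, "05/05/2005" resolves because both readings give the same date, and "03/04/2005" stays ambiguous.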

It may be worth searching for existing solutions to these problems, such as specialized packages that make reasonable guesses. This must have been a problem for a lot of people before.