annakrystalli / seabirddietDB

Seabird Diet (stomach content) Database collected around the British Isles (1933 - 2017)
http://annakrystalli.me/seabirddietDB/

Decimal format of data in year column #20

Closed by annakrystalli 5 years ago

annakrystalli commented 5 years ago

Hi @ruedinager

Got one last question regarding data in the column year.

At the moment you have values such as 2012.5. I appreciate you are aiming for the midpoint of the sampling duration, but it is quite unusual to store year data as decimal numbers. Indeed, it is being thrown up as an error during metadata creation, because year is not accepted as a unit for a numeric variable (i.e. it is classed as a Date variable rather than a time variable). So I'm being forced to create an ad hoc unit of measurement for it, which feels kind of wrong.

I had a little think about it and my suggestion would be to convert that column to YYYY-MM-DD date format instead. So for an entry with start year 2001 and end year 2003, the start would be converted to 2001-01-01, the end to 2003-12-31, and the midpoint calculated arithmetically (which R handles nicely) and reported in YYYY-MM-DD format, i.e. 2002-07-02. Note that this would only affect the column year; startyear and endyear would remain in YYYY format.
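The midpoint arithmetic described above can be sketched as follows. This is only an illustration, written in Python rather than the R mentioned in the thread; the function name `midpoint_date` is hypothetical and not part of the seabirddietDB code.

```python
from datetime import date


def midpoint_date(start_year: int, end_year: int) -> date:
    """Return the calendar midpoint between 1 Jan of start_year and 31 Dec of end_year.

    Hypothetical helper for illustration only.
    """
    start = date(start_year, 1, 1)
    end = date(end_year, 12, 31)
    # Halve the span in whole days (floor division) and add it to the start date.
    return start + (end - start) // 2


# Example from the thread: start year 2001, end year 2003.
print(midpoint_date(2001, 2003).isoformat())  # 2002-07-02
```

Floor division keeps the result on a whole day when the span has an odd number of days.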

I know this artificially increases the resolution of the data in that column but we are already doing that by calculating decimals. What do you think? @tomjwebb any thoughts on this?

ruedinager commented 5 years ago

Hi

Yes, sorry, the studies that only reported values pooled across years were annoying, and some that I went back to simply could not specify yearly data, so it had to remain as a range (hence the start and end year columns). The way I have dealt with it so far was simply to use this decimal solution, which you get in some cases (depending on how many years were covered). I appreciate that in your much better approach this causes problems. And indeed, half years really don't make sense. Pragmatically, one could argue that .5 could be rounded up. Given the temporal scale covered by the data (ca. 40 years for most species) and the relatively small number of studies this applies to (although there is one guillemot study covering many sites where they can no longer assign which colony was sampled in which year), I can't imagine that the rounding makes a difference. Maybe if it can simply be identified from a separate column whether it is a single-year study or spans x years (study duration), that might be all the user needs?

Thank you very much,
Ruedi

annakrystalli commented 5 years ago

Thanks for the fast feedback Ruedi!

OK, yes, that sounds like a workable approach. And sure, I can easily create a column (e.g. multiyear = TRUE or FALSE) for easy filtering and clear flagging. I'll go for that then.
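A minimal sketch of the agreed approach, assuming each record carries integer start and end years: round a .5 midpoint up to a whole year and flag multi-year studies. It is written in Python for illustration only. The function name `summarise_years` and the `duration` field are assumptions; only the `multiyear` flag is named in the thread.

```python
def summarise_years(start_year: int, end_year: int) -> dict:
    """Derive a whole-number year plus a multi-year flag (hypothetical helper)."""
    # Integer arithmetic rounds a .5 midpoint up, e.g. 2012.5 becomes 2013.
    # Python's built-in round() rounds halves to the nearest even number,
    # so round(2012.5) gives 2012, which is not the rule proposed here.
    year = (start_year + end_year + 1) // 2
    return {
        "year": year,
        "multiyear": start_year != end_year,
        "duration": end_year - start_year + 1,  # assumed extra column, in years
    }


print(summarise_years(2012, 2013))
print(summarise_years(2010, 2010))
```

Under this rule a 2012 to 2013 study is reported as year 2013 with the flag set, while a single-year study keeps its year and is not flagged.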