Problem with ISRaD_Extra: lyr_14c, lyr_14c_SD, lyr_14c_sigma (and maybe others)

coreylawrence commented 4 years ago

It has come to my attention that there are some problems in the most recent version of the ISRaD extra data (v 1.3.4.2020-03-09) and maybe also in earlier versions. In my case, I am loading these data from the R-package. If my observations are correct, I would characterize this as an Urgent Issue.

First sign of trouble. When I load the ISRaD_extra version of the dataset and tally the non-NA values in the lyr_14c, lyr_fraction_modern, and lyr_14c_fill_extra columns, I get the same value of 4720 for all of those columns. I assume this number should be correct for the "fill_extra" column but I am not sure why it appears the original data columns have also been filled. For reference, the number of observations in non-extra version of the dataset are: lyr_14c = 3196 and lyr_fraction_modern = 3464

Second sign of trouble. When I look at the values in the data table (after flattening to the lyr level), I find that for some (all?) values calculated for lyr_14c_fill_extra are also being filled to the lyr_14c, lyr_14c_SD, and lyr_14c_sigma columns. To be clear, it seems the values calculated for the "fill_extra" column are also being filled to the three variables listed above. However, the same issue does not seem to be occurring in the lyr_fraction_modern (or associated) columns

So to summarize, if users are working with the ISRaD_extra version of the dataset but are looking at the lyr_14c_SD or lyr_14c_sigma values, they will see erroneous data. Furthermore, I don't think we should be filling any non-extra columns.

If I had to guess, it may be that the fill function is leveraging something akin to tidy::mutate_if command that is predicated on columns starting with or including "lyr_14c" But that is just a guess. Strangely, as I look more closely it appears that for some rows, the values are filled for lyr_14c_SD and lyr_14c_sigma are not always identical to the lyr_14c_fill_extra value.

bpbond commented 4 years ago

If in dealing with this there's an opportunity to expand the automated tests, I'm happy to write those.

jb388 commented 4 years ago

@coreylawrence 1) The nomenclature is somewhat misleading, but actually the variable "lyr_14c_fill_extra" is not filled by the function "ISRaD.extra.fill_14c". Rather, "ISRaD.extra.fill_14c" checks to see if the variable "xxx_14c" is NA, and if so, it is filled from the variable "xxx_fm" and "xxx_obs_date_y". The intention behind "lyr_14c_fill_extra" was to provide a space for filling in 14C data from "expert knowledge".

2) This is definitely an issue---i.e. erroneous filling of the "xxx_14c_sd" and "xxx_14c_sigma" variables. I checked the code and the same function is called to fill those variables as is called to fill "xxx_14c" (and for filling "xxx_fm"), so it makes sense that they are the same...

I'll work on fixing this. Thanks for catching it.

jb388 commented 4 years ago

@bpbond Thanks for the offer. I'll think about it!

coreylawrence commented 4 years ago

@jb388

With regard to (1), this is not my recollection of the how the lyr_14c_fill_extra column was to be used. To be clear, your saying that field is for manual entry of some (to be defined) 14c correction and the "ISRaD.extra.fill_14c actually fills values into the lyr_14c field?

I thought we were operating with the policy of never filling calculated values into "raw" data columns. If that is (was) the case, then we would need a home for the merged values, which was how I have been interpreting the lyr_14c_fill_extra column.

jb388 commented 4 years ago

@coreylawrence Filling in data when possible (using other data from the same study, e.g. 14C from FM and date) was one of the primary goals of creating ISRaD_extra. When we decided to create ISRaD_extra it was with the goal of preserving the integrity of the raw data in ISRaD, and providing a second version of the data with changes made (hence "extra").

@jb388

With regard to (1), this is not my recollection of the how the lyr_14c_fill_extra column was to be used. To be clear, your saying that field is for manual entry of some (to be defined) 14c correction and the "ISRaD.extra.fill_14c actually fills values into the lyr_14c field?

Yup. This is correct.

I thought we were operating with the policy of never filling calculated values into "raw" data columns. If that is (was) the case, then we would need a home for the merged values, which was how I have been interpreting the lyr_14c_fill_extra column.

coreylawrence commented 4 years ago

Well I think something has been lost in translation, because that is not exactly what we described in the paper. My interpretation and I think how we describe it in the paper, was that the "extra" version of the data would include extra columns that were calculated but those would just be appended to the raw data (not filled into existing columns).

From the paper: _(1) ISRaD_data, which includes only data entered from the original source studies and (2) ISRaD_extra, which includes the original source data as well as gap-filled and additional variables that have been calculated or filled from external geospatial data sources. Value-added data included in the current implementation of ISRaDextra are described in Table 2.

I can see that this is a little ambiguous but Table 2 specifies the creation of new columns with the filled data.

jb388 commented 4 years ago

Hmm, yes, the wording is ambiguous in the paper. My memory of this is that we discussed it extensively: I was adamant that we preserve the raw data in the base version of ISRaD and was initially leery of ISRaD_extra (devout convert now!), but I thought we agreed it would be less confusing to modify the existing variables rather than create new variables. My understanding of the agreed on procedure was that we would allow filling of the base variables only if they could be filled from the existing data. If any outside data was used, e.g. geospatial, or expert knowledge, then we would need to create a new variable for ISRaD_extra.

I think that the documentation (of the functions) and the ISRaD_Template_Info/ISRaD_Extra_Info documents are very clear about what they do. Personally I don't see any need for change*, but of course I'm open to any suggestions.

(*The second issue, i.e. with the wrong data filled for the xxx_14c_sd and xxx_14c_sigma variables, definitely needs to be addressed).

jb388 commented 4 years ago

Hmm, yes, the wording is ambiguous in the paper. My memory of this is that we discussed it extensively: I was adamant that we preserve the raw data in the base version of ISRaD and was initially leery of ISRaD_extra (devout convert now!), but I thought we agreed it would be less confusing to modify the existing variables rather than create new variables. My understanding of the agreed on procedure was that we would allow filling of the base variables only if they could be filled from the existing data. If any outside data was used, e.g. geospatial, or expert knowledge, then we would need to create a new variable for ISRaD_extra.

I think that the documentation (of the functions) and the ISRaD_Template_Info/ISRaD_Extra_Info documents are very clear about what they do. Personally I don't see any need for change*, but of course I'm open to any suggestions.

(*The second issue, i.e. with the wrong data filled for the xxx_14c_sd and xxx_14c_sigma variables, definitely needs to be addressed).

Obviously it's unfortunate that we didn't make this clearer in the paper, but too late for that. I do think that the wording is such that we aren't directly contradicting ourselves. However, we should definitely clarify this issue in our FAQs/website documentation.

coreylawrence commented 4 years ago

I agree, we can't change the paper now. I am sure we discussed this a couple different times but I am not remembering the same outcome that you are. To be fair, it is possible that I missed some of those discussions.

My preference would be to not distribute two versions of the dataset (data and extra) that include identically named variables but not identical data. That leaves too much room for confusion and unintentional problems. An option that would split the difference is to just append a "_filled" to the end of the lyr_14c and lyr_fraction_modern columns in the "extra" data. I think that would be straightforward and avoid any confusion.

Lets wait for others to weigh in and then make a decision based on that. I'm flagging @crlsierra here because he always has a helpful perspective on such issues.

jb388 commented 4 years ago

I see your point about the potential confusion, but while the data would be different between the two variables, it would be a difference of something versus nothing (NA), not two different values. I agree it would be good to get more input. Thanks for bringing up the issue.

crlsierra commented 4 years ago

I don't have strong opinions about this issue. Since the functions that fill in missing values are well documented, users should be able to find out easily how the new values were obtained. You probably can add a note in the NEWS file of the R package, and maybe also in the README of the repository. I don't see the need to make a major change, only a clarification of users of ISRaD_extra.

jb388 commented 3 years ago

This issue has been fixed.

International-Soil-Radiocarbon-Database / ISRaD

Problem with ISRaD_Extra: lyr_14c, lyr_14c_SD, lyr_14c_sigma (and maybe others) #236