identify records with potential site duplicates

teixeirak commented 5 years ago

@beckybanbury, as discussed in person...

The field potential_duplicate_group in the SITES table identifies potential site duplicates. Let's flag sites with potential duplicates that appear in your analysis. Specifically, we're concerned about instances where the same variable is recorded twice for separate sites that are flagged as potential duplicates.

beckybanbury commented 5 years ago

@teixeirak are we specifically concerned about instances where the same variable is recorded twice, with the same stand age/date for both records? I've found instances where there are potential site duplicates, but they have different stand ages (so are potentially from different years or different plots within the same site?) or years recorded. In these cases, they probably aren't duplicate measurements, so the question is how we deal with measurements across different ages/years.

So far I've found very few duplicate sites coming up in my analysis - I think that this isn't too much of a concern for my analysis specifically.

beckybanbury commented 5 years ago

@teixeirak I think the potential_duplicate_group field is done by a script; do you mean that I should flag sites in the potential_duplicate_manual field?

teixeirak commented 5 years ago

Great news that there seem to be few duplicates! We want to look at potential_duplicate_group, and also confirmed.unique. If confirmed.unique=1 for both of the potential duplicates, we know they are independent sites. We are only concerned about records that would be duplicates if they were at the same site--i.e., same variable, same year. We will need to flag and resolve those duplicates.
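A minimal sketch of that check, for illustration only. It assumes the SITES rows are available as a dict keyed by site name and the measurement records as a list of dicts. The `site`, `variable`, and `year` keys, the site names, and the function name are hypothetical, not ForC's actual schema or scripts. It also assumes that when a group has more than two sites, all of them must have confirmed.unique=1 to count as independent.

```python
from collections import defaultdict

def flag_duplicate_records(sites, measurements):
    """Return (key, site names) pairs for records that may be duplicates.

    A key is (potential_duplicate_group, variable, year). A key is flagged
    when at least two distinct sites contribute records to it and not all
    of those sites have confirmed.unique == 1.
    """
    buckets = defaultdict(list)
    for rec in measurements:
        group = sites[rec["site"]]["potential_duplicate_group"]
        if group is None:
            continue  # site is not part of any potential duplicate group
        buckets[(group, rec["variable"], rec["year"])].append(rec)

    flagged = []
    for key, recs in buckets.items():
        names = {r["site"] for r in recs}
        if len(names) < 2:
            continue  # same variable/year comes from one site only
        if all(sites[n]["confirmed.unique"] == 1 for n in names):
            continue  # sites are confirmed independent
        flagged.append((key, sorted(names)))
    return flagged

# Hypothetical example data (not real ForC records).
sites = {
    "A_forc": {"potential_duplicate_group": 1, "confirmed.unique": 0},
    "A_srdb": {"potential_duplicate_group": 1, "confirmed.unique": 0},
    "B1": {"potential_duplicate_group": 2, "confirmed.unique": 1},
    "B2": {"potential_duplicate_group": 2, "confirmed.unique": 1},
    "C": {"potential_duplicate_group": None, "confirmed.unique": 0},
}
measurements = [
    {"site": "A_forc", "variable": "NPP", "year": 2005},
    {"site": "A_srdb", "variable": "NPP", "year": 2005},
    {"site": "A_srdb", "variable": "NPP", "year": 2010},
    {"site": "B1", "variable": "GPP", "year": 2001},
    {"site": "B2", "variable": "GPP", "year": 2001},
    {"site": "C", "variable": "NPP", "year": 2005},
]

result = flag_duplicate_records(sites, measurements)
print(result)
```

Here only the 2005 NPP records from the two unconfirmed sites are flagged. The 2010 record comes from a single site, and the B1/B2 pair is skipped because both sites are confirmed unique.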

We're also somewhat (but less) concerned about instances where we have the same variable measured in different years. In this case, geographic.area but not plot would be correctly represented by the random effects. This is less critical.

I don't really remember the logic behind creating the potential_duplicate_manual field, but we're not really using it.

teixeirak commented 5 years ago

@beckybanbury, let's resolve this before we get too deep into interpretation. I doubt results will change much, but we don't want to have to redo the work of reviewing/interpreting results.

beckybanbury commented 5 years ago

The plots that I identified as being potential duplicates are Pasoh, Teshio, Bonanza/BNZ, Wayqecha, and Nouragues (plots that have entries from both SRDB and the original ForC database). I'm still unclear about how I need to deal with these sites, though.

teixeirak commented 5 years ago

Pasoh is fixed.

teixeirak commented 5 years ago

Teshio is fixed.

teixeirak commented 5 years ago

Wayqecha was previously fixed. (None of these will be fully fixed until all the scripts are re-run to update PLOTS, ForC_simplified, etc.)

teixeirak commented 5 years ago

Nouragues is fixed.

teixeirak commented 5 years ago

Bonanza is a big job! I merged Bonanza/BNZ sites 5A, 5C, 5D, and also discovered a number of incorrect values among these records.

teixeirak commented 5 years ago

As far as I can tell, Bonanza is now reconciled, which means that all of these sites should be fixed. We need to re-run the script to make plots and ForC_simplified.

beckybanbury commented 5 years ago

I've now updated plots and ForC_simplified from the script.

beckybanbury commented 5 years ago

@teixeirak in ForC there are 5 Bonanza measurements that don't have a mean value - do you know why that's happened?

I've also noticed that when I run ForC_simplified it is adding a lot of NAs into the mean column; I'm not sure why this is (hoping it is related to the ForC measurements that don't have a value).
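One way to narrow this down is to count the records with a missing mean per site, before and after running the simplification step. This is a rough sketch, assuming the records are loaded as a list of dicts. The `site` and `mean` keys and the example values are hypothetical, and the function is not part of the ForC scripts.

```python
import math
from collections import Counter

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_mean_by_site(records):
    """Count records per site whose 'mean' value is missing."""
    return Counter(r["site"] for r in records if is_missing(r.get("mean")))

# Hypothetical example data (not real ForC records).
records = [
    {"site": "site_x", "mean": 12.5},
    {"site": "site_x", "mean": None},
    {"site": "site_x", "mean": float("nan")},
    {"site": "site_y", "mean": 3.1},
]

missing = missing_mean_by_site(records)
print(missing)
```

If the sites with missing values in the simplified output match the sites with missing values in the input, the NAs are probably being carried through rather than introduced by the script.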

teixeirak commented 5 years ago

I'll check as soon as I get a chance.

teixeirak commented 5 years ago

That should be fixed. You will need to re-run the scripts, including plots (there was an extra one in there by mistake).

Also, if you didn't already do this, we need to re-run the script that identifies and deals with duplicates within Measurements.

teixeirak commented 5 years ago

Done.

forc-db / Global_Productivity

identify records with potential site duplicates #30