forc-db / ERL-review

Analyses supporting Environmental Research Letters review paper
Creative Commons Attribution 4.0 International

handle SRDB duplicates #34

Closed: teixeirak closed this issue 4 years ago

teixeirak commented 4 years ago

I think we're duplicating most of the GPP, Reco, and NEE (NEP) records (and probably some others) across both the original ForC and the SRDB import. I need to look into how to deal with this. It may require dropping SRDB records that are close to ForC records, as we don't have time for a careful review of all the potential duplicates.

ValentineHerr commented 4 years ago

Looking into this. There are 753 records of GPP, Reco, or NEP coming from SRDB. 62 of them seem to have some duplicate issue resolved (D.precedence 0 or 1). I do see values that are close to other ForC records, but they are not flagged as duplicates because they have a different site or plot name, or a different stand age...

teixeirak commented 4 years ago

Right... that's the big problem. For NEE, NEP, and Reco, I think we should drop any from SRDB that are within 1 degree lat/long of what's already in ForC (and flag for duplicate review). Eddy flux sites just aren't that common...

Other variables are tricky, but I'm sure that problem occurs sometimes for them. Not sure the best way to handle that. Maybe the same?
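The rule proposed above can be sketched as a simple degree-box filter. This is only an illustration with hypothetical keys ("lat", "lon", "variable"); the real ForC column names and matching logic differ:

```python
# Minimal sketch of the proposed rule, assuming each record is a dict
# with hypothetical keys "lat", "lon", and "variable".
def flag_near_forc(srdb_records, forc_records, deg_threshold=1.0):
    """Return indices of SRDB records that fall within deg_threshold
    degrees lat/long of a ForC record for the same variable.
    Uses a simple bounding-box check in degrees, not true distance."""
    flagged = []
    for i, s in enumerate(srdb_records):
        for f in forc_records:
            if (f["variable"] == s["variable"]
                    and abs(f["lat"] - s["lat"]) <= deg_threshold
                    and abs(f["lon"] - s["lon"]) <= deg_threshold):
                flagged.append(i)
                break
    return flagged
```

A degree box is cheap but crude: the threshold was later tightened to 0.1 degrees (~11 km) in this thread.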

ValentineHerr commented 4 years ago

1 degree lat/long is ~110 km... that is quite large, isn't it? When we look for potential duplicate sites we use 5 km.

Should I look only at ForC prior to the GROA import, or also at GROA measurements?

I think I'll write a temporary script that just "flags as suspicious" those SRDB records because it is too complicated to incorporate that in the duplicate system. They won't be brought into ForC_simplified.

Also, FYI, I am removing any measurement without lat/long from ForC_simplified. (I thought I was already doing that... it shouldn't change the results, though, because I think those are ignored later anyway.)
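As a sanity check on the degree-to-kilometre arithmetic above (1 degree of latitude is ~111 km, so 0.1 degree is ~11 km), here is a minimal haversine sketch; this is not the project's script, just a standalone illustration:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# One degree of latitude at the equator: roughly 111 km.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 1))
```

Note that a degree of longitude shrinks toward the poles, so a fixed degree threshold is wider east-west at the equator than at high latitudes.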

ValentineHerr commented 4 years ago

If we include GROA when comparing "ForC+GROA" to "SRDB", and looking at ForC_simplified: a 5 km threshold would remove 1758 SRDB records and 110 km would remove 3231.

Here is an idea of the number of records per variable (for the 110 km threshold): [image: records per variable]

That would leave 3449 SRDB records (with the 110 km threshold).

teixeirak commented 4 years ago

You’re right, 1 degree lat/long is probably too much. How about 0.1 degrees? We can adjust later. I want to check how large the variation gets among real duplicates.

Let’s exclude just for SRDB, but please flag those in GROA as suspicious as well. Please create a new field for this flag, as the current field with that name is limited to records where we’ve gone back to the original publication (if possible) and think that their values are wrong.

ValentineHerr commented 4 years ago

Should I look at GROA measurements when flagging SRDB, and vice versa? Or should I only look at "original ForC"?

teixeirak commented 4 years ago

Hmmm, yes, let's look at GROA when flagging SRDB. I don't think there will be a ton of overlap, though. Let's keep a record of which records are potential conflicts.

ValentineHerr commented 4 years ago

So instead of creating a new column, I created a file for GROA and a file for SRDB that list the ForC measurement IDs when there are records for the same variable in an 11 km cluster (~0.1 degree), comparing ForC+SRDB against GROA and ForC+GROA against SRDB, respectively. For example, if within an 11 km cluster there are 2 GPP measurements from SRDB and 5 GPP measurements from ForC (including GROA), the table is populated with 10 rows: the first column holds the IDs of the ForC measurements, each repeated twice, and the second holds the IDs of the SRDB measurements (the IDs they have in ForC), each repeated 5 times. It is not ideal, but for now it will do. And for the record, this script generates these files.

Then, when creating ForC_simplified, I load the SRDB file mentioned above and remove any measurement ID that appears in the second column.

I did this regardless of the variable (i.e., not only GPP, Reco, and NEP).
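The pairing-file construction described above is just a cross join of the ForC and SRDB IDs found in each cluster. A minimal sketch, with hypothetical function names and plain string IDs standing in for the real measurement IDs:

```python
from itertools import product

def pair_cluster(forc_ids, srdb_ids):
    """Cross-join the ForC and SRDB measurement IDs found in one
    11 km / same-variable cluster: one row per (ForC, SRDB) pair."""
    return [(f, s) for f, s in product(forc_ids, srdb_ids)]

def drop_paired_srdb(measurement_ids, pairs):
    """Mimic the ForC_simplified step: remove any measurement ID
    that appears in the second (SRDB) column of the pairing file."""
    to_drop = {s for _, s in pairs}
    return [m for m in measurement_ids if m not in to_drop]
```

With 5 ForC IDs and 2 SRDB IDs in a cluster, this yields the 10 rows described in the comment above: each ForC ID appears twice and each SRDB ID five times.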

teixeirak commented 4 years ago

Thanks! I want to come back to look at how well this worked more carefully later. Not sure if I'll have time before we submit.

teixeirak commented 4 years ago

@ValentineHerr, I want to make sure I have a clear understanding of everything that's done to handle the duplicates. I'm currently drafting an appendix on this based on my current understanding, and will then ask you to review.

ValentineHerr commented 4 years ago

Potential duplicates were defined as geographically proximate records for stands of similar age with the same variable measured in the same year (if known).

I believe we can say "of the same age (if known)". This is relevant to duplicate records, not sites, which is what the section name and the next sentence are about. Maybe the section should be renamed to "Detecting and reconciling duplicate records"?

In cases where site and plot names or reported age differed, our script detected potential duplicate sites that were geographically proximate.

If you are talking about sites, we handle them independently of records. We look at sites within 5 km of each other and then decide whether they need to be merged (but I think there are a lot that have not been merged, or decided on). If you are talking about records, this is what I added yesterday: any record coming from SRDB is removed if it is within 11 km of a ForC_prior or GROA measurement of the same variable.

In cases where a single location -- generally an established research site where multiple investigators have worked -- contained multiple plots in nested or unknown relation to one another, we grouped multiple sites into a "supersite" (e.g., Harvard Forest, Barro Colorado Island, Pasoh Forest Reserve), and duplicates within a supersite were handled in the same way as records with matching site and plot names. (VALENTINE, IS THIS ACCURATE?)

Hmm... unfortunately I don't think I ever got to using supersites... I thought I did but can't find any evidence of it... unless the D.precedence and all was edited by hand while the supersites were assigned to the records...

For suspected duplicate groups that were not flagged as supersites and had not yet been reviewed, we retained only one potential duplicate record, assigning precedence as follows: (1) original GROA record(s), (2) record(s) in ForC prior to SRDB and GROA import, (3) SRDB record(s).

I am not sure where this rule would be applied. For now, all GROA data that was not identified as duplicate prior to the import has been imported and is considered independent (except maybe a handful that need review), regardless of how far the records are from ForC_prior or SRDB records. Only SRDB records are removed if they are within 11 km of a ForC_prior or GROA record.

ValentineHerr commented 4 years ago

Duplicates are a nightmare... and the script that IDs them is running out of steam as ForC gets bigger...

ValentineHerr commented 4 years ago

I pushed everything based on the new rules (mentioned in this issue).

teixeirak commented 4 years ago

Thanks! Reviewing now.

teixeirak commented 4 years ago

We've done this as well as possible for now.