forc-db / ForC

Global Forest Carbon Database
https://forc-db.github.io/
Creative Commons Attribution 4.0 International

flag potential duplicates in ForC_simplified #210

Open ValentineHerr opened 4 years ago

ValentineHerr commented 4 years ago

1/ create age group 0-10-50-100-999

groups will be:
- [0-10]
- ]10-50]
- ]50-100]
- ]100-999]
- unknown (NA, NAC, NI...)
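A minimal sketch of this binning in Python, assuming stand age arrives either as a number or as one of the missing-value codes listed above. The helper name `age_group`, the label strings, and the choice to treat out-of-range ages as unknown are illustrative assumptions, not taken from the ForC scripts.

```python
import math

# Codes treated as "unknown". NA, NAC and NI come from the outline above;
# any further codes would have to be added here.
MISSING_CODES = {"NA", "NAC", "NI", ""}

def age_group(stand_age):
    """Return the age-group label for one stand age (hypothetical helper)."""
    if stand_age is None or str(stand_age).strip() in MISSING_CODES:
        return "unknown"
    try:
        age = float(stand_age)
    except (TypeError, ValueError):
        return "unknown"
    # Assumption: NaN and ages outside 0-999 are treated as unknown.
    if math.isnan(age) or age < 0 or age > 999:
        return "unknown"
    if age <= 10:
        return "[0-10]"
    if age <= 50:
        return "]10-50]"
    if age <= 100:
        return "]50-100]"
    return "]100-999]"
```

The upper bound of each interval is inclusive, matching the `]a-b]` notation, so an age of exactly 10 falls in the first group.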

2/ create year

take date, if any, start date otherwise, unknown if none is known
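As a rough illustration of that fallback, the sketch below assumes each record is a dict with optional "date" and "start.date" entries holding decimal years (for example 2005.5) or a missing-value code. The function names and the missing-code set are assumptions made for the example.

```python
UNKNOWN = "unknown"
MISSING_CODES = {"NA", "NAC", "NI", ""}

def _to_year(value):
    """Parse a value into an integer year, or return None if it is missing."""
    if value is None or str(value).strip() in MISSING_CODES:
        return None
    try:
        # Assumption: dates are decimal years, so truncating gives the year.
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None

def record_year(record):
    """Use "date" if present, else "start.date", else "unknown"."""
    for field in ("date", "start.date"):
        year = _to_year(record.get(field))
        if year is not None:
            return year
    return UNKNOWN
```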

3/ group by variable, geographic_area, year and age group

considering unknown year as its own group, and unknown age group as its own group too
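One way to sketch this grouping, assuming each record is a dict that already carries "year" and "age_group" values, with the string "unknown" used as an ordinary value so that unknowns form their own groups. The key names mirror the wording of the outline and may differ from the actual column names in ForC_simplified.

```python
from collections import defaultdict

def group_records(records):
    """Group records by (variable, geographic_area, year, age_group)."""
    groups = defaultdict(list)
    for rec in records:
        key = (
            rec["variable"],
            rec["geographic_area"],
            rec["year"],        # an int year or the string "unknown"
            rec["age_group"],   # a label such as "]10-50]" or "unknown"
        )
        groups[key].append(rec)
    return dict(groups)
```

Because "unknown" is just another key value, two records with an unknown year only land in the same group if their variable, geographic area and age group also match.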

4/ in each group, flag with these rules

teixeirak commented 4 years ago

> * if only SRDB: keep all (or do the same as if only ForC below?)

no, please apply the same rules as for ForC only. I'll edit the outline above accordingly.

ValentineHerr commented 4 years ago

grouping by year: @teixeirak, how should I handle "date", "start.date" and "end.date"? Take "start.date" as the "date"?

teixeirak commented 4 years ago

yes

ValentineHerr commented 4 years ago

@teixeirak, there are a few missing citation IDs (see below). Should I just take the date of loaded.from for this task?

image

ValentineHerr commented 4 years ago

actually, loaded.from doesn't exist in ForC_simplified.... can we fill in citation ID in the measurements table? that would make my life easier...

teixeirak commented 4 years ago

> @teixeirak, there are a few missing citation IDs (see below). Should I just take the date of loaded.from for this task?

sure.

teixeirak commented 4 years ago

> actually, loaded.from doesn't exist in ForC_simplified.... can we fill in citation ID in the measurements table? that would make my life easier...

Let's just put Taylor under citation ID.

Please note that I just pushed some changes to the data (resolving root biomass outlier), so be sure to pull the most recent version.

ValentineHerr commented 4 years ago

ok, thanks!

ValentineHerr commented 4 years ago

I made the change in the citation ID so don't forget to pull if you are editing the file more.

ValentineHerr commented 4 years ago

I pushed a new ForC_simplified with a new "suspected.duplicate" column. Let me know what you think.

It looks like it will remove 5911 of the non-managed, non-disturbed, non-no.history.info records (but there are >10000 that are flagged).

I'll work on updating the figures.

teixeirak commented 4 years ago

Awesome! I think we've found a good solution.

teixeirak commented 4 years ago

@ValentineHerr, based on my current understanding of the process, we're potentially giving preference to GROA records with no info on veg.type, and thereby losing data altogether. For the next round, we may try to address that.

teixeirak commented 4 years ago

We could also refine this to compare the estimates: true duplicates should be the same or similar.
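A possible sketch of that refinement, assuming the records in one group each carry a numeric estimate under a "mean" key. The key name, the function name `similar_pairs`, and the 5% relative tolerance are all placeholders that would need to be chosen for the real data.

```python
import math
from itertools import combinations

def similar_pairs(group, rel_tol=0.05):
    """Return index pairs within one group whose estimates agree within rel_tol."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(group), 2):
        # math.isclose compares the difference against rel_tol times the
        # larger of the two absolute values.
        if math.isclose(a["mean"], b["mean"], rel_tol=rel_tol):
            pairs.append((i, j))
    return pairs
```

Records that share a group but whose estimates differ by more than the tolerance would then be kept as distinct measurements rather than flagged.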