forc-db / GROA

This repository houses data and code for the Global Reforestation Opportunity Assessment (GROA) led by Susan Cook-Patton of the Nature Conservancy.
Creative Commons Attribution 4.0 International

Duplicated site.id + link between site.id and site.sitename #17

Closed ValentineHerr closed 4 years ago

ValentineHerr commented 5 years ago

I am creating a new issue for site.id, site.name and plot.id so that it is easier to find later. Sorry it is a long one, but it would be great if both @CookPatton and @teixeirak could have a look and let me know whether the solutions I suggested in bold below seem right to you or whether I am misinterpreting something. (Also let me know if my descriptions are not clear.)...

Here are the different issues that I identified:

FIRST - in sitesf.csv, there are 31 duplicated site.id, of those:

SECOND - in sitesf.csv there are 11 pairs or trios of sites with different site.id but same site.sitename, of those:

THIRD - I realized that site.sitename in sitesf.csv is, counter-intuitively, not the same as site.sitename in nonsoil_litter_CWD.csv; this means that:

I'll make all of the fixes in my code so that I am not messing with @CookPatton's data, BUT @CookPatton, if you think you need to change something on your end, please let me know when you do it so that I can remove the piece of code that will then become irrelevant and may create errors.

I hope that all of this will make all of the problems I am getting disappear...
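For anyone wanting to reproduce the first two checks, here is a minimal sketch using only the Python standard library. It assumes sitesf.csv has `site.id` and `site.sitename` columns, as discussed in this issue; the inline rows and the helper names (`load_rows`, `duplicated_site_ids`, `names_shared_across_ids`) are made up for illustration and are not taken from the repository.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for sitesf.csv; these rows are invented for illustration.
SITES_CSV = """site.id,site.sitename
1,Alpha Forest
1,Alpha Forest
2,Beta Ridge
3,Beta Ridge
4,Gamma Valley
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def duplicated_site_ids(rows):
    """Return the site.id values that appear on more than one row."""
    counts = defaultdict(int)
    for row in rows:
        counts[row["site.id"]] += 1
    return sorted(site_id for site_id, n in counts.items() if n > 1)

def names_shared_across_ids(rows):
    """Map each site.sitename to its site.id values when it has more than one."""
    ids_by_name = defaultdict(set)
    for row in rows:
        ids_by_name[row["site.sitename"]].add(row["site.id"])
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}

rows = load_rows(SITES_CSV)
dup_ids = duplicated_site_ids(rows)
shared_names = names_shared_across_ids(rows)
print(dup_ids)
print(shared_names)
```

Reading the real file would only require replacing `io.StringIO(text)` with an open file handle; whether each flagged pair is a true duplicate still has to be decided case by case.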

CookPatton commented 5 years ago

@ValentineHerr. First of all, wow - thanks for the thorough rundown. I only have time for a quick response here. Some will require a bit more digging.

(1) Agree that those with same geolocation but different elevation etc should be treated as a single site.

(2) Agree that those with same geolocation but different names should be a single site. If you need to change the site name to be a single one, that's fine and please do adjust on your end.

(3) For the ones where the geolocations do not match, I would still keep 293, 2218, 2219 as a single site. I went back to my notes and I had to make judgement calls and adjustments. For example, for site.id 293 my notes say that for study.id 9362, they include the "same sites as 293-295 but data are aggregated by forest age rather than providing full details, took average lat/long and elevation across sites here." And for site.id 2218/study.id 9222: "averaged Lewis Canyon North/South." And for site.id 2218 & 2219/study.id 11321: "found average of density by young values, geolocation for each from googlemaps."

(4) 14006 is a total error. Thank you for catching it. Those should be two sites. That is something I will fix on my end and propagate through the datasheets (and send the cleaned version to you).

(5) I'm not quite tracking the issue for the nonsoil_litter_CWD.csv site.sitename. They should be the same names in sitesf and nonsoil_litter_CWD.csv. But regardless, I established the numeric codes to join data tables and never intended to join on the names. So the process should be cleaner/smoother if you focus on site.id rather than the names, like you propose.

teixeirak commented 5 years ago

Let's be careful with how these are changed, so as to not mess up the work that we (mostly Abby) have done on reviewing potential GROA-ForC site duplicates. @ValentineHerr, perhaps you could highlight any changes that would create problems on this end? Certainly, changing site.id for sites that are not involved in these categories would be very problematic (I assume you wouldn't do that, but just making sure!).

ValentineHerr commented 5 years ago

Thank you both for your quick answers. @teixeirak, do you have any guidance on how to handle (3) in @CookPatton's answer?

teixeirak commented 5 years ago

Could you clarify what you need from me? Is the question how to handle this in ForC? If so, good question! It's a bit tricky to say without looking at the studies. It sounds like perhaps they would be the same site but different plots.

ValentineHerr commented 5 years ago

Yes that is what I meant. Ok, I'll try to see what I can do. Thanks!

ValentineHerr commented 5 years ago

@CookPatton, do you agree that site.id 2199/2049 as well as site.id 3140/13974 and site.id 3907/5577 are the same sites? They have the same coordinates and almost all the same elevation, AMP, etc. (see "second" in the edited first comment of this issue).

Sorry if we already talked about these sites. I couldn't find where.

ValentineHerr commented 5 years ago

1 trio of site.id with the same sites.sitename (Luquillo Experimental Forest 100/3817/2414) is more complex. Our intern Abby identified both site.id 100 and 3817 as duplicates of ForC site.id 1142, AND the measurement references of ForC site 1142 and GROA site 2414 match exactly. So I think we will merge those 3 sites. @CookPatton, let us know if you feel strongly against this.

ValentineHerr commented 5 years ago

1 duo of site.id with the same sites.sitename (El Refugio 293/2417) is more complex. Our intern Abby identified both site.id 293 and 2417 as duplicates of ForC site.id 563. So I think we will merge those 2 sites. @CookPatton, let us know if you feel strongly against this.

CookPatton commented 5 years ago

@ValentineHerr for purposes of ForC merging, feel free to make judgement calls about when to combine/merge. My decision rule is still that if a site has unique coordinates, it should have a unique site.id. Names are more flexible.

For site.id 2199/2049 as well as site.id 3140/13974 and site.id 3907/5577 - I'll double check those. If they have the same coordinates then they should have the same site.id. I'm not sure why they don't.

CookPatton commented 5 years ago

@ValentineHerr

site.id 2199 is a truly duplicated row and there are no additional measurements associated with that site, so I deleted the entire line. Only 2046 (note, not 2049) should remain for Sungai Wain.

site.id 3140/13974 are the same site and I replaced 13974 with 3140 throughout.

site.id 3907/5577 are the same site and I replaced 5577 with 3907 throughout.

site.id 14006 was used twice. I changed the second instance (study.id 2244) to site.id 14266.
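The same corrections could be mirrored in downstream code as a small remapping step. This is only a sketch of that idea, not the code used on either end: the function name `apply_site_id_fixes`, the assumption that rows carry both `site.id` and `study.id`, and the demo rows (including the study.id values 1, 2, 3 and 9) are hypothetical.

```python
# Replacements described above: retired site.id -> site.id that is kept.
SITE_ID_REMAP = {"13974": "3140", "5577": "3907"}

# Rows removed outright (the truly duplicated row).
DELETED_SITE_IDS = {"2199"}

# A reused site.id that is split by study: (site.id, study.id) -> new site.id.
REASSIGN_BY_STUDY = {("14006", "2244"): "14266"}

def apply_site_id_fixes(rows):
    """Return new rows with deleted sites dropped and site.id values corrected."""
    fixed = []
    for row in rows:
        site_id = row["site.id"]
        if site_id in DELETED_SITE_IDS:
            continue
        key = (site_id, row.get("study.id"))
        # The study-specific reassignment wins over the plain remap.
        new_id = REASSIGN_BY_STUDY.get(key, SITE_ID_REMAP.get(site_id, site_id))
        fixed.append({**row, "site.id": new_id})
    return fixed

# Invented rows; only the site.id values and study.id 2244 come from this thread.
demo_rows = [
    {"site.id": "13974", "study.id": "1"},
    {"site.id": "5577", "study.id": "2"},
    {"site.id": "2199", "study.id": "3"},
    {"site.id": "14006", "study.id": "2244"},
    {"site.id": "14006", "study.id": "9"},
]
fixed_ids = [row["site.id"] for row in apply_site_id_fixes(demo_rows)]
print(fixed_ids)
```

Keeping the corrections in lookup tables like these makes it easy to delete an entry once the upstream data has been fixed.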

I'll push new data at the end of the day.

CookPatton commented 5 years ago

@ValentineHerr also for sites.sitename in the sitesf.csv versus in nonsoil_litter_CWD.csv. I think they actually are correct, but I used a different method than you do for ForC. In short, you should join on site.id rather than name, because I allowed there to be multiple names for a given site. These are locations that shared the same coordinates, but in the paper were described differently so I used the text field to track which data point went where. Hopefully this won't be an issue for you if you join on site.id.
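To illustrate why the key matters, here is a rough standard-library sketch comparing a join on `site.id` with a join on the name. The two inline tables stand in for sitesf.csv and nonsoil_litter_CWD.csv; their rows, the `value` column, and the helper names `read_rows` and `join_on` are assumptions made up for this example.

```python
import csv
import io

# Hypothetical stand-ins for the two files; all rows are invented.
SITESF_CSV = """site.id,site.sitename
10,North Plot
"""
MEASUREMENTS_CSV = """site.id,site.sitename,value
10,North Plot,1.5
10,North Plot (burned),2.5
"""

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def join_on(key, sites, measurements):
    """Keep each measurement row whose `key` value also appears in sites."""
    site_keys = {row[key] for row in sites}
    return [m for m in measurements if m[key] in site_keys]

sites = read_rows(SITESF_CSV)
measurements = read_rows(MEASUREMENTS_CSV)

joined_by_id = join_on("site.id", sites, measurements)
joined_by_name = join_on("site.sitename", sites, measurements)
print(len(joined_by_id), len(joined_by_name))
```

In this toy case the join on `site.id` keeps both measurement rows, while the join on the name silently drops the row whose text label differs.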

ValentineHerr commented 5 years ago

Thanks for following up on that. All is good: after I dealt with the duplicated site.id I mentioned above, I was able to merge things using site.id only.

ValentineHerr commented 5 years ago

@CookPatton, I believe the new version of sitesf.csv you pushed is missing some sites. For example, site 2290 (Griffin) is not in there anymore. There might be ~150 sites missing like that. Would you mind looking into it and putting them back (or removing the corresponding measurements if you removed the sites for a good reason)?
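A check along these lines can list the site.id values that still have measurements but no longer have a row in sitesf.csv. It is a sketch under the assumption that both tables carry a `site.id` column; the function name `missing_site_ids` and the demo rows are hypothetical.

```python
def missing_site_ids(site_rows, measurement_rows):
    """Return site.id values used by measurements but absent from the sites table."""
    known = {row["site.id"] for row in site_rows}
    used = {row["site.id"] for row in measurement_rows}
    # Sort numerically so the report is easy to scan.
    return sorted(used - known, key=int)

# Invented rows; they only mimic the situation described above.
demo_sites = [{"site.id": "100"}, {"site.id": "2414"}]
demo_measurements = [
    {"site.id": "100"},
    {"site.id": "2290"},
    {"site.id": "2290"},
    {"site.id": "2414"},
]
missing = missing_site_ids(demo_sites, demo_measurements)
print(missing)
```

Running the same function with the two tables swapped would instead list sites that have no measurements at all.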

CookPatton commented 5 years ago

@ValentineHerr forgive the back and forth, but the problem is that there are sites I have deleted from my analysis and it's not clear whether I should leave them for you all. For example, I found two sites (site.id = 8919 and 5358) where the dominant species are bamboo and mangrove species. Neither of these fits with my analysis, but we have the data... do you want them? I also removed a lot of sites where the geolocation puts them in a xeric shrubland or montane grassland. There are a few where the geolocation is in the ocean. Do you want any of these? @teixeirak please advise too!

ValentineHerr commented 5 years ago

@teixeirak, when you get a chance, could you advise on this? Thanks!

teixeirak commented 5 years ago

@ValentineHerr, @CookPatton, sorry for the slow response.
Regarding mangroves and bamboo: I err on the side of including sites rather than excluding them, but I wouldn't make any additional effort to include non-forest sites.
Re forests that fall in non-forest ecoregions: we definitely want to keep these, as there are of course patches of forest in non-forest ecoregions. If the coordinates are bad, the sites could be included with NAC for coordinates, which could later be inferred from maps. However, if you (@CookPatton) have already checked and the coordinates cannot be fixed, these sites should be dropped.