OHI-Science / ohiprep

Ocean Health Index data layer preparation

fixed duplicates in FAO-Commodities and in `name_to_rgn_id()` #42

Closed (bbest closed 10 years ago)

bbest commented 10 years ago

(OHI-Science/ohicore#34)

Renamed cbind_rgn() to name_to_rgn_id() (in src/R/ohi_clean_fxns.R) after a major overhaul of duplicate handling. The function now drops the original fld_name and returns rgn_id, with any duplicates collapsed by collapse_fxn. For instance, with FAO-Commodities the following duplicates were collapsed:

                         rgn_id
  country                  73 140 186 209 244
    China, Hong Kong SAR    0   0   0 180   0  # China to be summed?
    China, Macao SAR        0   0   0 144   0  # China to be summed?
    Martinique              0  36   0   0   0  # Guadeloupe | Martinique => Guadeloupe and Martinique region
    Serbia and Montenegro   0   0  36   0   0  # Serbia | Montenegro => Serbia and Montenegro region
    Un. Sov. Soc. Rep.     36   0   0   0   0  # Un. Sov. Soc. Rep. > Russian Federation => Russia
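
To make the collapsing step concrete, here is a minimal base R sketch of the idea. It is not the actual name_to_rgn_id() code: the helper collapse_by_rgn_id(), the toy values, and the lookup vector are hypothetical, and mapping China to 209 and Guadeloupe to 140 is an assumption read off the comments in the table above.

```r
# Toy input: the values are illustrative only, not real FAO data.
d <- data.frame(
  country = c("China", "China, Hong Kong SAR", "China, Macao SAR",
              "Guadeloupe", "Martinique"),
  value   = c(10, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Hypothetical name -> rgn_id lookup (assumed, see lead-in).
lookup <- c(
  "China"                = 209L,
  "China, Hong Kong SAR" = 209L,
  "China, Macao SAR"     = 209L,
  "Guadeloupe"           = 140L,
  "Martinique"           = 140L
)

# Hypothetical helper: attach rgn_id, drop the name field, then collapse
# rows that share an rgn_id with collapse_fxn (sum by default).
collapse_by_rgn_id <- function(d, lookup, fld_name, fld_value,
                               collapse_fxn = sum) {
  d$rgn_id <- unname(lookup[d[[fld_name]]])
  d[[fld_name]] <- NULL
  out <- aggregate(d[[fld_value]], by = list(rgn_id = d$rgn_id),
                   FUN = collapse_fxn)
  names(out)[2] <- fld_value
  out
}

out <- collapse_by_rgn_id(d, lookup, fld_name = "country", fld_value = "value")
print(out)
```

Note that names missing from the lookup would get an NA rgn_id, and aggregate() drops NA groups silently, so a real implementation would probably want to report unmatched names first.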

Other fixes:

Dropped from Netherlands Antilles expansion since already in data:
- Aruba
- Curacao
Added to src/LookupTables/rgn_eez_v2013a_synonyms.csv:
- Cabo Verde (ohi_region 41)
- Ethiopia PDR (landlocked)

jules32 commented 10 years ago

Hi @bbest,

I'm slightly editing a few lines in name_to_rgn_id.r because using `first` returns only the very first entry, so the output variable rgn has only one entry: rgn_id = 1 for Cocos Islands. I'm not sure what `first` was intended to do. I have made sure this isn't a matter of plyr being used instead of dplyr, but I will remove `first`.
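
For readers following along, a small sketch of how `first` behaves with and without grouping may help. It assumes a current dplyr with the `%>%` pipe rather than the older `%.%` operator used in this thread, and the toy data (including rgn_id = 2 for Christmas Island) is illustrative only.

```r
library(dplyr)

# Toy data with a duplicated region; IDs other than 1 are illustrative.
rgn <- data.frame(
  rgn_id   = c(1, 1, 2),
  rgn_name = c("Cocos Islands", "Cocos Islands", "Christmas Island"),
  stringsAsFactors = FALSE
)

# Without grouping, first() sees the whole column, so the result is one row.
one_row <- rgn %>%
  summarize(rgn_id = first(rgn_id), rgn_name = first(rgn_name))

# With grouping in effect, first() is evaluated once per rgn_id,
# which collapses duplicates to one row per region.
per_rgn <- rgn %>%
  group_by(rgn_id) %>%
  summarize(rgn_name = first(rgn_name))

print(one_row)
print(per_rgn)
```

If the grouped form also returns a single row in a given session, that suggests the grouping is not being honored, rather than a problem with `first` itself.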

jules32 commented 10 years ago

Sorry, I linked those lines to the current version instead of the historical record. This link shows what those lines were when `first` was included.

jules32 commented 10 years ago

Hi again @bbest,

It also seems that the group_by function is not working as we want it to. This bit will only return two columns:

          rgn_name rgn_type
1    Cocos Islands      eez
2 Christmas Island      eez
3   Norfolk Island      eez
4 Macquarie Island      eez
5    New Caledonia      eez
6          Vanuatu      eez

so I am changing it to include rgn_id in the summarize() part:

  group_by(rgn_id) %.%
    summarize(
      rgn_id,
      rgn_name = rgn_name,
      rgn_type = rgn_type) %.%
    ungroup()

I will keep you posted.

bbest commented 10 years ago

I like the problem labeling: "the Vanuatu issue", like "the Eritrea issue". Something is amiss here: any time you use group_by(rgn_id), the result of any summarize() operation should always include the rgn_id field. The fact that it doesn't, and that you have to include rgn_id explicitly, makes me suspicious as to whether it's grouping properly in the first place. I again suspect something is off with the package loading sequence, i.e. library(dplyr) should always come after library(reshape) and library(plyr).
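
One way to check the load-order suspicion is to see which package's summarize() is found first, and to sidestep masking with explicit `dplyr::` prefixes. The sketch below assumes a current dplyr with `%>%`; the plyr comparison only runs if plyr happens to be installed, and the toy rgn_id values are illustrative.

```r
library(dplyr)

# Toy region table; IDs are illustrative.
rgn <- data.frame(
  rgn_id   = c(1, 2, 3),
  rgn_name = c("Cocos Islands", "Christmas Island", "Norfolk Island"),
  rgn_type = "eez",
  stringsAsFactors = FALSE
)

# Which package supplies the summarize() that a bare call would use?
# If plyr was attached after dplyr, this would report "plyr".
print(environmentName(environment(summarize)))

# Explicit namespaces make the result independent of load order.
out <- rgn %>%
  dplyr::group_by(rgn_id) %>%
  dplyr::summarize(
    rgn_name = dplyr::first(rgn_name),
    rgn_type = dplyr::first(rgn_type)
  )
print(names(out))

# For comparison, plyr's summarize() knows nothing about dplyr groups,
# so it should return only the columns named in the call.
if (requireNamespace("plyr", quietly = TRUE)) {
  masked <- plyr::summarize(
    dplyr::group_by(rgn, rgn_id),
    rgn_name = rgn_name,
    rgn_type = rgn_type
  )
  print(names(masked))
}
```

If the plyr branch reproduces the two-column output shown earlier in this thread, that would support the masking explanation.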