forc-db / ForC

Global Forest Carbon Database
https://forc-db.github.io/
Creative Commons Attribution 4.0 International

Give D.precedence to duplicate records #80

Closed ValentineHerr closed 6 years ago

ValentineHerr commented 6 years ago

@teixeirak ,

I pushed the new system for ID-ing duplicates. You can now give precedence to the records in each D.group.

Please let me know if you find any problems.

teixeirak commented 6 years ago

Thanks, and ouch! This is way too many to go through manually. I'm going to start listing some rules to apply in order to cut this down:

teixeirak commented 6 years ago

@ValentineHerr, please implement the above. That should leave only a small (reasonable) number of records requiring manual review.

ValentineHerr commented 6 years ago

@teixeirak, to clarify: when you say "This is after resolving the issue above" or "After resolving all of the above", you mean the case where the previous rules either left multiple "1"s in D.precedence or none of them applied, right?

ValentineHerr commented 6 years ago

Also, just to make sure: when you say "differ only in...", is it really exclusive? For example, I have 2 duplicates from 2 different studies, with different units. If I follow your "differ only in units" statement literally, I don't pick OM vs. C but move on to giving the later study precedence.

teixeirak commented 6 years ago

First question: yes. Second question: actually, let's give OM precedence in the example above.

ValentineHerr commented 6 years ago

@teixeirak, could you review the order below? The rule would be to keep going down the list as long as D.precedence is still NA or multiple "1"s were assigned.

  1. Take OM over C
  2. Take longer study when length_longer_record = 1.75 * length_of_its_duplicates
  3. Take later study over older
  4. Take biggest depth (deepest record)
  5. Take smallest min.dbh

Should remain: records that differ only in method.ID and/or notes. I'll double-check whether dup.num is given to those and, if yes, I'll assign precedence. I don't want to do that first, to make sure most records are treated the same way.

Let me know if you approve.

teixeirak commented 6 years ago
  1. Take biggest depth (deepest record)
  2. Take smallest min.dbh
  3. Take longer study when length_longer_record = 1.75 * length_of_its_duplicates
  4. Take OM over C
  5. Take later study over older
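A minimal sketch (Python; the field names `depth`, `min_dbh`, `length`, `units`, and `year` are hypothetical stand-ins for the columns discussed above) of how this cascade might work: apply each tie-breaker in the approved order, stopping as soon as a single record remains, and skipping any rule that selects nobody (e.g. no study is 1.75 times longer than its duplicates).

```python
def cascade(records, rules):
    """Narrow a duplicate group down with successive tie-breaker rules."""
    candidates = list(records)
    for rule in rules:
        if len(candidates) <= 1:
            break
        winners = rule(candidates)
        if winners:           # a rule that selects nobody is skipped
            candidates = winners
    return candidates

# Tie-breakers in the approved order (1-5).
rules = [
    lambda rs: [r for r in rs if r["depth"] == max(x["depth"] for x in rs)],
    lambda rs: [r for r in rs if r["min_dbh"] == min(x["min_dbh"] for x in rs)],
    lambda rs: [r for r in rs
                if all(r["length"] >= 1.75 * x["length"]
                       for x in rs if x is not r)],
    lambda rs: [r for r in rs if r["units"] == "OM"],
    lambda rs: [r for r in rs if r["year"] == max(x["year"] for x in rs)],
]

group = [
    {"id": 1, "depth": 10, "min_dbh": 5, "length": 2, "units": "C",  "year": 2001},
    {"id": 2, "depth": 10, "min_dbh": 5, "length": 2, "units": "OM", "year": 1999},
]
keep = cascade(group, rules)
print(keep[0]["id"])   # records tie on depth, min.dbh and length, so OM wins: 2
```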

I have mixed feelings about giving this automated process precedence over the rankings that were determined manually, as that ranking would often incorporate specific knowledge about the records. Let's try this and compare before finalizing that decision.

ValentineHerr commented 6 years ago

@teixeirak, double checking a couple things:

  1. For the C or OM unit rule, I understand:

    • If C is not within 0.45 to 0.55 times OM --> give 1 to both C and OM and move down the list
    • If there are several C (and several OM): give 1 to all OM; give 0 to every C that is within 0.45 to 0.55 times any OM; give 1 to every C that is not within that range of any OM; then move down the list.
  2. Same as above, but for the duration of the record.

  3. min.dbh:

    • If some min.dbh are reported and others are not (NA), give 1 to the smallest min.dbh AND to the missing min.dbh
  4. same as above for depth

  5. I am thinking about coding on the notes field, looking for "only" and "+" or "all", and giving 0 or 1 for D.precedence when there is a clear distinction about how inclusive the records are. If I manage to do that, where should this rule go in the list? In other words, how important is it compared to min.dbh, depth, etc.? The higher in the list, the more important.
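For point 1, a small illustrative sketch (Python; the 0.45-0.55 window is the one stated above, everything else - the function name and the `(units, value)` layout - is a hypothetical simplification): OM records keep precedence, and a C record is zeroed only when it falls within the window of some OM value in the same duplicate group.

```python
def om_vs_c_precedence(records):
    """records: list of (units, value) pairs; returns a parallel list of 0/1."""
    om_values = [v for units, v in records if units == "OM"]
    flags = []
    for units, value in records:
        if units == "OM":
            flags.append(1)
        elif any(0.45 * om <= value <= 0.55 * om for om in om_values):
            flags.append(0)   # C consistent with an OM duplicate: drop it
        else:
            flags.append(1)   # C matches no OM: keep it and move down the list
    return flags

print(om_vs_c_precedence([("OM", 200.0), ("C", 100.0), ("C", 150.0)]))
# first C is 0.5 * OM -> 0; second C matches no OM -> stays 1: [1, 0, 1]
```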

ValentineHerr commented 6 years ago

@teixeirak, I just pushed the measurements with updated D.precedence. There are 100 records that need to be done manually (they have NAC in the D.precedence column and "D.precedence given manually." in notes). FYI, 252 records were given D.precedence based on dup.num; this is specified in the notes too.

teixeirak commented 6 years ago

Thank you. Could you please put the notes on this in conflict.notes instead of notes?

ValentineHerr commented 6 years ago

Oh yes, sorry, I forgot about that column. OK, done.

teixeirak commented 6 years ago

There are some records with NA in the conflicts field. Could you please fix?

ValentineHerr commented 6 years ago

Done, sorry about that. I hadn't noticed that some records slipped through the net. I double-checked and they should all be Independent.

teixeirak commented 6 years ago

Fix Faber-Langendoen_1992_ecor sites:

teixeirak commented 6 years ago

It looks like you've accidentally printed out several extra columns at the end of MEASUREMENTS.

ValentineHerr commented 6 years ago

Shoot, sorry about that... I fixed it.

ValentineHerr commented 6 years ago

For 1042-1045, I think they are getting S because they have no dates at all and no stand.age, so technically we don't know whether they are Replicates or not. It would take a bit more coding to handle this special case; I am happy to do it if you think it is necessary.

teixeirak commented 6 years ago

999 means the stand age is intact/undisturbed/old growth, not unknown. Unknown stand ages get missing-value codes. So please change the code so that it treats '999' as such. Let's say '999' conflicts with stand.age > 100.

teixeirak commented 6 years ago

Alternatively, if coding this is complicated, it's fine to fix by hand.

ValentineHerr commented 6 years ago

No, it is okay, it should be fine. I forgot about this code; I think I was thinking of climate data, where it is interpreted as "missing".

Have you looked at everything? Did you edit the D.precedence? Let me know when I can run the code again.

teixeirak commented 6 years ago

Please run it now. I'll edit D.precedence once those are done. I've scanned the other records and haven't noticed other problems, but it is possible I'll find more as I look them over carefully. This is tricky in that D.precedence can be edited by hand, but I don't want to just edit the other columns; those need to be fixed in the code (unless we give up on the idea of having the code get them all right).

teixeirak commented 6 years ago

@ValentineHerr, I've finished assigning D.precedence. I edited some records by hand (and added conflict.notes). There were a couple instances where I changed fields other than D.precedence. I also deleted some records.

ValentineHerr commented 6 years ago

Sorry, I had to run an errand Friday afternoon and it took longer than expected... I understand that I don't need to re-run anything, right?

I am working on resolving conflicts now. I found one record (ID 15293) that has "1" for D.precedence but a capital S in conflicts. It was given manually. Did you mean to do that?

ValentineHerr commented 6 years ago

That is a tricky one, but I think there should be only zeroes for precedence in D.group 720, and the 4th record should have received a 1 for precedence in D.group 721. So in the end we would only keep records 4 and 5.

Do you agree ?

```
ID  measurement.ID  sites.sitename           plot.name              stand.age  variable.name  date  start.date   end.date     conflicts  S.group  D.group  D.precedence  conflict.type  conflicts.notes
1   15290           Tumbarumba flux station  mature managed forest  90         NEE_C          NA    2002.084932  2002.832877  D          NA       720,721  0             M T            D.precedence given manually.
2   15291           Tumbarumba flux station  mature managed forest  90         NEE_C          NA    2002         2003         D,S        176      720      0             M T            NA
3   15293           Tumbarumba flux station  mature managed forest  90         NEE_C          NA    2001.084932  2004.163934  D,S        177      720      1             M T            D.precedence given manually.
4   15294           Tumbarumba flux station  mature managed forest  90         NEE_C          2002  NA           NA           D,s        176,177  721      0             M T            NA
5   15295           Tumbarumba flux station  mature managed forest  90         NEE_C          2003  NA           NA           s          176,177  NA       NA            M T            NA
```
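
The consistency check implied here can be sketched as follows (Python; the `(D.group, D.precedence)` layout and function name are hypothetical): after resolution, no D.group should retain more than one record with D.precedence = 1, keeping in mind that a record can belong to several groups, as in "720,721" above.

```python
from collections import defaultdict

def ones_per_group(records):
    """records: list of (d_groups, d_precedence); d_groups like '720,721'.
    Returns the groups that wrongly hold more than one precedence-1 record."""
    counts = defaultdict(int)
    for d_groups, precedence in records:
        if d_groups in (None, "NA"):
            continue                       # record belongs to no D.group
        for group in d_groups.split(","):
            if precedence == 1:
                counts[group] += 1
    return {g: n for g, n in counts.items() if n > 1}

# The resolution proposed above: all zeroes in 720, one "1" in 721.
records = [("720,721", 0), ("720", 0), ("720", 0), ("721", 1), ("NA", None)]
print(ones_per_group(records))   # {} -> no group has more than one "1"
```
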
teixeirak commented 6 years ago

Regarding Tumbarumba, that is tricky. I agree with your assessment.

teixeirak commented 6 years ago

Regarding ORNL-FACE, yes, please fix as you suggest.

ValentineHerr commented 6 years ago

I fixed a few problems that I found. Not sure why my code didn't catch them, but there were not many.