forc-db / ForC

Global Forest Carbon Database
https://forc-db.github.io/
Creative Commons Attribution 4.0 International

Give D.precedence to duplicate records #80

Closed ValentineHerr closed 6 years ago

ValentineHerr commented 6 years ago

@teixeirak ,

I pushed the new system for ID-ing duplicates. You can now give precedence to the records in each D.group.

Please let me know if you find any problems.

teixeirak commented 6 years ago

Thanks, and ouch! This is way too many to go through manually. I'm going to start listing some rules to apply in order to cut this down:

teixeirak commented 6 years ago

@ValentineHerr, please implement the above. That should leave only a small (reasonable) number of records requiring manual review.

ValentineHerr commented 6 years ago

@teixeirak, to clarify: when you say "This is after resolving the issue above" or "After resolving all of the above", you mean the case where the previous rules either left multiple "1"s in D.precedence or none of them applied, right?

ValentineHerr commented 6 years ago

Also, just to make sure: when you say "differ only in...", is it really exclusive? For example, I have 2 duplicates from 2 different studies, with different units. If I follow your "differ only in units" statement literally, I don't pick OM vs. C but move on to giving the later study precedence.

teixeirak commented 6 years ago

First question: yes. Second question: actually, let's give OM precedence in the example above.

ValentineHerr commented 6 years ago

@teixeirak, could you review the order below? The rule would be to keep going down the list as long as D.precedence is still NA or multiple "1"s were assigned.

  1. Take OM over C
  2. Take longer study when length_longer_record = 1.75 * length_of_its_duplicates
  3. Take later study over older
  4. Take biggest depth (deepest record)
  5. Take smallest min.dbh

Should remain: records that differ only in method.ID and/or notes. I'll double-check whether dup.num is given to those and, if yes, I'll assign precedence. I don't want to do that first, to make sure most records are treated the same way.

Let me know if you approve.

teixeirak commented 6 years ago
  1. Take biggest depth (deepest record)
  2. Take smallest min.dbh
  3. Take longer study when length_longer_record = 1.75 * length_of_its_duplicates
  4. Take OM over C
  5. Take later study over older
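A minimal sketch (Python; the field names `depth`, `min_dbh`, `length`, `units`, and `year` are hypothetical stand-ins for the columns discussed above) of how this cascade might work: apply each tie-breaker in the approved order, stopping as soon as a single record remains, and skipping any rule that selects nobody (e.g. no study is 1.75 times longer than its duplicates).

```python
def cascade(records, rules):
    """Narrow a duplicate group down with successive tie-breaker rules."""
    candidates = list(records)
    for rule in rules:
        if len(candidates) <= 1:
            break
        winners = rule(candidates)
        if winners:           # a rule that selects nobody is skipped
            candidates = winners
    return candidates

# Tie-breakers in the approved order (1-5).
rules = [
    lambda rs: [r for r in rs if r["depth"] == max(x["depth"] for x in rs)],
    lambda rs: [r for r in rs if r["min_dbh"] == min(x["min_dbh"] for x in rs)],
    lambda rs: [r for r in rs
                if all(r["length"] >= 1.75 * x["length"]
                       for x in rs if x is not r)],
    lambda rs: [r for r in rs if r["units"] == "OM"],
    lambda rs: [r for r in rs if r["year"] == max(x["year"] for x in rs)],
]

group = [
    {"id": 1, "depth": 10, "min_dbh": 5, "length": 2, "units": "C",  "year": 2001},
    {"id": 2, "depth": 10, "min_dbh": 5, "length": 2, "units": "OM", "year": 1999},
]
keep = cascade(group, rules)
print(keep[0]["id"])   # records tie on depth, min.dbh and length, so OM wins: 2
```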

I have mixed feelings about giving this automated process precedence over the rankings that were determined manually, as that ranking would often incorporate specific knowledge about the records. Let's try this and compare before finalizing that decision.

ValentineHerr commented 6 years ago

@teixeirak, double checking a couple things:

  1. For the C or OM unit rule, I understand:

    • If C is not within 0.45 to 0.55 times OM --> give 1 to both C and OM and move down the list
    • If there are several C (and several OM): give 1 to all OM; give 0 to every C that is within 0.45 to 0.55 times any OM; give 1 to every C that is not within that range of any OM; then move down the list.
  2. Same as above, but for the duration of the record.

  3. min.dbh:

    • If some min.dbh are reported and others are not (NA), give 1 to the smallest min.dbh AND to the missing min.dbh
  4. same as above for depth

  5. I am thinking about coding on the notes field, looking for "only" and "+" or "all", and giving 0 or 1 for D.precedence when there is a clear distinction about how inclusive the records are. If I manage to do that, where should this rule go in the list? In other words, how important is it compared to min.dbh, depth, etc.? The higher in the list, the more important.
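For point 1, a small illustrative sketch (Python; the 0.45-0.55 window is the one stated above, everything else - the function name and the `(units, value)` layout - is a hypothetical simplification): OM records keep precedence, and a C record is zeroed only when it falls within the window of some OM value in the same duplicate group.

```python
def om_vs_c_precedence(records):
    """records: list of (units, value) pairs; returns a parallel list of 0/1."""
    om_values = [v for units, v in records if units == "OM"]
    flags = []
    for units, value in records:
        if units == "OM":
            flags.append(1)
        elif any(0.45 * om <= value <= 0.55 * om for om in om_values):
            flags.append(0)   # C consistent with an OM duplicate: drop it
        else:
            flags.append(1)   # C matches no OM: keep it and move down the list
    return flags

print(om_vs_c_precedence([("OM", 200.0), ("C", 100.0), ("C", 150.0)]))
# first C is 0.5 * OM -> 0; second C matches no OM -> stays 1: [1, 0, 1]
```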

ValentineHerr commented 6 years ago

@teixeirak, I just pushed the measurements with updated D.precedence. There are 100 records that need to be done manually (they have NAC in the D.precedence column and "D.precedence given manually." in notes). FYI, 252 records were given D.precedence based on dup.num; this is specified in the notes too.

teixeirak commented 6 years ago

Thank you. Could you please put the notes on this in conflict.notes instead of notes?

ValentineHerr commented 6 years ago

Oh yes, sorry, I forgot about that column. OK, done.

teixeirak commented 6 years ago

There are some records with NA in the conflicts field. Could you please fix?

ValentineHerr commented 6 years ago

Done, sorry about that. I hadn't noticed that some records slipped through the net. I double-checked and they should all be Independent.

teixeirak commented 6 years ago

Fix Faber-Langendoen_1992_ecor sites:

teixeirak commented 6 years ago

It looks like you've accidentally printed out several extra columns at the end of MEASUREMENTS.

ValentineHerr commented 6 years ago

Shoot, sorry about that... I fixed it.

ValentineHerr commented 6 years ago

For 1042-1045, I think they are getting S because they have no dates at all and no stand.age, so technically we don't know whether they are Replicates or not. It would take a bit more coding to handle this special case; I am happy to do it if you think it is necessary.

teixeirak commented 6 years ago

999 means the stand age is intact/undisturbed/old growth, not unknown. Unknown stand ages get missing-value codes. So please change the code so that it treats '999' as such. Let's say '999' conflicts with stand.age > 100.

teixeirak commented 6 years ago

Alternatively, if coding this is complicated, it's fine to fix by hand.

ValentineHerr commented 6 years ago

No, it is okay, it should be fine. I forgot about this code; I think I was thinking of climate data, where it is interpreted as "missing".

Have you looked at everything? Did you edit the D.precedence? Let me know when I can run the code again.

teixeirak commented 6 years ago

Please run it now. I'll edit D.precedence once those are done. I've scanned the other records and haven't noticed other problems, but it is possible I'll find more as I look them over carefully. This is tricky in that D.precedence can be edited by hand, but I don't want to just edit the other columns; those need to be fixed in the code (unless we give up on the idea of having the code get them all right).

teixeirak commented 6 years ago

@ValentineHerr, I've finished assigning D.precedence. I edited some records by hand (and added conflict.notes). There were a couple instances where I changed fields other than D.precedence. I also deleted some records.

ValentineHerr commented 6 years ago

Sorry, I had to run an errand Friday afternoon and it took longer than expected... I understand that I don't need to re-run anything, right?

I am working on resolving conflicts now. I found one record (ID 15293) that has "1" for D.precedence but a capital S in conflicts. It was given manually. Did you mean to do that?

ValentineHerr commented 6 years ago

That is a tricky one, but I think there should be only zeroes for precedence in D.group 720, and the 4th record should have received a 1 for precedence in D.group 721. So in the end we would only keep records 4 and 5.

Do you agree ?

```
ID  measurement.ID  sites.sitename           plot.name              stand.age  variable.name  date  start.date   end.date     conflicts  S.group  D.group  D.precedence  conflict.type  conflicts.notes
1   15290           Tumbarumba flux station  mature managed forest  90         NEE_C          NA    2002.084932  2002.832877  D          NA       720,721  0             M T            D.precedence given manually.
2   15291           Tumbarumba flux station  mature managed forest  90         NEE_C          NA    2002         2003         D,S        176      720      0             M T            NA
3   15293           Tumbarumba flux station  mature managed forest  90         NEE_C          NA    2001.084932  2004.163934  D,S        177      720      1             M T            D.precedence given manually.
4   15294           Tumbarumba flux station  mature managed forest  90         NEE_C          2002  NA           NA           D,s        176,177  721      0             M T            NA
5   15295           Tumbarumba flux station  mature managed forest  90         NEE_C          2003  NA           NA           s          176,177  NA       NA            M T            NA
```
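
The consistency check implied here can be sketched as follows (Python; the `(D.group, D.precedence)` layout and function name are hypothetical): after resolution, no D.group should retain more than one record with D.precedence = 1, keeping in mind that a record can belong to several groups, as in "720,721" above.

```python
from collections import defaultdict

def ones_per_group(records):
    """records: list of (d_groups, d_precedence); d_groups like '720,721'.
    Returns the groups that wrongly hold more than one precedence-1 record."""
    counts = defaultdict(int)
    for d_groups, precedence in records:
        if d_groups in (None, "NA"):
            continue                       # record belongs to no D.group
        for group in d_groups.split(","):
            if precedence == 1:
                counts[group] += 1
    return {g: n for g, n in counts.items() if n > 1}

# The resolution proposed above: all zeroes in 720, one "1" in 721.
records = [("720,721", 0), ("720", 0), ("720", 0), ("721", 1), ("NA", None)]
print(ones_per_group(records))   # {} -> no group has more than one "1"
```
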
teixeirak commented 6 years ago

Regarding Tumbarumba, that is tricky. I agree with your assessment.

teixeirak commented 6 years ago

Regarding ORNL-FACE, yes, please fix as you suggest.

ValentineHerr commented 6 years ago

I fixed a few problems that I found. Not sure why my code didn't catch them, but there were not many.