borealbirds / WildTrax-dataHarmonization

This repo aims to keep everything that is needed to translate source data to expected WildTrax upload format

NEFBMP2012-19 - Duplicates in source data #1

Open MelinaHoule opened 2 years ago

MelinaHoule commented 2 years ago

Rows are flagged as duplicates when they share the same Location, Date/Time, Species, Abundance and Protocols (distance/duration).
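A minimal sketch of that check, using only the Python standard library. The column names in `KEY_FIELDS` and the distance/duration values are assumptions made for illustration; the actual headers in the Bird Data sheet may differ.

```python
from collections import Counter

# Hypothetical column names standing in for Location, Date/Time, Species,
# Abundance and Protocols (distance/duration).
KEY_FIELDS = ("location", "datetime", "species", "abundance", "distance", "duration")


def find_duplicates(rows, key_fields=KEY_FIELDS):
    """Return {key: count} for every key that occurs more than once."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative rows only; the distance and duration values are made up.
rows = [
    {"location": "601", "datetime": "2015-06-20", "species": "BADO",
     "abundance": 1, "distance": "0-50m", "duration": "10min"},
    {"location": "601", "datetime": "2015-06-20", "species": "BADO",
     "abundance": 1, "distance": "0-50m", "duration": "10min"},
    {"location": "602", "datetime": "2015-06-20", "species": "BADO",
     "abundance": 1, "distance": "0-50m", "duration": "10min"},
]

duplicates = find_duplicates(rows)
for key, n in duplicates.items():
    print(n, "rows share", key)
```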

NEFBMP2012-19 has duplicates in the source data (sheet: Bird Data).

Example: Point_number = 601; Observer: Wildgust, Allon; Date: 2015-06-20; Species: BADO

We treat them as duplicates for now.

We are waiting to hear back from our partner to confirm whether they are real duplicates or whether we should add them up to make an abundance of 2.

MelinaHoule commented 2 years ago

Answer from the data partner: "I don't believe these are duplicate entries--all data were checked with original field sheets after data entry occurred, so these raw data should be summed. Of course I can't rule out the possibility that there was only one BADO which was double-entered, and then that error was missed during the error-checking process, but that would be an isolated occurrence and very unlikely to happen."

Duplicates occur 10% of the time, so they cannot be considered isolated occurrences. I propose that we sum those rows.
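The summing could look roughly like the sketch below. It groups rows on the identifying fields (everything except abundance) and adds the abundances together. The field names in `ID_FIELDS` are hypothetical, and this is not necessarily how the harmonization script does it.

```python
# Hypothetical identifying columns; abundance is deliberately left out
# because it is the value being summed.
ID_FIELDS = ("location", "datetime", "species", "distance", "duration")


def sum_duplicates(rows, id_fields=ID_FIELDS):
    """Collapse rows that share the same identifying fields, summing abundance."""
    totals = {}
    for row in rows:
        key = tuple(row[f] for f in id_fields)
        if key in totals:
            totals[key]["abundance"] += row["abundance"]
        else:
            totals[key] = dict(row)  # copy so the input rows are not modified
    return list(totals.values())


# Illustrative rows only; the distance and duration values are made up.
source_rows = [
    {"location": "601", "datetime": "2015-06-20", "species": "BADO",
     "abundance": 1, "distance": "0-50m", "duration": "10min"},
    {"location": "601", "datetime": "2015-06-20", "species": "BADO",
     "abundance": 1, "distance": "0-50m", "duration": "10min"},
    {"location": "602", "datetime": "2015-06-20", "species": "BADO",
     "abundance": 1, "distance": "0-50m", "duration": "10min"},
]

summed = sum_duplicates(source_rows)
```

With these rows, the two entries at point 601 collapse into a single row with an abundance of 2, and the row at point 602 is left unchanged.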

Another kind of duplicate exists: 24 rows have identical attributes except for detection_cues. Since detection cues are recorded in the extended table, I propose that we sum the abundance in the survey table but keep the rows separate in the extended table so that the proper behavior is recorded. The abundance attribute is found in both tables. To avoid confusion, we may need to rename abundance in the extended table to reflect that difference.
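One way this split could work is sketched below: the survey table gets one row per observation with the summed abundance, and the extended table keeps one row per detection cue. The field names, the cue values, and the renamed column `abundance_by_cue` are all hypothetical and only serve to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical identifying columns for one observation.
OBS_FIELDS = ("location", "datetime", "species")


def split_tables(rows, obs_fields=OBS_FIELDS):
    """Build survey rows (abundance summed over cues) and extended rows (per cue)."""
    obs_fields = tuple(obs_fields)
    survey = defaultdict(int)
    extended = defaultdict(int)
    for row in rows:
        key = tuple(row[f] for f in obs_fields)
        survey[key] += row["abundance"]
        extended[key + (row["detection_cues"],)] += row["abundance"]
    survey_rows = [
        dict(zip(obs_fields, key), abundance=total)
        for key, total in survey.items()
    ]
    # "abundance_by_cue" is a made-up name for the renamed extended column.
    extended_rows = [
        dict(zip(obs_fields + ("detection_cues",), key), abundance_by_cue=total)
        for key, total in extended.items()
    ]
    return survey_rows, extended_rows


# Illustrative rows only; the cue values are made up.
cue_rows = [
    {"location": "601", "datetime": "2015-06-20", "species": "BADO",
     "abundance": 1, "detection_cues": "song"},
    {"location": "601", "datetime": "2015-06-20", "species": "BADO",
     "abundance": 1, "detection_cues": "call"},
]

survey_rows, extended_rows = split_tables(cue_rows)
```

Here the survey table holds a single BADO row with an abundance of 2, while the extended table holds two rows, one per cue, whose `abundance_by_cue` values add up to the survey total.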