BiologicalRecordsCentre / ABLE

Assessing ButterfLies in Europe project repository
2 stars 3 forks source link

Annual .csv exports - duplicated occurrences #543

Open xaviermestdagh opened 1 year ago

xaviermestdagh commented 1 year ago

Hi,

I am using annual exports to collect all records from the web app.

I found duplicated records for several walks. For instance, the Sample ID 16333273 (site :161 the 10/06/2021). There are 5 records of Pyrgus malvae on the section 9 (each with a different Occurrence ID). When editing the walk through butterfly-monitoring.net, only 1 individual has been observed on that section. (but Pyrgus malvae has been observed on 5 sections of this walk, a coincidence?). Here below a list of other visit IDs with duplicate records. (last column is the number of duplicates). image

SarahVray commented 1 year ago

Hi, to complement the example that Xavier gives for the Sample ID 16333273, here are two snapshots: 1) The 5 different occurence ID (= duplicates) for Pyrgus malvae in S9: Duplicates 2) The situation on butterfly-monitoring.net: Lallengerbierg_10-06-2021 Thanks in advance !

DavidRoy commented 1 year ago

@SarahVray @xaviermestdagh are there other examples of duplication or is this a one-off?

xaviermestdagh commented 1 year ago

Query12.xlsx

Hi @DavidRoy in the list attached here above, all rows with value 2 or more in the last column have duplicates

DavidRoy commented 1 year ago

@JimBacon does this relate to the duplication issue you were looking at previously?

JimBacon commented 1 year ago

There seem to be a few things going on here. I'm going to look at the following 3 visits in detail

Masquerading as Xavier, I did the download of occurrences from the annual report, combining data for all recorders and choosing the following options to include the records above.

Comparing my download with the report from Xavier

First conclusion is that Xavier and I are either getting different downloads or there is something amiss with his analysis. However, this does not remove the fact that there are apparently duplicates.

If I query the samples and occurrences tables in the database, the results I get are identical to the contents of the downloads that I made.

I.e. this query

select 
    walk.id,
    section.code,
    cttl.taxon,
    count(*)
from samples walk
join samples child
    on child.parent_id = walk.id and child.deleted = false
join locations section
    on section.id = child.location_id and section.deleted = false
join occurrences o
    on o.sample_id = child.id and o.deleted = false
join cache_taxa_taxon_lists cttl
    on cttl.id = o.taxa_taxon_list_id
where walk.id in (16333273, 14423574, 6017490)
group by walk.id, code, taxon
having count(*) > 1
order by walk.id, code, taxon
Returns these results id code taxon count
14423574 S1 Pieris 4
14423574 S2 Anthocharis cardamines 2
14423574 S2 Callophrys rubi 2
14423574 S2 Vanessa cardui 4
14423574 S3 Aglais io 2
14423574 S3 Anthocharis cardamines 2
14423574 S4 Pieris 2
14423574 S5 Aglais urticae 2
14423574 S5 Pieris 4
14423574 S6 Anthocharis cardamines 2
14423574 S6 Pieris 2
14423574 S9 Callophrys rubi 4
16333273 S9 Pyrgus malvae 5

The fact that, e.g. the database has 5 occurrences of Pyrgus malvae on section 9 of visit 16333273 is perfectly okay as far as the database is concerned. The fact that the website only displays one is probably a limitation of the website because this is not the expected data structure. The fact that this exists I expect is due to an error of data submission because I don't recall seeing this with other websites.

Querying across the whole EBMS Transects survey I find 2729 instances of this kind of duplicate, where a section has the same taxon occurrence more than once. The most recent is walk_id 21554775, section 7. This was created today so the problem, which it appears to be, is still present.

The start and end times for this were given as 12:16 to 13:10 and the database records were created at 12:11. Since the record is from Italy and the server time is in the UK time zone, this suggests that the results were submitted in real time. I therefore wonder if they have come from a mobile app.

Look at this:

Screenshot_20230404-211921

I can add the same species twice in a section but the website can only show it once.

Capture

In the database I get duplicates as seen previously. image

This does not mean all these duplicates have arisen in this way but it is at least one explanation. Questions:

JimBacon commented 1 year ago

Sorry, you might only want to read the bottom third of that last post.

JimBacon commented 1 year ago

@kazlauskis I spoke with David and we think this is one for you to look at. I find the issue is "When using the butterfly app to enter transect surveys, it is possible to select the same species multiple times within a section." The desired behaviour is to only allow a species to be added once per section. You may be able to help a user trying to add a species for a second time by showing the existing row.

JimBacon commented 1 year ago

There is some fixing up of the data to be done once the source of the problem has been resolved so pass this back to me, @kazlauskis, when you are done.

kazlauskis commented 1 year ago

In the app, if a species has already been entered and the user tries to add a second entry, we show a greyed-out result to indicate it is already part of the list. We can easily remove that option altogether, which will prevent multiple same-species entries, but then this raises a question of what to do when an existing species entry ID needs to be changed to something already part of the list. For example, someone records a Small white and a Large white species but later decides the Small white is the Large white. Currently, we allow changing the ID, which results in multiple same-species entries. Any thoughts?

JimBacon commented 1 year ago

The options I can imagine for changing ID are

  1. The recorder deletes Small White and modifies the abundance of Large White themselves,
  2. The recorder renames Small White to Large White and the app automatically combines the two entries, calculating the sum of abundances,

Another option would be to keep the UI as it is and combine records of the same taxon at submission but that is less open. I'm also wondering if people are able to mix latin names and common names and create duplicates in that way too.

I don't feel I am the best person to decide, though, so let's tag @DavidRoy

DavidRoy commented 1 year ago

@kazlauskis @JimBacon I think the app should do the work on this, so Jim's option 2 sounds right

kazlauskis commented 1 year ago

Each occurrence entry can contain [species, stage, comment, count, photos]. Without losing information, we can automatically merge the [species, count, photos]. We could concatenate comments into one, but it isn't clear what to do when someone has two entries with different stages, say, an Adult and a Larva. How should we merge the stage attribute?

JimBacon commented 1 year ago

That's interesting. According to the description of the butterfly transect protocol, one only records adults. The website does not allow recording of life stage either. Perhaps the option should be removed from the app. There is a risk that egg and larva records will be interpreted as adults when analysing results. Another one for @DavidRoy.

For info, in the EBMS Transects survey (id=562) there are

chrisvanswaay commented 1 year ago

We use the single-species counts for the counts of larvae of Proserpinus proserpina and excuviae of Stylurus flavipes (a dragonfly which is virtually impossible to count in another way). We also try to promote egg counts of Thecla betulae and Phengaris alcon using this, but not sure how much it has been used. Still I would like to keep this in, as such counts will become more important in future. But mixing adult and larval counts is probably not a good idea indeed.

larspett commented 1 year ago

I would be in favour of having life stage other than adult inactive as default. Eggs and larvae are (as seen in the example) rarely reported except during targeted surveys

JimBacon commented 1 year ago

Sorry, I was only meaning to suggest removing life stage from butterfly transect surveys. Not any of the others.

xaviermestdagh commented 1 year ago

I agree with @larspett to avoid having other stages than adults in transect counts, especially with the transect protocol focusing on adults. I guess @chrisvanswaay was speaking about single species (time) counts.

chrisvanswaay commented 1 year ago

Thanks, yes I was only referring to single species counts.

DavidRoy commented 1 year ago

@JimBacon @kazlauskis I agree that the transect surveys should be restricted to adults

xaviermestdagh commented 1 year ago

The fact that, e.g. the database has 5 occurrences of Pyrgus malvae on section 9 of visit 16333273 is perfectly okay as far as the database is concerned.

@JimBacon I had the opportunity to check the field paper form of Nicolas (the observer of this visit). He observed 1 individual of Pyrgus malvae on 5 sections (sections 2, 8, 9, 11 and 12). No 5 individuals on any of them. The encoding was done using the web encoding system, not the mobile app. I hope this is consistent with your investigation of the issue...

JimBacon commented 1 year ago

Hi @xaviermestdagh, That is very helpful information. It indicates that a second mechanism exists for creating these duplicates which I have not yet discovered.

kazlauskis commented 1 year ago

@CrisSevilleja, this requirement to add the stage and other attributes to transects originally came from you. Are you also OK if we remove the stage?

JimBacon commented 1 year ago
CrisSevilleja commented 1 year ago

sorry for not responding earlier to this issue @kazlauskis . Yes, I agree on removing the attribute stage from the transect as we assume we monitor only butterfly adults. Keep the stage field for other surveys, 15 min counts basically.

Now I am wondering about the possibility to enter the sex of each individual or even the caste (bumblebees) inside of every occurrence. As Karolis mentioned

Each occurrence entry can contain [species, stage, comment, count, photos]

I don't want to complicate this issue more, but I thought it would be interesting to discuss it here. Anyway, avoiding duplication is the priority and recording well the data should be done. I prefer to give priority to that than including sexes or cast inside the occurrences.

I will ask David about adding the bumblebee cast as a suggestion to the transect survey.

kazlauskis commented 1 year ago

@Vilius-Stankaitis

Vilius-Stankaitis commented 1 year ago

Hi, I have removed the "Stage" attribute, but in the background should the new occurrence be set to "Adult" or it should be empty?

DavidRoy commented 1 year ago

yes, please set stage to Adult

JimBacon commented 1 year ago

There is still an outstanding task of " @JimBacon look for another mechanism that could account for duplicates arising from the website."

JimBacon commented 1 year ago

I have not yet found a way to create duplicates by entering data through the website interface. Looking at who is creating duplicates, I found myself to be one of the main sources. This turned out to be from importing Croatian records supplied by file. The duplicates exist in the original data. I have contacted the organiser to clarify what we should do there.

JimBacon commented 1 year ago

The greatest source of duplicates (same taxon with multiple counts on one section) is the import of Spanish records. @DavidRoy, did you import the Spanish records? If so, do you still have the original files that you could send me?

There are over 2000 duplicate sets. For example, the first 5 sets from a query are shown in the screenshot below.

image

There is also evidence of a small number where corrections appear to have been attempted. In the screenshot below, an occurrence has been added some months after the import having a count equal to the sum of the duplicates. Unfortunately, the new occurrence has not replaced the duplicates so has added to the problem.

image

For reference, the query was

with dups as (
select 
    walk.id walk_id,
    child.id child_id,
    o.taxa_taxon_list_id,
    count(*)
from samples walk
join samples child
    on child.parent_id = walk.id and child.deleted = false
join locations section
    on section.id = child.location_id and section.deleted = false
join occurrences o
    on o.sample_id = child.id and o.deleted = false
where walk.survey_id = 562
    and walk.created_by_id = 1
group by walk.id, child.id, o.taxa_taxon_list_id
having count(*) > 1
order by walk.id, child.id, o.taxa_taxon_list_id
)
select parent_sample_id walk_id, o.id occurrence_id, location_name, 
    date_start, preferred_taxon, attr_sex_stage_count count, o.created_on
from cache_occurrences_functional o
join dups 
    on dups.child_id = o.sample_id
    and dups.taxa_taxon_list_id = o.taxa_taxon_list_id
join cache_taxa_taxon_lists ttl
    on ttl.id = o.taxa_taxon_list_id
join cache_occurrences_nonfunctional onf 
    on onf.id = o.id
order by location_name, date_start, preferred_taxon, o.created_on