SPI-Birds / pipelines

Pipelines for generating a standard data format for bird data

Fix issue with unique IndvID's within populations but not within datasets #170

Open StefanVriend opened 3 years ago

StefanVriend commented 3 years ago

Currently, check I2 checks whether there are duplicated IndvIDs within populations and throws an error if so. Among populations, IndvIDs are not assumed to be unique.

However, when working on the UAN pipeline, I found 6 cases of individuals captured in both populations (i.e. BOS and PEE). Most likely these are the same individuals, so perhaps we need to update this check to test for unique IDs among data owners? Alternatively, this can be included in a test like the one mentioned in #29.

What are your thoughts, @LiamDBailey, @cwtyson, @ChloeRN? Do you ever come across similar situations in other datasets? If so, how do you then deal with it?

cwtyson commented 3 years ago

In the GLA data (which covers 5 populations), there are about 100 cases where individuals were captured at two populations. In creating the Individual data from the Capture data, I used the CapturePopID of the first capture to assign the PopID for these individuals. In this case, it seems worthwhile to use the data we have from the multiple populations to give more complete information about the individual. If we can be confident that band numbers likely refer to the same individual, then it does seem like it would make sense to check for duplicated IDs across data owners.
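A minimal sketch of the first-capture approach described above, not the actual GLA pipeline code. The toy Capture data (IDs, dates, population codes) is invented for illustration, and the CaptureDate column is an assumption about how captures are ordered.

```r
library(dplyr)

# Toy Capture data; values are invented for illustration only.
Capture_data <- data.frame(
  IndvID       = c("A1", "A1", "B2", "C3", "C3"),
  CapturePopID = c("BOS", "PEE", "BOS", "PEE", "BOS"),
  CaptureDate  = as.Date(c("2015-05-01", "2016-05-10", "2015-05-03",
                           "2014-06-01", "2015-06-02")),
  stringsAsFactors = FALSE
)

# One row per individual; PopID is taken from the earliest capture.
Individual_first <- Capture_data %>%
  dplyr::arrange(IndvID, CaptureDate) %>%
  dplyr::group_by(IndvID) %>%
  dplyr::summarise(PopID = dplyr::first(CapturePopID), .groups = "drop")
```

Here C3 ends up with PopID "PEE" even though it was later captured in BOS, which is exactly the situation the rest of this thread is about.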

StefanVriend commented 3 years ago

So, if I understand correctly, Chris, you interpret that these observations indeed refer to the same individual? That is, 1 row in Individual data (with PopID = first(CapturePopID)), and multiple rows in Capture data with varying CapturePopIDs?

ChloeRN commented 3 years ago

In the PiedFlyNet data, this also happens and actually constitutes very valuable information on immigration/emigration! @LiamDBailey and I talked about this a bit while I was working with the combined pipeline for all PiedFlyNet datasets. If I recall correctly, we agreed that in such cases, there should be multiple entries in Individual data, i.e. one for each IndvID-PopID combination, but static data (such as RingYear) should be shared by all entries. Let me double-check this quickly!

(If that is indeed the case, then @StefanVriend's suggestion to test for unique IDs among data owners instead of populations may be sensible!)

cwtyson commented 3 years ago

Stefan, correct. Though this is something that I still want to confirm with the data owner.

StefanVriend commented 3 years ago

If it is indeed so as you describe, @ChloeRN, we have two approaches/solutions to the same "problem". Both have their merit and make sense, but maybe it's good to find a solution that's consistent across pipelines?

In any case, I think you make a good point, @cwtyson. Regardless of what our approach is or will be, this should be verified with the data owner.

ChloeRN commented 3 years ago

Okay, I double-checked. For the PFN pipeline (where the data owner has confirmed that the same IndvID in different populations refers to the same individual) we do the following:

The main idea behind doing this is that when we subset by PopID to provide data on request, all relevant individuals are contained in the data subset. With your current solution @cwtyson, what can happen is that individuals will be missing from Individual data in a subset even though they appear in Capture data (because their entry in Individual data was linked only to the population of origin, not the population it was in later in life).

For that reason, I do think we should have all unique IndvID-PopID combinations in Individual data.

(NOTE: You are right that information on individuals should be maximized, and that's why the summaries for writing Individual data should still be done across all entries of the same individual, irrespective of CapturePopID. In the case of PFN, I also labelled BroodIDs as "PopID-OriginalBroodID". That way, it's clear from one glance at Individual data that a certain individual was born in another population.)
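A rough sketch of the "one row per IndvID-PopID combination" idea, not the actual PFN pipeline code. The toy data is invented, and deriving RingYear from the year of the earliest capture is an assumption made only to show a static value being summarised across all captures of an individual.

```r
library(dplyr)

# Toy Capture data; values are invented for illustration only.
Capture_data <- data.frame(
  IndvID       = c("A1", "A1", "B2", "C3", "C3"),
  CapturePopID = c("BOS", "PEE", "BOS", "PEE", "BOS"),
  CaptureDate  = as.Date(c("2015-05-01", "2016-05-10", "2015-05-03",
                           "2014-06-01", "2015-06-02")),
  stringsAsFactors = FALSE
)

Individual_multi <- Capture_data %>%
  # Static data is summarised per individual, irrespective of CapturePopID.
  dplyr::group_by(IndvID) %>%
  dplyr::mutate(RingYear = min(as.integer(format(CaptureDate, "%Y")))) %>%
  dplyr::ungroup() %>%
  # One row per unique IndvID-PopID combination.
  dplyr::rename(PopID = CapturePopID) %>%
  dplyr::distinct(IndvID, PopID, RingYear)

# Subsetting by PopID now returns every individual seen in that population.
PEE_individuals <- Individual_multi %>%
  dplyr::filter(PopID == "PEE")
```

With this layout, A1 and C3 both appear in the PEE subset, and each carries the same RingYear in all of its rows.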

Let's see what @LiamDBailey says. In the meantime, the most crucial thing is indeed to check carefully with the data owner.

cwtyson commented 3 years ago

@ChloeRN Makes sense about subsetting based on a population and then losing entries in the Individual data. I'll update the GLA_pipeline to include all unique PopID-IndvIDs in the Individual data. @StefanVriend Glad you brought this up!

StefanVriend commented 3 years ago

Thanks for checking, @ChloeRN. Indeed, multiple entries in Individual data solve subsetting issues. I wonder, though, how in such situations we would be able to detect errors in PopID (i.e. individuals who haven't left a population but have wrongly been recorded as such).

I just had a thought about removing PopID entirely from Individual data, as I imagine individuals of future species which are far less philopatric moving between locations that different data owners manage. Then again, this comes with a whole new range of issues, I suppose...

Let's see what Liam says, and otherwise, we have a nice topic for discussion during our next dev meeting.

ChloeRN commented 3 years ago

@StefanVriend I don't think we can (or should!) remove PopID from Individual data. Subsetting by population will not be possible anymore if we do that... Plus, personally, I really like that you can use Individual data to quickly check if individuals appear in multiple populations.

You are, however, correct that this means we cannot distinguish true migrants from individuals that have been accidentally assigned to the wrong population. At least not at the level of the checks (can be taken up with data owners during pipeline building though). I would think that such errors will be extremely rare though, so probably not an issue.

One way of dealing with it could be to flag all individuals that appear in multiple populations with a "warning" (instead of an error). That way, data owners could check them and confirm on a case-by-case basis whether or not this is a true migrant or an error.
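A possible shape for such a warning-level check, sketched under assumptions: check_multiple_pops() is a hypothetical helper, and its return structure (a logical Warning plus a Flagged table) is invented here and is not the package's actual check output. The toy data is also invented.

```r
library(dplyr)

# Toy Capture data; values are invented for illustration only.
Capture_data <- data.frame(
  IndvID       = c("A1", "A1", "B2", "C3", "C3"),
  CapturePopID = c("BOS", "PEE", "BOS", "PEE", "BOS"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: flags individuals recorded in more than one population
# as a warning (to be verified by the data owner), never as an error.
check_multiple_pops <- function(capture_data) {
  flagged <- capture_data %>%
    dplyr::group_by(IndvID) %>%
    dplyr::summarise(
      N_pops = dplyr::n_distinct(CapturePopID),
      Pops   = paste(sort(unique(CapturePopID)), collapse = "-"),
      .groups = "drop"
    ) %>%
    dplyr::filter(N_pops > 1)

  list(Warning = nrow(flagged) > 0, Flagged = flagged)
}

result <- check_multiple_pops(Capture_data)
```

The Flagged table could then be passed to data owners for case-by-case confirmation.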

StefanVriend commented 3 years ago

I think selecting all records from Capture data for a certain CapturePopID and then filtering the associated IndvIDs from Individual data is a possible workaround.

Anyway, I was just verbalizing my thoughts there for a second. Because I've always interpreted Individual data as a table where each individual has a single entry, I have to find ways to deal (emotionally) with these travelling individuals. 😭

I like your suggestion for dealing with individuals appearing in multiple populations!

LiamDBailey commented 3 years ago

Ok, sorry I'm joining this a bit late!

So I think there are two parts of this issue we need to consider. 1) How do we deal with immigrating individuals in Individual_data? 2) Given that immigration can occur, how do we check for incorrect cases of 'immigration' (i.e. wrong band identified in a different population)?

1) Individual_data

So @ChloeRN is right that we discussed creating a separate record in Individual_data for each unique IndvID/PopID combination. This has the benefit of allowing us to easily identify individuals occurring in multiple populations but, as @StefanVriend said, may undermine the logical purpose of the Individual data table (i.e. one row per individual). Removing PopID, or the solution that @cwtyson and I discussed of using the first PopID, would both prevent effective subsetting, so they have to be discarded. Maybe there is a middle ground solution:

This solution will keep one row per individual in Individual_data, but still allow for effective subsetting. The only problem here is that a column that contains an open ended vector is more difficult to work with, especially once it's saved as a .csv.

2) Detecting errors

No matter how we answer 1) we will now always have the problem that immigration can occur and so it's impossible to say with certainty whether an individual seen with >1 PopID is an error. One solution could be:

Of course, 'possible' migration pathways are really just 'previously observed' migration pathways. This means we could be flagging things as errors that are still biologically possible. For example, you can clearly imagine migration between PEE and PEW (they are right next to each other!!), but that's never been observed AFAIK so would be considered 'impossible' 🤷.
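To make the idea of 'previously observed' pathways concrete, here is a small sketch. Everything in it is an assumption for illustration: the toy captures, the approved_pathways table, its From/To column names, and the choice to treat pathways as directed.

```r
library(dplyr)

# Toy Capture data; values are invented for illustration only.
Capture_data <- data.frame(
  IndvID       = c("A1", "A1", "B2", "C3", "C3"),
  CapturePopID = c("BOS", "PEE", "BOS", "PEE", "PEW"),
  CaptureDate  = as.Date(c("2015-05-01", "2016-05-10", "2015-05-03",
                           "2014-06-01", "2015-06-02")),
  stringsAsFactors = FALSE
)

# Hypothetical table of previously observed (approved) pathways.
approved_pathways <- data.frame(From = "BOS", To = "PEE",
                                stringsAsFactors = FALSE)

# Movements between consecutive captures of the same individual.
moves <- Capture_data %>%
  dplyr::arrange(IndvID, CaptureDate) %>%
  dplyr::group_by(IndvID) %>%
  dplyr::mutate(From = dplyr::lag(CapturePopID), To = CapturePopID) %>%
  dplyr::ungroup() %>%
  dplyr::filter(!is.na(From), From != To)

# Movements along pathways that are not (yet) on the approved list.
unapproved <- moves %>%
  dplyr::anti_join(approved_pathways, by = c("From", "To"))
```

In this toy example the PEE to PEW movement of C3 is flagged, even though it is biologically plausible, simply because it is not in the approved table.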

Thoughts?

StefanVriend commented 3 years ago

Thanks @LiamDBailey!

Here are my thoughts:

  1. I really like the idea of adding a BirthPopID column. It's clean and straightforward. I am not sure, though, whether the AllPopIDs column allows for more effective subsetting than the suggestion I made in my previous post. I might be missing something, but wouldn't something like IDs <- Capture_data %>% dplyr::filter(CapturePopID == "XXX") %>% dplyr::pull(IndvID), followed by Individual_data %>% dplyr::filter(IndvID %in% IDs), work to select individuals recorded in the population of interest ("XXX")? To still allow the detection of errors in this case, we could introduce a logical column MultiplePops (TRUE: more than one PopID; FALSE: one PopID) in Individual_data.

  2. I like your suggestion for the expansion of I2. I think it might be more difficult in practice, because we might not be able to assume that all IndvIDs are unique across populations. I guess that IDs that are ring numbers can safely be assumed to be unique across populations, but when data owners use a different system to assign IDs, it might be problematic. I'm not sure how often data owners use something other than ring numbers though. I think a data frame of 'possible' migration pathways works fine. Even if we then detect new migration pathways, I think it is fine, because we can verify this with the data owner and then update the pipeline or add the record to the approved list.
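For what it's worth, the subsetting idea from point 1 can be written out as a runnable sketch. The toy tables are invented, subset_by_pop() is a hypothetical helper name, and BirthPopID is used only because it is the column proposed above.

```r
library(dplyr)

# Toy data; values are invented for illustration only.
Capture_data <- data.frame(
  IndvID       = c("A1", "A1", "B2", "C3", "C3"),
  CapturePopID = c("BOS", "PEE", "BOS", "PEE", "BOS"),
  stringsAsFactors = FALSE
)
Individual_data <- data.frame(
  IndvID     = c("A1", "B2", "C3"),
  BirthPopID = c("BOS", "BOS", "PEE"),
  stringsAsFactors = FALSE
)

# Logical flag: TRUE if an individual was captured in more than one PopID.
pop_counts <- Capture_data %>%
  dplyr::group_by(IndvID) %>%
  dplyr::summarise(MultiplePops = dplyr::n_distinct(CapturePopID) > 1,
                   .groups = "drop")

Individual_data <- Individual_data %>%
  dplyr::left_join(pop_counts, by = "IndvID")

# Hypothetical helper: select individuals via their records in Capture data.
subset_by_pop <- function(individual_data, capture_data, pop) {
  ids <- capture_data %>%
    dplyr::filter(CapturePopID == pop) %>%
    dplyr::pull(IndvID) %>%
    unique()
  individual_data %>% dplyr::filter(IndvID %in% ids)
}

PEE_individuals <- subset_by_pop(Individual_data, Capture_data, "PEE")
```

This keeps one row per individual in Individual_data while still returning C3 and A1 for a PEE subset.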

ChloeRN commented 3 years ago

  1. I think it should be RingPopID and not BirthPopID since some birds are ringed as adults and may still migrate to another population later. I was thinking a bit along the same lines as Stefan to be honest, i.e. filtering a column with multiple populations in the same cells might be tedious.

The RingPopID column may be a good idea generally. But to be honest, I still think we should just keep multiple entries in individual data for individuals caught in several populations (and therefore also the PopID column). All workarounds seem to be suboptimal (i.e. the current suggestion 1 will still result in multiple entries of the same ID if - by chance - the same ring number was used on different birds in different populations), and I still don't think there really is a fundamental issue here. Maybe we can discuss during the next dev meeting.

StefanVriend commented 3 years ago

I think that multiple entries of the same ID are not an issue. Like you said, the same ID can be used for different birds in different populations. With a column like RingPopID (good name suggestion!), these individuals can be distinguished from one another. However, multiple entries of the same individual undermine the logical purpose of Individual_data to some degree.

But I agree, this is something we can discuss during the next meeting.

LiamDBailey commented 3 years ago

Good point @ChloeRN, RingPopID would be more accurate. Maybe I'm being a bit pedantic, but currently the 'Individual data' entry in the standard protocol describes it as:

information on individuals that is constant throughout their lifetime

If we follow this description then every unique individual should only ever have one row (i.e. one lifetime), and that row should not include information that could change over time (except in cases where we identify previous errors/mistakes in the data, e.g. Sex_calculated). In this case, my original AllPopsID column idea should be avoided because an individual could move to a new population at any point, thus changing the column. Perhaps the ideal would be to only include RingPopID (constant over the lifetime) and subset using Capture data, as @StefanVriend suggested above.

RE: Check I2. @StefanVriend, you're right that duplicates could occur between PopID if different data owners use the same ID system, but I think these still need to be flagged. Whether duplication is due to a true 'error' (i.e. a typo/misread) or duplicate ringing systems, it will still cause a major problem for anybody wanting to analyse the data at an individual level and they need to be made aware of it. So, I think we flag these duplicates no matter what. In cases where we know that two ID systems overlap we may need to adjust the pipelines so that IndvID is actually PopID_IndvID. It would also be worth asking the advisory council whether they think it is possible/likely for duplicate rings to happen. Maybe we are worrying about this for no reason!

StefanVriend commented 3 years ago

Good points, @LiamDBailey.

I had a quick glance at possible duplicated IndvID in the current list of finished pipelines. Across all pipelines, there are 8730 IndvID (i.e., 0.87% of all individuals in the pipelines), which occur in 40 combinations of different PopID. For some combinations of PopID it is only one IndvID that is duplicated, for others there are several hundred or thousand duplicated IDs. So perhaps it occurs more often than we had hoped for!

Here's a code snippet, where all_pipelines is the output of run_pipelines():

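# Count IndvIDs that occur more than once, per combination of PopIDs
dup_ind <- all_pipelines$Individual_data %>%
  dplyr::group_by(IndvID) %>%
  dplyr::filter(dplyr::n() > 1) %>%  # keep IDs with more than one row
  dplyr::summarise(Pops = paste(PopID, collapse = "-")) %>%
  dplyr::count(Pops)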

> sum(dup_ind$n)
[1] 8730

> nrow(dup_ind)
[1] 40

Edit: 2 out of 40 are duplicates within pipelines (3 IndvID in HOC and 13 in SSQ).

LiamDBailey commented 3 years ago

@StefanVriend good to know, so it is a problem we'll need to deal with. Is creating a PopID_IndvID column the best solution? Or are there other ways we can think of to get around this?

StefanVriend commented 3 years ago

Like you said earlier, PopID_IndvID works well when we have two monitoring schemes with similar ID systems, but how does that affect the cases where we actually have the same individual recorded in multiple populations? For instance, in the above subset, there were 63 instances of duplicates between Peerdsbos and Peerdsbos West, which are quite possibly referring to the same individuals.

It's also good to be aware that the above subset did not include duplicate IndvID within the same pipeline, because as we mentioned earlier in this conversation, they are often dealt with within the pipeline code.

EDIT: We could convert IndvID in all pipelines to PopID_IndvID. Then in the expanded version of check I2, we flag records where IndvID without the PopID-prefix are duplicated, and ask data owners to verify these records. Adding them to the approved-list will then prevent them from being flagged in future checks.
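A small sketch of what this first edit could look like in practice. The toy Individual data, the RawIndvID helper column, and the one-column approved_list table are all assumptions for illustration, not existing objects in the pipelines.

```r
library(dplyr)

# Toy Individual data; values are invented for illustration only.
Individual_data <- data.frame(
  PopID  = c("BOS", "PEE", "BOS", "PEW", "PEE"),
  IndvID = c("A1", "A1", "B2", "C3", "C3"),
  stringsAsFactors = FALSE
)

# Hypothetical approved-list: duplicates already verified by the data owner.
approved_list <- data.frame(IndvID = "A1", stringsAsFactors = FALSE)

# Keep the original ID in a helper column, then add the PopID prefix.
Individual_prefixed <- Individual_data %>%
  dplyr::mutate(RawIndvID = IndvID,
                IndvID    = paste(PopID, RawIndvID, sep = "_"))

# Flag records whose unprefixed ID occurs in more than one population,
# skipping IDs that are on the approved-list.
flagged <- Individual_prefixed %>%
  dplyr::group_by(RawIndvID) %>%
  dplyr::filter(dplyr::n_distinct(PopID) > 1) %>%
  dplyr::ungroup() %>%
  dplyr::anti_join(approved_list, by = c("RawIndvID" = "IndvID"))
```

In this toy example only the two C3 records are flagged; A1 is duplicated too, but it has already been approved.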

EDIT2: Isn't the PopID-prefix unnecessary if we add the column RingPopID?

LiamDBailey commented 3 years ago

Final decision after today's dev meeting: