AAFC-BICoE / dina-planning

AAFC-DINA planning repository

Collecting Event: Duplicates and future use #285

Open michellelocke opened 2 years ago

michellelocke commented 2 years ago

Ex: Events:

These two collecting events are identical. Specimens are from the same collecting event but were added independently as Material Samples, mimicking how we typically input legacy data.

Legacy specimens from a single collecting event will be spread out throughout the collection. Each specimen will be entered at a unique date and time. As we do mass data entry, it is too difficult and time-consuming to ask digitizers to look up each collecting event to see if it already exists. It is also very easy for digitizers to make a mistake in choosing a Collecting Event, especially if there are many dates for a single location.

This means that each time a Collecting Event is entered through a Material Sample, it is entered as a unique event. When we transfer the CNC DB to DINA, presumably each record will get a unique Collecting Event number when it is created (unless the system is smart enough to handle the comparatively small number of specimens that already have a Collection Event Number in our system). As the CNC currently sits at 2.5 million records, there will already be 2.5 million Collecting Events in the Collecting Event system at the time we start using DINA.

This does not seem like a practical approach, and I fear that with so many entries we are going to struggle to use Collecting Event in a meaningful way.

There is no comprehensive search for Collecting Event. The Collecting Event list/search results do not show any meaningful data (except date). When you look at a Collecting Event record, it is not easy to see what the data of the event are because the user is overwhelmed with fields (possibly many of them blank).

This all makes Collecting Event difficult to use and I fear it will only be harder to use with more records.

dshorthouse commented 2 years ago

There are three parts to this as I see it:

  1. How do we prevent duplicates on entry? (search)
  2. How do we prevent duplicates during migration? (models are different)
  3. If duplicates are created, how do we resolve them? (merge, see #220)

If there are humanly obvious duplicates in the source that require resolution prior to or during migration (2 above) then we'll need consistent, non-overlapping signals that permit the correct application of rules that cluster items like collecting events. Otherwise yes, those very same duplicates will exist as duplicates in DINA. To a human they are obvious duplicates but to DINA they are different. The ideal solution is to deduplicate in the CNC database because it will be in operational use throughout phases of migration. Interim edits to records during iterative migration cycles could work against whatever in-memory logic once correctly clustered data objects like collecting events & it all unravels into spaghetti. The only magic we can use is whatever we, as migrators and data committee members, can put into the hat, & we both need to decide what is an acceptable rabbit.
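As a rough illustration of what rule-based clustering on consistent signals could look like, here is a minimal Python sketch. The field names (`locality`, `start_date`, `collectors`) and the normalization rules are assumptions made for the example; they are not the DINA model or the CNC schema, and real rules would need to be agreed on by migrators and the data committee.

```python
from collections import defaultdict
import re


def normalize(value):
    """Lowercase, trim, and collapse whitespace and simple punctuation."""
    return re.sub(r"[\s.,;]+", " ", (value or "").lower()).strip()


def event_key(record):
    """Build a clustering key from hypothetical fields (not a real schema)."""
    return (
        normalize(record.get("locality")),
        normalize(record.get("start_date")),
        normalize(record.get("collectors")),
    )


def cluster_events(records):
    """Group record ids that share the same normalized key."""
    clusters = defaultdict(list)
    for record in records:
        clusters[event_key(record)].append(record["id"])
    return dict(clusters)


# Made-up sample records: 1 and 2 differ only in punctuation and case.
records = [
    {"id": 1, "locality": "Ottawa, Ont.", "start_date": "1985-06-12", "collectors": "J. Smith"},
    {"id": 2, "locality": "ottawa  ont", "start_date": "1985-06-12", "collectors": "J Smith"},
    {"id": 3, "locality": "Ottawa, Ont.", "start_date": "1985-06-13", "collectors": "J. Smith"},
]

clusters = cluster_events(records)
duplicate_groups = [ids for ids in clusters.values() if len(ids) > 1]
print(duplicate_groups)
```

Because the key is recomputed from the current field values, an interim edit to any of those fields in the source would move a record to a different cluster on the next migration cycle, which is the unravelling risk described above.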

If we do in fact wish to have 2.5 million Collecting Events because there are no duplicates in the CNC db then ignore the above. This is indeed a search & filter problem plus a UI problem. I suggest a new ticket to specify how search against existing Collecting Events should work & what fields should be used singly or in combination (and how). These might not be limited to field-level values depending on your scenarios but could include more nuanced inferences, a slippy map to draw a rough query polygon, a date range slider, other?
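To make the combined-filter idea concrete, here is a minimal Python sketch that filters events by a date range and a rough query polygon. The record fields (`date`, `lon`, `lat`) and the function names are hypothetical, and a production search would presumably run in a search index or spatial database rather than in application code.

```python
from datetime import date


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def search_events(events, start, end, polygon):
    """Keep events dated within [start, end] that fall inside the polygon."""
    return [
        e for e in events
        if start <= e["date"] <= end
        and point_in_polygon(e["lon"], e["lat"], polygon)
    ]


# Made-up sample data: a rough rectangle drawn on a map, and three events.
query_polygon = [(-76.0, 45.0), (-75.0, 45.0), (-75.0, 46.0), (-76.0, 46.0)]
events = [
    {"id": 1, "date": date(1985, 6, 12), "lon": -75.7, "lat": 45.4},
    {"id": 2, "date": date(1990, 1, 1), "lon": -75.7, "lat": 45.4},
    {"id": 3, "date": date(1985, 6, 12), "lon": -79.4, "lat": 43.7},
]

hits = search_events(events, date(1985, 1, 1), date(1985, 12, 31), query_polygon)
print([e["id"] for e in hits])
```

Only the first event matches: the second falls outside the date range and the third falls outside the polygon.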