AAFC-BICoE / dina-planning

AAFC-DINA planning repository

request for bulk data upload update/demo #222

Open heathercole opened 3 years ago

heathercole commented 3 years ago

Would someone please provide an update/demo of how we will be able to bulk-upload records into the management system (e.g. from an Excel spreadsheet)?

(Or a timeline of when the update/demo could be provided.)

I'm particularly wondering how new geography/taxonomy will be imported.

thank you!

cgendreau commented 3 years ago

This is not ready at the moment, but it will be available in a minimalist form. Complex structures should be transferred via proper data migration; bulk upload/import is limited to simple cases.

heathercole commented 3 years ago

Thanks for the update. Can you comment on whether upload from a spreadsheet would be considered basic vs. complex?

Even if all current data is migrated, there will still be future requirements for bulk-uploads (e.g. when an external party provides data capture from AAFC specimens).

dshorthouse commented 3 years ago

Cautionary note here re: data migration with respect to linkages to OpenStreetMap and Catalogue of Life. These functions are presently front-end code in DINA, requiring a human in the loop.

cgendreau commented 3 years ago

Simple in the sense that it can't have relationships. It will mostly fill the unstructured fields, unless a mapping can be done without any ambiguity. For example, an eventDate provided in a structured way (ISO compliant) can be imported, but not an agent, since we have no way to uniquely identify an agent from a spreadsheet.
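As an illustration only (a minimal sketch, not the actual import code), this is the kind of check that makes a structured value safe to map while free text is not: an ISO 8601 eventDate parses without ambiguity, whereas anything else would be kept as verbatim text.

```python
# Minimal sketch: an ISO 8601 eventDate can be parsed unambiguously,
# but a free-text value (or an agent name) cannot be resolved the same way.
from datetime import date
from typing import Optional

def parse_event_date(value: str) -> Optional[date]:
    """Return a date only if the value is ISO 8601 (YYYY-MM-DD)."""
    try:
        return date.fromisoformat(value.strip())
    except ValueError:
        return None  # ambiguous: keep the raw text in an unstructured/verbatim field

print(parse_event_date("2021-06-03"))    # structured, safe to map
print(parse_event_date("June 3rd, 21"))  # None: left as verbatim text
```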

heathercole commented 3 years ago

As long as there are tools which support specimen data being bulk-imported in some way, that should work.

The requirement is for CMs to be able to bulk-import specimen data, including collectors/determiners/taxonomic identifications/location information/etc., into the data management system in an effective way.

This will apply to legacy datasets, as well as new datasets provided from internal and external parties.

heathercole commented 3 years ago

@cgendreau can you expand a bit more, if we have 'people' (eg. collectors, determiners, catalogers) included in specimen records in spreadsheets, how will a bulk-import work?

@dshorthouse can you expand a bit more about bulk-imports needing a "human in the loop" for location and taxonomy data?

All of those data would typically be included in a dataset or spreadsheet that could contain hundreds or thousands of records, which is why the bulk-import requirement is a must-have.

Thanks!

dshorthouse commented 3 years ago

@heathercole re: "human-in-the-loop". Choosing a match to an item in an externally referenced resource, as is the case with geography and nomenclature, is front-end code. That same provision of choice needs to be rolled into bulk import, whereby the user chooses when there is ambiguity.

Addendum: Because DINA is API-driven, a data manager can additionally write scripts to execute find_or_create-like methods against rows in a spreadsheet, ensuring referential integrity among columns and that no duplicate items are arbitrarily created.
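To make the addendum concrete, here is a hypothetical sketch of such a find_or_create-style script; the base URL, resource name, filter parameter, and auth are placeholders, not the actual DINA API.

```python
# Hypothetical find_or_create helper run against rows of a spreadsheet.
# Endpoint, resource type, and filter names are placeholders, not DINA's real API.
import requests

BASE = "https://example.org/api/v1"  # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}

def find_or_create_person(display_name):
    """Return the id of an existing person with this name, or create one."""
    resp = requests.get(f"{BASE}/person",
                        params={"filter[displayName]": display_name},
                        headers=HEADERS)
    resp.raise_for_status()
    found = resp.json()["data"]
    if found:
        return found[0]["id"]  # reuse the existing record: no arbitrary duplicates
    payload = {"data": {"type": "person",
                        "attributes": {"displayName": display_name}}}
    created = requests.post(f"{BASE}/person", json=payload, headers=HEADERS)
    created.raise_for_status()
    return created.json()["data"]["id"]
```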

heathercole commented 3 years ago

Thanks for expanding. Depending on how that is implemented, it may not be feasible to require an importer to 'choose' for each record, or even for each unique agent; it could be a tremendous burden compared to how similar imports are managed in our current systems. It would be great to see a demo of this to better understand how the requirement will be implemented. If the scripts you describe are stand-alone tools Collection Managers can re-use as needed, or are incorporated into DINA, that will be great, as we need to be able to do these kinds of bulk-imports without relying on any 3rd parties outside the collection's group.

dshorthouse commented 3 years ago

> Thanks for expanding, depending on how that is implemented, it may not be feasible to require an importer to 'choose' for each record, or even each unique agent.

Yes, and there would be consequences. This is why other systems like EMu have an explosion of ambiguous entries in their Parties module (~ our Agent/Person): the system may opaquely create new entries instead of attempting to disambiguate on entry or prompting the user to make a selection ("Did you mean this John Smith or that John Smith?"). So bulk import may in fact require a multistep process in some instances where entities are expected to be properly linked on entry (e.g. import collecting event details first, gather their internal identifiers, then import material samples and reuse those previously acquired internal identifiers for their collecting events, so that linkages are assured without creating a heap of duplicate collecting events).
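A hypothetical sketch of that multistep pattern (placeholder endpoints, resource types, and column names, not the actual DINA API): collecting events are created once per unique event, their internal identifiers are kept, and material samples then reuse those identifiers so no duplicate events are created.

```python
# Hypothetical two-step bulk insert: collecting events first, then material
# samples that reuse the event identifiers. Endpoints/fields are placeholders.
import csv
import requests

BASE = "https://example.org/api/v1"  # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}

def create(resource, attributes, relationships=None):
    data = {"type": resource, "attributes": attributes}
    if relationships:
        data["relationships"] = relationships
    resp = requests.post(f"{BASE}/{resource}", json={"data": data}, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["data"]["id"]

with open("specimens.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Step 1: one collecting event per unique (locality, date), so events are not duplicated.
event_ids = {}
for row in rows:
    key = (row["locality"], row["eventDate"])
    if key not in event_ids:
        event_ids[key] = create("collecting-event",
                                {"verbatimLocality": row["locality"],
                                 "startEventDateTime": row["eventDate"]})

# Step 2: material samples reuse the internal identifiers gathered in step 1.
for row in rows:
    key = (row["locality"], row["eventDate"])
    create("material-sample",
           {"materialSampleName": row["catalogNumber"]},
           {"collectingEvent": {"data": {"type": "collecting-event",
                                         "id": event_ids[key]}}})
```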

dshorthouse commented 3 years ago

As this ticket was a request for a demo, I suggest it be specifically scoped to bulk update of material samples already present in the system. Bulk insert across the gamut of all objects and relationships is an entirely different beast, with its own constraints and challenges that require very specific use cases. The massive range of these bulk insert use cases might be better accommodated through scripts written and maintained not by 3rd parties but by data managers and data scientists in the collection's group.