ArctosDB / documentation-wiki

Arctos Documentation and How-To Guides
https://handbook.arctosdb.org
GNU General Public License v3.0

Handling legacy parasites #178

Open Jegelewicz opened 4 years ago

Jegelewicz commented 4 years ago

Hi all,

This question is probably best pitched to Mariel, but I figured I would seek the wisdom of the AWG while I'm asking. We've got a pile of legacy parasite material that is currently handled as parts of mammals in NMU:Mamm, and we want to bring it into the NMU:Para database. Before we get neck deep in this, I wonder what the group recommends as the best strategy. Do we remove these "parts" from the mammal records entirely and re-accession them in NMU:Para with relationships defined back to the host record? Does it make sense to leave the parasite parts on the host records if we are putting these into NMU:Para? Do folks have recommendations for how to proceed most efficiently? Everything needs to be re-barcoded and organized into new boxes, so vial handling is going to happen regardless, if that makes things easier. Any recommendations from the pros would be much appreciated!

Thanks!

Kurt

Jegelewicz commented 4 years ago

Hi Kurt, we are still working out the details at MSB. We leave the mammal parts in place and then catalog the parasites as parasite catalog records. Create a loan to transfer the parts to your parasite division. I recommend rebarcoding on transfer, but recording the parent barcode in parasite part remarks. If your parts are unsorted, I can show you how to split them.

Mariel

Make sure you use the same collecting event ID to catalog the parasite records. Also create the relationship to the host catnum as OTHER ID with relationship "parasite of" during bulkloading. Or I can show you how to use single record data entry to do this using the Pull function.

Jegelewicz commented 4 years ago

We have the same issue in that our bird parasites are cataloged as bird parts, and may do the same thing.

Mariel - Is there documentation for the pull function? I saw you do it quickly once, but it would be good to add to the data entry 'how to' and/or do a short video tutorial.

Thanks.

Carla

Jegelewicz commented 4 years ago

I want to say that leaving the host parts is "wrong," or at least potentially-confusing, and it probably is, but I'm not sure any evil can extend beyond there. I probably wouldn't do it that way, but if you have some reason to do so I don't think anyone will whine about it too much either.

The relationships - and those really need to use GUIDs or an equally-stable and equally-attached identifier - are the important part. Do that properly, and we can deal with strangeness in everything else.

I can't imagine a scenario where re-barcoding could be necessary, and I can think of lots of ways it'd be confusing. If you can elaborate on that I might be able to suggest some tools or approaches that will make this easier.

If those tools don't exist (and the rebarcoding thing doesn't turn out to be fatal!) then please talk to us before pushing some button a million times - we should be able to generate a pretty good "seed" parasite from a host part and perhaps an identification. ("A parasite, obviously!" is good enough if necessary....). I think it's extremely common to "catalog" (perhaps via misspelled remarks) parasites with their hosts. They're essentially impossible to find there, so not useful, while they seem to get a lot of use in parasite collections. I'd certainly advocate for prioritizing tools to make that happen.

Sharing a collecting_event_id isn't necessary, but it definitely makes things both easier to manage (changing the parasite changes the host and vice-versa) and easier to understand ("remote parasitism" is probably pretty easy to find in our data if you want to go looking!). Highly recommended if at all possible, even if it's not the accepted determination.
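As a rough illustration of how one might look for host/parasite pairs that do not share a collecting event, here is a minimal sketch. It assumes the records have already been exported into plain dictionaries; the field names (`guid`, `host_guid`, `collecting_event_id`), the helper `mismatched_events`, and the example GUIDs are hypothetical and not the Arctos schema.

```python
def mismatched_events(parasites, hosts_by_guid):
    """Return (parasite_guid, host_guid) pairs whose collecting events differ."""
    mismatches = []
    for parasite in parasites:
        host = hosts_by_guid.get(parasite["host_guid"])
        if host is None:
            continue  # host is not in this data set; nothing to compare
        if parasite["collecting_event_id"] != host["collecting_event_id"]:
            mismatches.append((parasite["guid"], parasite["host_guid"]))
    return mismatches


# Made-up records for illustration only.
hosts_by_guid = {
    "NMU:Mamm:1": {"collecting_event_id": "100"},
    "NMU:Mamm:2": {"collecting_event_id": "200"},
}
parasites = [
    {"guid": "NMU:Para:1", "host_guid": "NMU:Mamm:1", "collecting_event_id": "100"},
    {"guid": "NMU:Para:2", "host_guid": "NMU:Mamm:2", "collecting_event_id": "999"},
]
print(mismatched_events(parasites, hosts_by_guid))
```

Any pair this reports is a candidate for review, not necessarily an error.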

It seems this would ideally result in two levels of documentation (or well-explained requests for documentation) - a high-level "so you want to catalog some parasites" sort of thing to help understand what tools are available, and then individual "how to catalog a parasite using DataEntry-->CloneWithBarcode" type docs.

Dusty

Jegelewicz commented 4 years ago

I agree with everything Dusty says above. It goes along with his constant admonishment to "catalog the item of interest". Mixing up taxa in a record seems like a very confusing thing to do, but as he says, if it makes you happy, it probably isn't the worst possible thing. But it does mean twice the work, because you could just catalog them separately from the get-go (even with legacy stuff - after all, you are entering the data in a new place, and there is nothing that says you have to re-create the issues in your previous data and then clean them up after they have been entered in your new system). All it takes is copying the mammal record, changing the taxonomy and part, and adding the otherIDs. No need to create something, make a loan, re-barcode stuff, sit for however long pulling mammal records to create parasite records, etc.

And with regard to pulling - my suggestion is to create a search that gets all your mammal records that include parasites, download, make the changes needed to transform the mammal records into parasite records, and bulkload the parasites.
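The download-transform-bulkload idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the column names (`guid`, `collecting_event_id`, `part_name`, `part_barcode`, `other_id_relationship`, and so on), the placeholder taxon, and the helpers `parasite_rows` and `to_csv` are made up for the example and are not the actual Arctos download or bulkloader templates, so they would need to be mapped to the real field names.

```python
import csv
import io


def parasite_rows(host_rows, target_collection="NMU:Para", placeholder_taxon="Animalia"):
    """Yield one draft parasite row per host row that carries a parasite part.

    All field names here are hypothetical stand-ins for the real templates.
    """
    for host in host_rows:
        part = host.get("part_name", "")
        if "parasite" not in part.lower():
            continue  # skip host records without a parasite part
        yield {
            "guid_prefix": target_collection,
            # Reuse the host's collecting event so both records stay in sync.
            "collecting_event_id": host["collecting_event_id"],
            # Placeholder identification until the parasite is determined.
            "taxon_name": host.get("parasite_id") or placeholder_taxon,
            "part_name": part,
            # Link back to the host by GUID rather than a bare catalog number.
            "other_id_relationship": "parasite of",
            "other_id_value": host["guid"],
            # Keep the original host part barcode discoverable after rebarcoding.
            "part_remarks": f"parent barcode: {host.get('part_barcode', '')}",
        }


def to_csv(rows):
    """Serialize draft rows to CSV text; returns an empty string for no rows."""
    rows = list(rows)
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up host records for illustration only.
hosts = [
    {"guid": "NMU:Mamm:1", "collecting_event_id": "100",
     "part_name": "endoparasite (ethanol)", "part_barcode": "A1"},
    {"guid": "NMU:Mamm:2", "collecting_event_id": "101", "part_name": "skull"},
]
print(to_csv(parasite_rows(hosts)))
```

The draft file would still need review before loading, particularly the identification and part name columns.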

dustymc commented 4 years ago

create a search that gets all your mammal records that include parasites, download, make the changes needed to transform the mammal records into parasite records, and bulkload the parasites.

Combining this with leaving the parts in the 'host' record seems like a really great recipe for ending up with both a bunch of duplicates, and (probably because you were trying to avoid more duplicates) a bunch of parasites still "cataloged" only as "& soem wurms" in some remarks field.

Denormalization is just about always bad....

campmlc commented 4 years ago

I disagree very strongly about deleting parasite data from host records under any circumstances whatsoever. To capture both host and parasite data, we are dealing with a logistically complex situation that requires cross-divisional and, in some cases, cross-institutional coordination over multiple time lags involving different staff, divisional and institutional policies, workflows, and curator preferences. The potential for error and data divergence is huge. Having dealt with this over the last 9 years, I can attest that this is the most complex data management situation I personally have faced, and despite concerted efforts it is still not resolved.

At the MSB, we are dealing with one of the largest legacy collections and the highest volume of new collection for mammals/mammalian parasites in the world right now, and while it is easy to make pronouncements about how this all should theoretically be done, the reality is that we do not have the staff or resources or workflows designated to support this. It is all ad hoc. I would love to bring curators and staff and field researchers from all the Arctos institutions that manage parasites from field to archive in the same room for a week, along with programming and interface support - perhaps then we might make some headway. In the meantime, the steps I outlined are what works with the least amount of data loss given the staffing and curatorial resources we have available. Some observations:

1) The collection of the parasites and hosts occurs simultaneously, typically from funding supporting host collection, by the host collection, in collaboration with parasitologists but not always with a parasite collection. This means the host institution, curator, and collections staff are the primary source of data, field notes, georeferencing, etc., as well as the source of staffing for managing these data and ensuring the continuity of data between parasites and host. These staff have their own workflows and data permissions that do not typically include shared access with the parasite collection. The initial cataloging is therefore as host parts.

2) For this reason, parasites and hosts should, whenever possible, share the same collecting event ID if both collections are in Arctos. This does not always happen, but there is no need for these collecting events to diverge. We have the tools. They diverge only because of differences in data management and cataloging practices and workflow timing issues between institutions/divisions. Sharing an ID provides yet another means of finding all parasites collected with the same host(s) from the same locality, not a trivial matter (see below).

3) Before parasites are transferred from the host collection to the parasite collection by transfer of custody or loan, they are frequently loaned out by the host curator to collaborators for identification and/or research. During this process, specimen lots are frequently split and re-barcoded, a logistical nightmare given our current tools that can lead to significant error and data loss. Only by keeping the data linkage to the original host part lot barcode can this be avoided - we need better tools to track the parent/child part relationship through multiple chains. Also, since the funding for the collection of these samples came from the host collection institution and associated researchers, the host curator needs to be able to approve these loans and get credit for resulting publications. This can only happen if the samples are still host parts, not parasite catalog items. When the now-identified parasites are returned, they are transferred to the parasite division and re-cataloged. Again, we need to maintain the data linkage from these child parts, now re-loaned and re-cataloged as parasite catnums, back to the original host record for data verification. Any time this parent/child linkage is lost, given the complexity of the data chain from field to this point, there has been and will continue to be data loss.

4) We do not currently have adequate tools to make parasite/host relationships searchable and manageable. We cannot search on related items/related item ID without crashing Arctos. I have raised this issue repeatedly. Perhaps someday there might be a solution for this, but it is currently non-functional. This is unfortunate, because this search is the most powerful tool we or anyone else in the community can offer for making host/parasite relationships discoverable across taxonomic, geographic, and temporal boundaries.

5) Instead, parasites cataloged as parts of hosts can be searched via part search, but this is an excruciating process involving searching every possible part name of every class of intestinal parasite (not to mention preservation type), and even worse, in some cases critical information as to the parasite ID or disposition is in part remarks. To access this information, including the barcode and remarks, the only option is to download part detail as JSON, which cannot be unconcatenated into a usable and coherent format if more than one record at a time is downloaded - none of the columns align. Again, another repeated issue.

6) Finally, some legacy parasite data is only found in the host record under attributes and attribute remarks (examined for endo/ectoparasites, parasites found, etc.), or in specimen remarks, or not at all because of previous and current limitations on the number of parts and attributes that can be bulkloaded (10), which meant that, when host collections prioritize their own attributes and parts first, parasite data get left off and lost. Tools to add extra parts and attributes have helped somewhat, but these are not user friendly or well documented and add an extra burden to host collections staff to manage and load.
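The parent/child tracking asked for in observation 3 can be pictured with a small sketch. It assumes each re-barcoding step records the barcode the new vial was split from (for example in part remarks, as suggested earlier in the thread) and that those links have been gathered into a simple mapping. The function `trace_to_root`, the mapping `parent_of`, and the barcodes are hypothetical and do not describe an existing Arctos tool.

```python
def trace_to_root(barcode, parent_of):
    """Follow child -> parent barcode links back to the original host part lot.

    parent_of maps each re-barcoded child to the barcode it was split from.
    Returns the chain from the given barcode to the root, inclusive.
    """
    chain = [barcode]
    seen = {barcode}
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in seen:
            # A loop means the recorded links are inconsistent.
            raise ValueError(f"cycle in barcode lineage at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain


# Made-up barcodes: P-0001 is the original host part lot, split twice.
parent_of = {"P-0002": "P-0001", "P-0003": "P-0002", "P-0004": "P-0001"}
print(trace_to_root("P-0003", parent_of))  # ['P-0003', 'P-0002', 'P-0001']
```

The last element of the returned chain is the original host part lot, which is the link that has to survive every split and loan.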

So yes, cataloging parasites and hosts immediately from point of field collection would be ideal. Let's get Arctos Air back up and running. Let's add funding to field research for dedicated field parasitologists who are also paid to manage the same specimens once they are brought into the museum. As long as field work and museum curation remain underfunded and lacking in paid dedicated trained staff and in-house data managers to handle these complex scenarios, this will continue to be a very challenging situation that deserves more of our time and attention to develop new strategies and clean up the legacy.


dustymc commented 4 years ago

I would love to bring curators and staff and field researchers from all the Arctos institutions that manage parasites from field to archive in the same room for a week, along with programming and interface support - perhaps then we might make some headway.

I think this would be very useful, but I wouldn't preemptively limit participation. Useful solutions very often come from across some "traditional" boundary. For example, "specimen lots are frequently split and re-barcoded, a logistical nightmare" sounds like just another Tuesday in an insect collection - my semi-vague understanding is that they do this as a matter of practice and it creates better data for less work when planned for.

We do not currently have adequate tools

That landscape has changed in the last 48 hours (or year, or hopefully soon will, depending on how you want to look at it).

Arctos Air

That's essentially just antique software at this point, but the idea of a "portable" data entry option remains viable. https://github.com/ArctosDB/arctos/issues/2562 and https://github.com/ArctosDB/arctos/issues/2178 would make it more fun.

Jegelewicz commented 1 year ago

I don't know what we are supposed to How To here - it seems there is no consensus.