Basis of record appears as UNKNOWN supplied basis "Genomic DNA"

AtlasOfLivingAustralia / la-pipelines

Living Atlas Pipelines extensions

3 stars 4 forks source link

Basis of record appears as UNKNOWN supplied basis "Genomic DNA" #397

Closed javier-molina closed 3 years ago

javier-molina commented 3 years ago

This comes from point 9 on #309

From record view

Basis of record appears as UNKNOWN supplied basis "Genomic DNA"

This is a result of GBIF Pipelines not using a basis of record vocabulary as extended as the one in Biocache Store. Whether all the entries should be there it is still to be decided but an initial suggestion is here gbif/gbif-api#84. ˜˜It is not anticipated this is going to be addressed for our v1 release.˜˜

DM Team still have to decide whether DNA collections such as Barcode of Life need to be updated to a different basis of record currently defined in dwc and use additional fields to convey it is a DNA record.

timrobertson100 commented 3 years ago

CC @tucotuco for info (relates to our ongoing BOR discussions in Darwin Core)

elywallis commented 3 years ago

After talking with @javier-molina this morning, we agreed that I would describe the user perspective here. We might not be able to adequately address this prior to release but I just want to advocate that when we make decisions on data logic alone, there are consequences for users.

Problem:

ALA has allowed several non-standard terms to be used in BasisOfRecord. Relevant to this discussion are the terms "Genomic DNA" and "environmental DNA"
these terms don't map to a value used by GBIF. Initial suggestion was to map them to UNKNOWN but that was revised yesterday to BasisOfRecord = MaterialSample (a decision that I agree with)
BasisOfRecord = MaterialSample is already used by the collections community to indicate records that are tissue samples (e.g. heart, liver, lung) held in collections/biobanks and available for DNA sequencing.
One of the data pre-filters uses the value environmentalDNA as a default filter to exclude eDNA records (plus fossil records), This pre-filter will break when the change is made (just noting that)
users (particularly from the collections community) have requested these pre-filters because (depending on how they're loaded) eDNA records threaten to swamp all the other records in the ALA and there are issues around validity of the identifications.

Problem extent:

BoR = genomic DNA = 154,479 records across 2 data resources
BoR = environmental DNA = 1,175,825 records across 6 data resources
BoR = MaterialSample = 211,905 records across 5 data resources

User perspective

Collections staff often receive requests from researchers enquiring about tissue samples that they can use for DNA sequencing/genomics studies
Currently the 3 collections providing tissue data can simply point enquirers to ALA, tell them to facet on BasisOfRecord = Material Sample and they can see tissues. A second field (Preparations) is used to indicate what type of tissue the record pertains to - e.g. heart, lung, muscle, toe clip etc.
Preparations is a free text soup, so is not available in the facet options, nor can it be searched on in the Advanced Search. So there's a pre-existing issue that once a researcher has found all the tissue data, they can't easily further refine their query to give them, say, just the "heart" tissue. (Like I say, known pre-existing issue - but may never been documented)
Once the Pipelines change is made to map BoR = genomic DNA and BoR = environmental DNA, an additional 1.33M records will show up when a user filters on BoR = MaterialSample. There is no additional field (like Preparations) that can be used to further filter the data to exclude eDNA and genomic DNA records.

Who is most likely to notice the change to the BasisOfRecord mapping and be inconvenienced by it?

There are 3 institutions who have put tissue data into ALA - Museums Victoria, South Australian Museum, CSIRO Australian National Wildlife Collection.
One data resource provides the vast bulk of the current eDNA records - Biome of Australian Soil Environments (1,170 of the 1,175M records)

Short term suggestions:

Given that there is unlikely to be a good user-focused data solution to this problem prior to Pipelines going live, ensure that the data providers most affected by this change (the 4 above) receive some specific communications to notify and prepare them
Suggest that the collections let users know to filter on both BasisOfRecord = MaterialSample plus Institution OR Data resource to narrow down the options.
The data providers/data resources supplying MaterialSample, environmentalDNA and genomicDNA are currently mutually exclusive. Perhaps we can set up some custom filters in the interim based on DataResource that will still allow the different groups to be viewed separately?
I suggest adding "environmental DNA" (or eDNA) and "genomic DNA" to the Preparations field for relevant records (yes, I know that means reprocessing 1.33M records). That way, if a user downloads all the records for BoR = material sample, they can at least do some useful analysis based on the Preparations field (i.e. exclude records once they're downloaded)

Acknowledging longer term discussions

TDWG and the DwC maintenance group are already discussing the fate of BasisOfRecord exhaustively
GBIF are also thinking creatively about how groups of records might be tagged for easy inclusion/exclusion from queries.

There are likely to be longer term changes as a result of these discussions but right now I think we need to do something in the short term to prepare data providers for this change which will have the effect of making their data harder to find, not easier.

m-hope commented 3 years ago

Sorry, I've been trying to get to this all day...

I think Ely has covered most of the concerns... the biggest one being it is vital that users have a simple mechanism to be able to separate the multiple different record types that are now being lumped under MaterialSample. I reinforce that this is a backward step (in terms of data quality if nothing else) as they are all completely different types of data with different uses (I am happy to go into this in more detail, but for the sake of time will keep it short).

We must retain (even if it is in a field that is not DwC) that a record is eDNA or genomic DNA so that in the future, when GBIF/TDWG/whoever gets their act together to work out a solution for the new data types out there now that weren't around when the BoR "standard" was put in place, we can put it in place without having to reverse engineer the info.

While I don't think this is a long term solution, I have to admit that I like the proposal by Tim in this GBIF issue (https://github.com/gbif/registry/issues/247) of using a "category" field as it is at least a practical approach and one that perhaps we should adopt sooner rather than later.

timrobertson100 commented 3 years ago

Thanks all for acknowledging that it's not as easy as it seems given the desire to follow the DwC standard.

I suggest adding "environmental DNA" (or eDNA) and "genomic DNA" to the Preparations field for relevant records (yes, I know that means reprocessing 1.33M records)

Is this wise? preparations captures information on the processes and preservations applied to specimens, and I suspect we aim to clean up content in that area with recommended vocabularies. For eDNA related studies I'd expect information about things like chemical treatments of water/soil samples etc rather than simply "eDNA". It looks slightly to just be moving the problem from basisOfRecord into another term.

If gbif/registry#247 is considered a sensible way to tackle this problem it could be implemented within weeks (in GBIF and ALA collectory).

In the meantime, would it perhaps be better for ALA to simply continue to use those two values in the BOR field for all the user-facing reasons already explained?

From an implementation point of view, I'd assume it would either be a patched enum or an ALABasisOfRecord extends BasisOfRecord, and an additional map() transform overwriting the common parser values.

While I don't think this is a long term solution, I have to admit that I like the proposal by Tim in this GBIF issue (gbif/registry#247) of using a "category" field

It may just be a hunch/intuition but if you have reasoning could you elaborate on why it's not long term on the issue please @m-hope? We might be overlooking something that you see. The intention was to decouple the data model (i.e. classes of concept) from the way many people want to search for data (the nature of the dataset/community from which it came). This would allow GBIF / ALA the ability to react to user needs quickly and noninvasively as it doesn't require people to change how they structure the data. It does mean we (data managers at GBIF + nodes) take some responsibility for categorizing datasets, but the reality is we're doing that anyway in the many ad-hoc reports we run.

Edited to add: The approach is similar to how we codify the type, relevance, and topic for literature which is manageable.

elywallis commented 3 years ago

Is this wise?

No, it's not! It's just the best I could come up with so that when all those records are indistinguishable according to BoR, if someone downloads a bunch of records they at least have something in a field that can be used to parse out the records. I certainly won't argue for this suggestion as being even close to being the best thing to do!

It looks slightly to just be moving the problem from basisOfRecord into another term.

Well, yes, it sort of is. Again, thinking of it as an interim measure but you've already moved to better suggestions so feel free to ditch mine.

If gbif/registry#247 is considered a sensible way to tackle this problem it could be implemented within weeks (in GBIF and ALA collectory).

The principle gets a big tick from me - but implemented in the Collectory? Hmm, I thought you meant implemented where I'm searching for occurrence records - I didn't anticipate having to search in the Collectory first and then showing occurrence records as a second step.

If the main "unit" for uploads is the data resource then that will be really problematic for the collections (who we're trying to help). Taking Museums Victoria as an example:

the data resource is https://collections.ala.org.au/public/show/dr342 but that only gives you one pointer to all the records.
the institution is https://collections.ala.org.au/public/show/in16 which contains links to all the different collections within that institution. This is where you'd need to tag different collections as preserved specimens or tissues. So if you put tags at the data resource level then you're no better off because you can't be specific enough.

Also, going via the Collectory means, as a user, I can only see the records for a single institution. What if I want to search taxonomically? e.g. "give me all the tissue samples for genus Macropus regardless of which institution they're held in?" Currently, that's super easy to do - query for genus = Macropus; filter on BasisOfRecord = MaterialSample I assumed that the tags would give me a similar ease - query for genus = Macropus; pick the tag "tissue sample" (or some variant) Is that not what you were thinking @timrobertson100 ? I'll re-read gbif/registry#247 to see if I can pick up that nuance there and re-comment on that thread.

In the meantime, would it perhaps be better for ALA to simply continue to use those two values in the BOR field for all the user-facing reasons already explained?

Again, no argument from me on that one as an interim solution

@m-hope and I had a discussion this afternoon - in the absence of a comment from him, one of his issues is more about whether eDNA should even be placed in MaterialSample? His contention is that it should go under Observations > Machine Observations. As to why he thinks gbif/registry#247 isn't a long term solution - we'll have to wait for him to comment on that.

timrobertson100 commented 3 years ago

Hmm, I thought you meant implemented where I'm searching for occurrence records... I assumed that the tags would give me a similar ease - query for genus = Macropus; pick the tag "tissue sample" (or some variant)

Sorry for the misunderstanding. It would indeed be on occurrence records, but we'd manage the tagging at dataset/resource level and therefore be stored as metadata in registry/collectory. Those tags would be copied onto each record on processing, and then available as search facets in the UI at record level. In GBIF, additions to the vocabulary would be picked up automatically in pipeline processing and the UI, since they integrate with the vocabulary server (not sure how ALA UI would behave)

elywallis commented 3 years ago

Still not quite convinced @timrobertson100
Still using Museums Victoria as an example . The record in GBIF that matches the ALA data resource (https://collections.ala.org.au/public/show/dr342) is this one: https://www.gbif.org/dataset/39905320-6c8a-11de-8226-b8a03c50a862 That dataset contains both "specimen" records and tissue samples. In ALA these are split out into "collections" but not in GBIF. So if the tag for tissue samples is attached to the dataset record then it's too coarse because only a subset of records are for tissue samples. Maybe I still misunderstand?

timrobertson100 commented 3 years ago

Maybe I still misunderstand?

Ah, thanks for repeating. You don't misunderstand.

Edited to add: It's reasonable to consider a future where publishers do provide this on a record by record level though, but that might be a slower adoption process since it brings in a new term that ideally would be in DwC.

In ALA these are split out into "collections" but not in GBIF

We didn't but we do now (e.g. MV collections here) - I'll provide detail on that in email to avoid me drifting off topic.

m-hope commented 3 years ago

It may just be a hunch/intuition but if you have reasoning could you elaborate on why it's not long term on the issue please @m-hope? We might be overlooking something that you see. The intention was to decouple the data model (i.e. classes of concept) from the way many people want to search for data (the nature of the dataset/community from which it came). This would allow GBIF / ALA the ability to react to user needs quickly and noninvasively as it doesn't require people to change how they structure the data. It does mean we (data managers at GBIF + nodes) take some responsibility for categorizing datasets, but the reality is we're doing that anyway in the many ad-hoc reports we run.

Sorry, @timrobertson100 , poor choice of words... I agree that the "category" is an excellent way to decouple our management of data from the way users want to search it and in that respect it is something that should be implemented and will hopefully grow... what I meant by "not being a long term solution" is that this mechanism can not fix the fundamental problems with BoR and shouldn't be considered as a solution for that, perhaps in the short term, but certainly not in the long term. In reading @tucotuco's explanation of BoR origins and that "PreservedSpecimen, MachineObservation, etc are subtypes of "Occurrence", it occurred to me that by lumping all DNA based data into one pot, we are doing the different aspects of it a disservice. In the eDNA guidelines workshop last year we decided that it was important enough to separate at least five different types/categories of DNA derived data, each with their own metadata and usage. However, in the current BoR setup, most (if not all) of these different data types are lumped in with tissue samples under MaterialSample (and as you point out the only field which really could differentiate them is "preparations" which apart from being a free text field (which is worse than the current BoR usage) has a slightly different purpose).

To elaborate on what @elywallis mentioned about me thinking eDNA data in particular should be considered MachineObservation data... while I'm no expert on DwC or microbiology for that matter, my feeling is that in theory, GenomicDNA is a sample taken from a specimen and in many ways can be treated as a form of "tissue", ie: MaterialSample may be appropriate. However, we (as in GBIF and ALA) are treating eDNA data as observations/occurrences. The evidence of which is a sequence detected in the environment which is processed via a machine. In much the same way that a camera trap produces a picture so that a human can later say an organism of a particular species was at that location at the time, a sequencer processes an environmental sample and produces a "picture" (ie: string of DNA codes) that can also be interpreted by a human to say that an organism of a particular species was at that location at that time. This is a rambling way to say that I don't believe lumping all DNA derived data as MaterialSample is the right solution and while yes, eDNA sequence data is technically a string of letters, the same as GenomicDNA, and is a "sample" of something else, because we are treating the data differently (ie: as occurrences), we shouldn't be categorising the data in the same way.

javier-molina commented 3 years ago

Thanks @elywallis, @timrobertson100 and @m-hope for your contributions.

While all are valid points and I appreciate Ely explaining thoroughly the implications from the user and usability perspective we need to keep this manageable to the point a solution for it does not blow the release for stage 1 or a second iteration soon after that.

Overall whatever we decide we need to make sure our General profile has a compatible filter for the current criteria:

Screen Shot 2021-06-07 at 10 14 27 am

1) Whether that can be achieved with additional fields such as Preparations and BasisOfRecord = MaterialSample and without affecting existing collections already using MaterialSample is something that needs to be confirmed.

2) The other option that we (the infrastructure upgrade team) discarded originally, mostly based on data and alignment with GBIF pipelines existing processing, is bringing back the ALA specific vocabulary extensions for BoR such as "Environmental DNA" and "Genomic DNA".

I'd like to have @peggynewman and @djtfmartin on the above when they have a chance.

Finally the suggestions for a more extensive solution to the issue like tags or changes to GBIF Registry or ALA Collectory sound very promising I would prefer to continue those in their respective project and possibly link to this GH issue. Personally I wouldn't want us to spend more effort extending the collectory if we will be soon starting to plan how to approach the adoption of the GBIF registry.

nielsklazenga commented 3 years ago

I think MaterialSample as a basisOfRecord, especially for eDNA records, is really problematic, as, unlike the other objects for basisOfRecord, MaterialSample is also a "proper" Darwin Core class with a MaterialSampleID and potentially other properties. In most cases, I think the basisOfRecord for Material Samples is actually PreservedSpecimen or LivingSpecimen.

I do not know much about environmental DNA, but I think it does not fit in any of the available basisOfRecord classes. As other people have suggested, I think it would be good to have another basisOfRecord class, like EnvironmentalSample (or EnvironmentalDNA if that is thought too broad), to distinguish these samples from samples that are taken from preserved or living specimens (tissues, molecular isolates etc.). It would be good if this got into Darwin Core eventually, but given the discussion around MaterialSample that is going there, I do not see that happen anytime soon (so I suggest not to wait for that).

peggynewman commented 3 years ago

Hi all, especially @elywallis @m-hope

Some partial solutions have been put in place for 1.0, which we can adapt.

To push the BASE dataset in, we've changed BoR to Material Sample. This dataset will grow and change over time, and we have a DwCA that we can mark up. This is in pipelines now.

@djtfmartin has done some cleaning up in the preparations field and mapped what's in the field now to a cleaner set of values. It's got the eDNA and genomicDNA values in there. Have a look here: https://docs.google.com/spreadsheets/d/1jkJJh5aoY8Qs1tBeJ4GMf-2LQ_UiJ-V3bgp1TItVlYc/edit?usp=sharing

This is handled in dictionary files within the ALA, and doesn't happen in GBIF as well at this stage.

This result means that there is both a raw and processed preparations field. It's a multi value field, so if there is any match to that processed value in the record, the facet should still be flexible enough to pick it up. We can test that.

What are your thoughts about this approach?

javier-molina commented 3 years ago

All, @elywallis, @m-hope

Further to Peggy's comment above, this is work in progress but the expectation is that preparation fields can be used to filter eDNA from non eDNA records that are also Material Sample hence retaining the ability for an end user to issue specific searches.

At the moment, while still not visible in the UI, it is possible to filter by preparations like in the search below:

https://biocache-test.ala.org.au/occurrences/search?q=preparations%3ATissue

The above even works in current production.

FYA Ely, Michael While Dave has done the initial cleanup for preparations to turn it into a managed vocabulary, quoting him, that is a guess work on his part hence we need someone with more knowledge in natural collections to validate/fix/expand the current spreadsheet https://github.com/AtlasOfLivingAustralia/la-pipelines/issues/397#:~:text=https%3A//docs.google.com/spreadsheets/d/1jkJJh5aoY8Qs1tBeJ4GMf-2LQ_UiJ-V3bgp1TItVlYc/edit%3Fusp%3Dsharing

Could you help with that?

Thanks

djtfmartin commented 3 years ago

Draft PR is here: https://github.com/gbif/pipelines/pull/547

Will convert from Draft once we have a review of the vocab file.

m-hope commented 3 years ago

I really don't want to be seen as a stick in the mud on this, but having just reviewed the preparations mapping file, I am struggling with this, particularly as we all seem to be busy with the EoFY and a heap of other priority issues going on right now.

If this mapped field is intended to replace the preparations field then I cannot see it working for users. This field is intentionally multi-value to describe how a particular tissue sample has been treated... all of the information provided (which tissues are stored, how they have been prepared/stored, whether they have been fixed with formalin, are now in ethanol, concentrations of the chemicals, etc) is relevant for those people who are going to be interested in this. Trying to reduce multiple different values to a single mapped vocab is going to be so lossy as to actually hinder searching. For example, I am currently working with a group at ANWC who want to identify particular historic specimens that have tissue that has specifically been fixed in formalin. So far I have been able to do this using the preparations field, but replacing this with a single "Wet" value (not to mention where the value "formalin" has been ignored and replaced by "Skin" for example) would produce so many false positives and false negatives that it wouldn't work.

Even if we went with a multi-value vocab, with my (limited) experience in this area, I wouldn't be comfortable trying to accurately map this information. We would definitely need to seek expert advice, but it would need to be from many relevant areas and not just one or two people. There are abreviations in there that are meaningful for people in that particular field of research but not others. This is going to be a mammoth task and shouldn't be being attempted at the eleventh hour.

If, on the other hand, this field is going to be a separate field included in addition to the preparations field, then why don't we prototype the category (or some other non-DwC) field and just use that to categorise tissues, genomicDNA, eDNA (and any other relevant tissue type) rather than trying to shoehorn this into an existing field that really has a different purpose.

I have absolutely no problems admitting that the current situation with eDNA and BoR is not ideal and needs to be rectified. However, coming up with last minute solutions that are going to seriously affect the data and be so lossy as to make it effectively useless to users is not the way we should be tackling this.

Can I suggest that we keep things as they are in the short term, but take the time to explore all options and come up with a solid, workable and DcW compatible solution when we all have the ability to focus on it properly.

peggynewman commented 3 years ago

This will be an interpreted preparations field, meaning that the raw preparations field will be preserved and available (see original vs processed values). At the moment, neither field is on the Customise Facet interface, but can be added. Do you think that goes some way to resolving your issues, if it is actually an additional field? In that sense it shouldn't be lossy. Both fields are indexed. With the formalin problem it would be great if you could bring that knowledge to the spreadsheet. I agree too that this is more of a long haul problem. When we do want to use something of a vocabulary or make calls like this it would be great to be able to decide on things via committee rather than quick fixes.

elywallis commented 3 years ago

Firstly thanks for making an attempt to address this issue. My comments:

This issue arose because ALA has custom values in the BoR field (environmentalDNA and genomicDNA) that were going to be mapped to "Unknown" on the first pass and "MaterialSample" on the second look. But let's look at the reason those custom values were there in the first place - to allow users to very easily filter out or in the eDNA records. We need to ask ourselves if the solution proposed keeps the very easy criterion. Sadly it doesn't.
I thought that after lengthy discussion an agreement had been reached with Tim R to leave the custom values as is for now - simply because the MaterialSample/Preparations solution was too difficult - until a better/longer term solution could be developed. And that we'd just leave the values in BoR as they are.
Either that message wasn't received or wasn't accepted so now we're back to trying to cram Preparations into a controlled vocabulary at the last minute. I'm sorry but I agree with Michael that this really isn't an optimal solution. Trying to clean up a controlled list that's gotten a bit out of hand is one thing - trying to create a controlled list out of a multi-value free text field is going to take a serious amount of work. To do this properly also requires us to address the fact that Preparations is an awful field because it mixes what the specimen is with how it was preserved, and how it's stored gets thrown in there as well. And then people start using abbreviations known only to them or their discipline (IV.1, ungual) leading to a very large number of the values being mapped to "unknown" - basically rendering them unsearchable.

I'd like to get back to users. Can we give some thought to the stated user needs (and why we created the custom BoR terms in the first place)

environmentalDNA is still relatively new (to ALA at least) and generates a potentially huge number of records for each study. We wanted an easy way to filter OUT the eDNA records. If they're mapped to BoR=MaterialSample then a second field is needed. A value in Preparations is proposed "environmentalDNA". Has anyone checked if any of the current eDNA records already have something in Preparations that would be overwritten by applying new data during processing?
Ditto for genomicDNA but there are currently fewer records.
And then, unless Preparations is somehow faceted then we can't easily make a filter to exclude or include these records. Serious issue number 1.
Collections currently map their tissue samples to BoR=materialSample and direct researchers who want loans of tissue for sequencing work to query ALA in that way. With all the eDNA and genomicDNA also mapping to materialSample, how should collections now advise researchers to look for tissue samples? It will be a much more complex (and actually impossible) query because, as stated, Preparations is not a facet field so you can't easily exclude environmentalDNA or genomicDNA. From that point alone I think we have come up with a very poor solution for a group of users and data providers who only did what we asked them to do by using MaterialSample in the first place.
which gets back to - we need a separate field that can have a controlled vocabulary to allow these record types to be differentiated. Do we turn outselves in knots trying to create something out of the existing Preparations field with no time to do it properly? Or do we just keep what we have and give ourselves (and GBIF) time to think?
Bearing in mind as well that the whole BasisOfRecord concept is under review by the DarwinCore maintenance group process as well as Tim R also proposing a different solution involving category tags.

Summary is that I can put work into trying to do something/kludging/potentially really mucking up Preparations values or I can beg for just leaving the status quo until we can develop a better longer term solution. You can guess that the second option is my preference

m-hope commented 3 years ago

This will be an interpreted preparations field, meaning that the raw preparations field will be preserved and available (see original vs processed values). At the moment, neither field is on the Customise Facet interface, but can be added. Do you think that goes some way to resolving your issues, if it is actually an additional field? In that sense it shouldn't be lossy. Both fields are indexed. With the formalin problem it would be great if you could bring that knowledge to the spreadsheet.

Again I reiterate, if this is going to be a separately indexed field, why are we trying to shoehorn a free text field with multiple values that represent multiple different components of a process, into a single term fixed vocabulary? Instead why don't we implement a proto-category field where most of the terms have already been defined in gbif issue #247.

At (great) risk of further encouraging this idea of forcing preparations into a vocab (which I really don't want to do), the only way you could map the current values in this field that would be in any way useful for users, is to map multiple vocab values for each record... an example of the complexity of this task would be the preparations values of "specimenNature=Spirit specimen;specimenForm=Wet;fixativeTreatment=Ethanol 96%;storageMedium=ethanol 90%" and "specimenNature=Whole;specimenForm=Wet;fixativeTreatment=formalin 10%;storageMedium=ethanol 70%". Each have been mapped to the different values of "Wet" and "Whole" respectively. Neither of which even remotely captures the information available in the original data. Only something like "Wet | Spirit | Ethanol 95% Fixed | Ethanol 90% Stored" or "Wet | Whole| Ethanol 95% Fixed | Ethanol 90% Stored" would even begin to do that. Noting that you'd then need to map each of the 2400+ values individually, particularly as Ely points out, many of the values are abbreviations which only a human could understand. And if the ALA wants to retain any credibility with the Museum community, we'd need to get the mappings ratified by the collection owners. This won't happen quickly.

Additionally, as far as I am aware, the faceting function doesn't handle multi-value fields so this whole exercise wouldn't work as a simple mechanism to filter eDNA records out.

Can I ask why it is so important for us to do this now, especially considering Tim has already indicated above that for the time being it might be better for ALA to simply continue to use those two values in the BOR field and proposed a simple implementation option?

djtfmartin commented 3 years ago

Additionally, as far as I am aware, the faceting function doesn't handle multi-value fields so this whole exercise wouldn't work as a simple mechanism to filter eDNA records out.

The UI supports this.

The vocab file is misleading in that we have lots of data that are being provided in "Spirit | Skull" style format - as recommended in DwC terms. The logic in pipelines splits the raw preparations by "|" and then uses the linked vocab file to do the mapping. We then store the values for preparations as a multi-value field.

javier-molina commented 3 years ago

@djtfmartin I moved this back to in progress so devs don't try to pick this for review.

m-hope commented 3 years ago

Just noting tuco_tuco's comments in gbif/vocabulary#97 pretty much correlate with my comments above in terms of the difficulty in trying to map a standard vocab to a field that (a) was never designed to be standardised and (b) covers several different concepts in the one term, and also notes the immensity of the task, which is currently being worked on by a group of over 40 people.

Again, I reiterate, while we all agree that something needs to be done about BoR at some point, @peggynewman or @djtfmartin, can you please explain why we need to immediately put in place a rushed solution that will make things worse for our users?

peggynewman commented 3 years ago

It's because in the pipelines work, we wanted to be able to implement only GBIF's transform module for basis of record and not create a new ALA transform to deal with just our BoR values. We've been looking to reuse GBIF code wherever possible to maintain consistency. We mistakenly thought that this would be workable given the vocab work that's been happening in GBIF and that there would be some other way of providing filters for the eDNA problem.

m-hope commented 3 years ago

Fair enough... but it seems to me that trying to map the preparations field to a standard vocab and then changing how the ALA filters eDNA records would be a lot more work than modifying GBIF's transform module in the short term. Once DwC works out a solution to handling eDNA that doesn't just lump it in with everything else, then we can adopt that and move to GBIF's transform module. I've been looking for an alternative appropriate DwC term that is currently unused that we could use to temporarily to store a material sample type in, but not having much luck.

peggynewman commented 3 years ago

Whether it's more or less work to do one or the option might not look the same as before with the new infrastructure. Ideally we'd rather not deploy something that we know has to be backed out later, knowing how hard it is to back things out. A vocab can change.

You and I talked about samplingProtocol for this use some time ago and I'm sure there are thoughts about other things. Having a shared code base now means that ideally we would work together with GBIF and TDWG on solutions for this problem rather than just wait for the answer to come to us.

djtfmartin commented 3 years ago

thanks @m-hope - just on the implementation side of things...

I think I did the wrong thing sharing 2,500 different values in the first place :)

Looking at the data again, 96% of the records with preparations are covered by the top 100 values - and these values all look pretty straightforward (and similar to the suggested vocab in the DwC term definition). They include multiple values, separated by pipe character.

In other words, we can just map the top 100 preparations raw values, and we've covered 96% of the records.

On implementing the mapping in processing, this already done and was very simple to do (we just reuse GBIF's vocab library). Exposing the field as a facet in the UI is just configuration.

elywallis commented 3 years ago

Woah, so now Preparations is out and we've jumped to a whole new solution???

Tim R and I had discussed his proposal to use a flag against Registry entries and had already come to the separate conclusion that it wasn't going to be granular enough if it was done at the Data Resource level. And certainly not until the current ALA Collectory entries get a massive cleanup (for the museum entries at least where they need to map to GRSciColl - which they currently often don't).

Can I remind everyone that the data resource is "OZCAM provider for [x museum]" but this DR will comprise collections for preserved specimens, observations, tissue samples and (in the future) fossils as well. It's simply too coarse to label this whole DR as being a single 'content type'. And particularly using a label that can't be seen on the front end. And that's in addition to no clarity on who gets to pick what "type" is selected, how is it changed and how does the actual data provider have any say in how their organisation is characterised?

Robina suggested that this is only going to apply to current DRs that are environmentalDNA and genomicDNA and it is true that currently these are mutually exclusive from the collections. So maybe it will work as a stop gap but I will then come back to my user acceptance criteria:

How do Margaret, Keith and Ursula tell a researcher who wants to use some tissue for DNA work how to find a tissue sample in ALA Biocache after this is implemented?
How does anyone who wants to filter in or out environmentalDNA from the ALA do that after this is implemented?

Until someone can give me a set of clear (and easy) steps to do those things I'm going to keep making a fuss. Those steps need to be documented and I want to know how it's going to be communicated to those most affected by these changes. Having promised collections that "nothing would change on the front end" I'm really not happy that this clearly articulated and identified user issue seems to be being pushed aside.

I am well aware that your goal is to go live this week. But, folks - this is about users. They're the ones being forgotten in all this.

PS Sorry for the rant but I'm just cross that, having identified an issue, that we're racing to push in a solution that we won't have time to test and that may or may not work just for the sake of meeting a self imposed deadline.

m-hope commented 3 years ago

Sorry, I missed the conversation on setting data types at the collection level... but I support Ely's rant on this one... a solution seems to be being steamrolled through without any regard to the end result... the ALA already has a DR that contains both Human Observation data and eDNA data (https://collections.ala.org.au/public/show/dr11663), and I can only see this increase as eDNA becomes mainstream as another tool to survey biodiversity. Setting data type at the DR level is way too course for users to extract/filter out eDNA records.

javier-molina commented 3 years ago

I'm closing this in the meantime.

We implemented #431 as an interim solution and we are going to get back to the drawing board to look for a more suitable solution.