bat-literature / bat-literature.github.io

The Bat Literature Project aims to facilitate discovery of scientific literature on bats (Chiroptera)
Creative Commons Zero v1.0 Universal

approach curating duplicate literature entries #6

Open jhpoelen opened 2 months ago

jhpoelen commented 2 months ago

In the process of merging the various literature corpora shared by DeeAnn, Nancy and Kendra, some duplicates appear in batlit v0.1.

For instance,

First record:

  id: https://www.zotero.org/groups/bat_literature_project/items/I8FSKML3
  authors: Goldstein | Anthony | Gbakima | Bird | Bangura | Tremeau-Bravard | Belaganahalli | Wells | Dhanota | Liang | Grodus | Jangra | DeJesus | Lasso | Smith | Jambai | Kamara | Kamara | Bangura | Monagin | Shapira | Johnson | Saylors | Rubin | Chandran | Lipkin | Mazet
  date: 2018-08-27
  title: The discovery of Bombali virus adds further support for bats as hosts of ebolaviruses
  journal: Nature Microbiology
  doi: 10.1038/s41564-018-0227-2

Second record:

  id: https://www.zotero.org/groups/bat_literature_project/items/7CL5RBDS
  authors: Goldstein | Anthony | Gbakima | Bird | Bangura | Tremeau-Bravard | Belaganahalli | Wells | Dhanota | Liang | Grodus | Jangra | DeJesus | Lasso | Smith | Jambai | Kamara | Kamara | Bangura | Monagin | Shapira | Johnson | Saylors | Rubin | Chandran | Lipkin | Mazet
  date: 2018-10
  title: The discovery of Bombali virus adds further support for bats as hosts of ebolaviruses
  journal: Nature Microbiology
  doi: 10.1038/s41564-018-0227-2
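
Duplicates like the two records above share a DOI even though their dates differ, so one way to surface them is to group records by a normalized DOI. The following is a minimal sketch under that assumption, not the method used to build batlit; the record layout (`id`, `date`, `doi` keys) and the function names are hypothetical, and records without a DOI are simply skipped.

```python
from collections import defaultdict


def normalize_doi(doi):
    """Lower-case a DOI and strip common resolver/scheme prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def group_duplicates(records):
    """Return {doi: [record ids]} for every DOI shared by more than one record."""
    by_doi = defaultdict(list)
    for record in records:
        if record.get("doi"):  # records without a DOI need another matching strategy
            by_doi[normalize_doi(record["doi"])].append(record["id"])
    return {doi: ids for doi, ids in by_doi.items() if len(ids) > 1}


# The two example records, reduced to the fields that matter here.
records = [
    {"id": "I8FSKML3", "date": "2018-08-27", "doi": "10.1038/s41564-018-0227-2"},
    {"id": "7CL5RBDS", "date": "2018-10", "doi": "10.1038/s41564-018-0227-2"},
]

for doi, ids in group_duplicates(records).items():
    print(doi, ids)
```

Grouping on the DOI rather than on title or date sidesteps differences such as `2018-08-27` versus `2018-10`, at the cost of missing duplicates in which one entry has no DOI at all.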

@ajacsherman in your role as curator of the batlit corpus, what is your approach to handling duplicates?

Some options I can think of:

  1. merge records
  2. link duplicate records
  3. do nothing

Also, I assume that updates are expected to come in from various sources, making further duplication likely. How do you imagine handling updates?


ajacsherman commented 1 month ago

Hi Jorrit, I have been deduplicating within our contributors' individual libraries by merging the records. I am making copies of these named folders (e.g., "Reeder") in my own personal folders so I can keep a record of who contributed what. Now that I have these archived, I can eliminate the duplicates for the complete Bat Literature Collection. Will that help? Going forward, I plan to curate individual collections in a separate folder and then add the PDFs to the larger shared folder once my curation is complete. I can work on that right now if it moves your process forward?

jhpoelen commented 1 month ago

@ajacsherman thanks for sharing how you are handling duplicates.

Can you provide one or two examples of such merged duplicates?

ajacsherman commented 1 month ago

It looks like I need a plugin to track my past activity. I will merge two files now...

ajacsherman commented 1 month ago

Abdelgawad A, Damiani A, Ho SY, Strauss G, Szentiks CA, East ML, Osterrieder N, Greenwood AD. Zebra Alphaherpesviruses (EHV-1 and EHV-9): Genetic Diversity, Latency and Co-Infections. Viruses. 2016 Sep 20;8(9):262. doi: 10.3390/v8090262. PMID: 27657113; PMCID: PMC5035975.

[images: BLR_merge3, BLR_merge2, BLR_merge1]

jhpoelen commented 1 month ago

With your example, I was able to track down the original duplicate records related to doi:10.3390/v8090262

via

preston ls --algo md5 \
 | grep 'items?' \
 | preston grep "10.3390/v8090262" \
 | grep value

yielding the expected two search hits:

<line:hash://md5/1eeb212dfd65e7549522aef68497604d!/L1072> <http://www.w3.org/ns/prov#value> "            \"DOI\": \"10.3390/v8090262\"," <urn:uuid:34bd5285-9e7d-426b-b1fd-996516be6a7f> .
<line:hash://md5/0f362972b3eb321073cee168c180c307!/L9756> <http://www.w3.org/ns/prov#value> "            \"DOI\": \"10.3390/v8090262\"," <urn:uuid:8a08d8a3-9390-4d51-bc15-309183fccdab> .
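
Each hit names the content it was found in (an md5 content id) and the line number of the match. As a rough illustration, the sketch below pulls those two parts out of a hit and builds a line-range locator of the form used in the requests that follow. It assumes the statement layout shown above; the function names are hypothetical, and the start and end lines of the enclosing record are supplied by hand rather than computed.

```python
import re

# Matches the subject of a hit: <line:hash://md5/<32 hex digits>!/L<number>>
HIT = re.compile(r"^<line:(hash://md5/[0-9a-f]{32})!/L(\d+)>\s")


def parse_hit(statement):
    """Return (content id, line number) for a hit, or None if it does not match."""
    match = HIT.match(statement)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def line_range_uri(content_id, first, last):
    """Build a locator for lines first..last of the given content."""
    return f"line:{content_id}!/L{first}-L{last}"


hit = (
    r'<line:hash://md5/1eeb212dfd65e7549522aef68497604d!/L1072> '
    r'<http://www.w3.org/ns/prov#value> '
    r'"            \"DOI\": \"10.3390/v8090262\"," '
    r'<urn:uuid:34bd5285-9e7d-426b-b1fd-996516be6a7f> .'
)

content_id, line_number = parse_hit(hit)
print(content_id, line_number)
# Record boundaries (L968 to L1091) were located by inspection, not derived here.
print(line_range_uri(content_id, 968, 1091))
```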

The first hit is part of the record retrieved with

curl 'https://linker.bio/line:hash://md5/1eeb212dfd65e7549522aef68497604d!/L968-L1091'

producing:

    {
        "key": "YWNCWPYJ",
        "version": 2600,
        "library": {
            "type": "group",
            "id": 5435545,
            "name": "Bat Literature Project",
            "links": {
                "alternate": {
                    "href": "https://www.zotero.org/groups/bat_literature_project",
                    "type": "text/html"
                }
            }
        },
        "links": {
            "self": {
                "href": "https://api.zotero.org/groups/5435545/items/YWNCWPYJ",
                "type": "application/json"
            },
            "alternate": {
                "href": "https://www.zotero.org/groups/bat_literature_project/items/YWNCWPYJ",
                "type": "text/html"
            },
            "attachment": {
                "href": "https://api.zotero.org/groups/5435545/items/JGTYJ4TR",
                "type": "application/json",
                "attachmentType": "application/pdf",
                "attachmentSize": 1609849
            }
        },
        "meta": {
            "createdByUser": {
                "id": 13229919,
                "username": "acsherman",
                "name": "",
                "links": {
                    "alternate": {
                        "href": "https://www.zotero.org/acsherman",
                        "type": "text/html"
                    }
                }
            },
            "creatorSummary": "Abdelgawad et al.",
            "parsedDate": "2016-09-20",
            "numChildren": 1
        },
        "data": {
            "key": "YWNCWPYJ",
            "version": 2600,
            "itemType": "journalArticle",
            "title": "Zebra Alphaherpesviruses (EHV-1 and EHV-9): Genetic Diversity, Latency and Co-Infections",
            "creators": [
                {
                    "creatorType": "author",
                    "firstName": "Azza",
                    "lastName": "Abdelgawad"
                },
                {
                    "creatorType": "author",
                    "firstName": "Armando",
                    "lastName": "Damiani"
                },
                {
                    "creatorType": "author",
                    "firstName": "Simon",
                    "lastName": "Ho"
                },
                {
                    "creatorType": "author",
                    "firstName": "Günter",
                    "lastName": "Strauss"
                },
                {
                    "creatorType": "author",
                    "firstName": "Claudia",
                    "lastName": "Szentiks"
                },
                {
                    "creatorType": "author",
                    "firstName": "Marion",
                    "lastName": "East"
                },
                {
                    "creatorType": "author",
                    "firstName": "Nikolaus",
                    "lastName": "Osterrieder"
                },
                {
                    "creatorType": "author",
                    "firstName": "Alex",
                    "lastName": "Greenwood"
                }
            ],
            "abstractNote": "Alphaherpesviruses are highly prevalent in equine populations and co-infections with more than one of these viruses’ strains frequently diagnosed. Lytic replication and latency with subsequent reactivation, along with new episodes of disease, can be influenced by genetic diversity generated by spontaneous mutation and recombination. Latency enhances virus survival by providing an epidemiological strategy for long-term maintenance of divergent strains in animal populations. The alphaherpesviruses equine herpesvirus 1 (EHV-1) and 9 (EHV-9) have recently been shown to cross species barriers, including a recombinant EHV-1 observed in fatal infections of a polar bear and Asian rhinoceros. Little is known about the latency and genetic diversity of EHV-1 and EHV-9, especially among zoo and wild equids. Here, we report evidence of limited genetic diversity in EHV-9 in zebras, whereas there is substantial genetic variability in EHV-1. We demonstrate that zebras can be lytically and latently infected with both viruses concurrently. Such a co-occurrence of infection in zebras suggests that even relatively slow-evolving viruses such as equine herpesviruses have the potential to diversify rapidly by recombination. This has potential consequences for the diagnosis of these viruses and their management in wild and captive equid populations.",
            "publicationTitle": "Viruses",
            "volume": "8",
            "issue": "9",
            "pages": "262",
            "date": "2016-09-20",
            "series": "",
            "seriesTitle": "",
            "seriesText": "",
            "journalAbbreviation": "Viruses",
            "language": "en",
            "DOI": "10.3390/v8090262",
            "ISSN": "1999-4915",
            "shortTitle": "Zebra Alphaherpesviruses (EHV-1 and EHV-9)",
            "url": "http://www.mdpi.com/1999-4915/8/9/262",
            "accessDate": "2024-04-18T21:02:03Z",
            "archive": "",
            "archiveLocation": "",
            "libraryCatalog": "DOI.org (Crossref)",
            "callNumber": "",
            "rights": "https://creativecommons.org/licenses/by/4.0/",
            "extra": "",
            "tags": [],
            "collections": [
                "DZKBQXJR"
            ],
            "relations": {},
            "dateAdded": "2024-04-18T21:02:03Z",
            "dateModified": "2024-04-18T21:02:03Z"
        }
    },

and

preston cat 'line:hash://md5/0f362972b3eb321073cee168c180c307!/L9653-L9802'

or https://linker.bio/line:hash://md5/0f362972b3eb321073cee168c180c307!/L9653-L9802

producing:

    {
        "key": "2PWXAVQL",
        "version": 803,
        "library": {
            "type": "group",
            "id": 5435545,
            "name": "Bat Literature Project",
            "links": {
                "alternate": {
                    "href": "https://www.zotero.org/groups/bat_literature_project",
                    "type": "text/html"
                }
            }
        },
        "links": {
            "self": {
                "href": "https://api.zotero.org/groups/5435545/items/2PWXAVQL",
                "type": "application/json"
            },
            "alternate": {
                "href": "https://www.zotero.org/groups/bat_literature_project/items/2PWXAVQL",
                "type": "text/html"
            },
            "attachment": {
                "href": "https://api.zotero.org/groups/5435545/items/6AZNRAQN",
                "type": "application/json",
                "attachmentType": "text/html"
            }
        },
        "meta": {
            "createdByUser": {
                "id": 6296343,
                "username": "deeannreeder",
                "name": "",
                "links": {
                    "alternate": {
                        "href": "https://www.zotero.org/deeannreeder",
                        "type": "text/html"
                    }
                }
            },
            "creatorSummary": "Abdelgawad et al.",
            "parsedDate": "2016-09",
            "numChildren": 2
        },
        "data": {
            "key": "2PWXAVQL",
            "version": 803,
            "itemType": "journalArticle",
            "title": "Zebra Alphaherpesviruses (EHV-1 and EHV-9): Genetic Diversity, Latency and Co-Infections",
            "creators": [
                {
                    "creatorType": "author",
                    "firstName": "Azza",
                    "lastName": "Abdelgawad"
                },
                {
                    "creatorType": "author",
                    "firstName": "Armando",
                    "lastName": "Damiani"
                },
                {
                    "creatorType": "author",
                    "firstName": "Simon Y. W.",
                    "lastName": "Ho"
                },
                {
                    "creatorType": "author",
                    "firstName": "Günter",
                    "lastName": "Strauss"
                },
                {
                    "creatorType": "author",
                    "firstName": "Claudia A.",
                    "lastName": "Szentiks"
                },
                {
                    "creatorType": "author",
                    "firstName": "Marion L.",
                    "lastName": "East"
                },
                {
                    "creatorType": "author",
                    "firstName": "Nikolaus",
                    "lastName": "Osterrieder"
                },
                {
                    "creatorType": "author",
                    "firstName": "Alex D.",
                    "lastName": "Greenwood"
                }
            ],
            "abstractNote": "Alphaherpesviruses are highly prevalent in equine populations and co-infections with more than one of these viruses’ strains frequently diagnosed. Lytic replication and latency with subsequent reactivation, along with new episodes of disease, can be influenced by genetic diversity generated by spontaneous mutation and recombination. Latency enhances virus survival by providing an epidemiological strategy for long-term maintenance of divergent strains in animal populations. The alphaherpesviruses equine herpesvirus 1 (EHV-1) and 9 (EHV-9) have recently been shown to cross species barriers, including a recombinant EHV-1 observed in fatal infections of a polar bear and Asian rhinoceros. Little is known about the latency and genetic diversity of EHV-1 and EHV-9, especially among zoo and wild equids. Here, we report evidence of limited genetic diversity in EHV-9 in zebras, whereas there is substantial genetic variability in EHV-1. We demonstrate that zebras can be lytically and latently infected with both viruses concurrently. Such a co-occurrence of infection in zebras suggests that even relatively slow-evolving viruses such as equine herpesviruses have the potential to diversify rapidly by recombination. This has potential consequences for the diagnosis of these viruses and their management in wild and captive equid populations.",
            "publicationTitle": "Viruses",
            "volume": "8",
            "issue": "9",
            "pages": "262",
            "date": "2016/9",
            "series": "",
            "seriesTitle": "",
            "seriesText": "",
            "journalAbbreviation": "",
            "language": "en",
            "DOI": "10.3390/v8090262",
            "ISSN": "",
            "shortTitle": "Zebra Alphaherpesviruses (EHV-1 and EHV-9)",
            "url": "https://www.mdpi.com/1999-4915/8/9/262",
            "accessDate": "2021-01-22T15:11:19Z",
            "archive": "",
            "archiveLocation": "",
            "libraryCatalog": "www.mdpi.com",
            "callNumber": "",
            "rights": "http://creativecommons.org/licenses/by/3.0/",
            "extra": "Number: 9\nPublisher: Multidisciplinary Digital Publishing Institute",
            "tags": [
                {
                    "tag": "EHV-1",
                    "type": 1
                },
                {
                    "tag": "EHV-9",
                    "type": 1
                },
                {
                    "tag": "co-occurrence",
                    "type": 1
                },
                {
                    "tag": "diversity",
                    "type": 1
                },
                {
                    "tag": "latency",
                    "type": 1
                },
                {
                    "tag": "zebra",
                    "type": 1
                }
            ],
            "collections": [
                "DZKBQXJR"
            ],
            "relations": {
                "owl:sameAs": "http://zotero.org/groups/2719577/items/WNTGIJCX"
            },
            "dateAdded": "2024-03-07T00:47:03Z",
            "dateModified": "2024-03-07T00:47:03Z"
        }
    },
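
The two records describe the same article but are not identical, which is easier to see with a field-by-field comparison. The sketch below is only an illustration: it compares a hand-copied subset of the `data` fields from the two records above, and the helper name `differing_fields` is hypothetical.

```python
def differing_fields(a, b):
    """Return {field: (value in a, value in b)} for fields whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Subset of "data" from YWNCWPYJ (created by acsherman).
record_ywncwpyj = {
    "DOI": "10.3390/v8090262",
    "date": "2016-09-20",
    "journalAbbreviation": "Viruses",
    "ISSN": "1999-4915",
    "libraryCatalog": "DOI.org (Crossref)",
}

# Same subset of "data" from 2PWXAVQL (created by deeannreeder).
record_2pwxavql = {
    "DOI": "10.3390/v8090262",
    "date": "2016/9",
    "journalAbbreviation": "",
    "ISSN": "",
    "libraryCatalog": "www.mdpi.com",
}

for field, (left, right) in differing_fields(record_ywncwpyj, record_2pwxavql).items():
    print(f"{field}: {left!r} != {right!r}")
```

Only the DOI agrees in this subset, which suggests that exact record comparison would not have flagged these two entries as duplicates.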

jhpoelen commented 1 month ago

Note, however, that one of the two original records from the v0.2 version of batlit appears to still be available in the current version, as accessed on 2024-06-03:

https://www.zotero.org/groups/5435545/bat_literature_project/items/YWNCWPYJ

and one is no longer available:

https://www.zotero.org/groups/5435545/bat_literature_project/items/2PWXAVQL

Luckily, Zotero keeps a record of the replaced item 2PWXAVQL in the statement "dc:replaces": "http://zotero.org/groups/5435545/items/2PWXAVQL" in the updated metadata, retrieved via

ZOTERO_TOKEN=[SECRET] preston track https://api.zotero.org/groups/5435545/items/YWNCWPYJ

containing

{
    "key": "YWNCWPYJ",
    "version": 11620,
    "library": {
        "type": "group",
        "id": 5435545,
        "name": "Bat Literature Project",
        "links": {
            "alternate": {
                "href": "https://www.zotero.org/groups/bat_literature_project",
                "type": "text/html"
            }
        }
    },
    "links": {
        "self": {
            "href": "https://api.zotero.org/groups/5435545/items/YWNCWPYJ",
            "type": "application/json"
        },
        "alternate": {
            "href": "https://www.zotero.org/groups/bat_literature_project/items/YWNCWPYJ",
            "type": "text/html"
        },
        "attachment": {
            "href": "https://api.zotero.org/groups/5435545/items/JGTYJ4TR",
            "type": "application/json",
            "attachmentType": "application/pdf",
            "attachmentSize": 1609849
        }
    },
    "meta": {
        "createdByUser": {
            "id": 13229919,
            "username": "acsherman",
            "name": "",
            "links": {
                "alternate": {
                    "href": "https://www.zotero.org/acsherman",
                    "type": "text/html"
                }
            }
        },
        "creatorSummary": "Abdelgawad et al.",
        "parsedDate": "2016-09-20",
        "numChildren": 3
    },
    "data": {
        "key": "YWNCWPYJ",
        "version": 11620,
        "itemType": "journalArticle",
        "title": "Zebra Alphaherpesviruses (EHV-1 and EHV-9): Genetic Diversity, Latency and Co-Infections",
        "creators": [
            {
                "creatorType": "author",
                "firstName": "Azza",
                "lastName": "Abdelgawad"
            },
            {
                "creatorType": "author",
                "firstName": "Armando",
                "lastName": "Damiani"
            },
            {
                "creatorType": "author",
                "firstName": "Simon",
                "lastName": "Ho"
            },
            {
                "creatorType": "author",
                "firstName": "Günter",
                "lastName": "Strauss"
            },
            {
                "creatorType": "author",
                "firstName": "Claudia",
                "lastName": "Szentiks"
            },
            {
                "creatorType": "author",
                "firstName": "Marion",
                "lastName": "East"
            },
            {
                "creatorType": "author",
                "firstName": "Nikolaus",
                "lastName": "Osterrieder"
            },
            {
                "creatorType": "author",
                "firstName": "Alex",
                "lastName": "Greenwood"
            }
        ],
        "abstractNote": "Alphaherpesviruses are highly prevalent in equine populations and co-infections with more than one of these viruses’ strains frequently diagnosed. Lytic replication and latency with subsequent reactivation, along with new episodes of disease, can be influenced by genetic diversity generated by spontaneous mutation and recombination. Latency enhances virus survival by providing an epidemiological strategy for long-term maintenance of divergent strains in animal populations. The alphaherpesviruses equine herpesvirus 1 (EHV-1) and 9 (EHV-9) have recently been shown to cross species barriers, including a recombinant EHV-1 observed in fatal infections of a polar bear and Asian rhinoceros. Little is known about the latency and genetic diversity of EHV-1 and EHV-9, especially among zoo and wild equids. Here, we report evidence of limited genetic diversity in EHV-9 in zebras, whereas there is substantial genetic variability in EHV-1. We demonstrate that zebras can be lytically and latently infected with both viruses concurrently. Such a co-occurrence of infection in zebras suggests that even relatively slow-evolving viruses such as equine herpesviruses have the potential to diversify rapidly by recombination. This has potential consequences for the diagnosis of these viruses and their management in wild and captive equid populations.",
        "publicationTitle": "Viruses",
        "volume": "8",
        "issue": "9",
        "pages": "262",
        "date": "2016-09-20",
        "series": "",
        "seriesTitle": "",
        "seriesText": "",
        "journalAbbreviation": "Viruses",
        "language": "en",
        "DOI": "10.3390/v8090262",
        "ISSN": "1999-4915",
        "shortTitle": "Zebra Alphaherpesviruses (EHV-1 and EHV-9)",
        "url": "http://www.mdpi.com/1999-4915/8/9/262",
        "accessDate": "2024-04-18T21:02:03Z",
        "archive": "",
        "archiveLocation": "",
        "libraryCatalog": "DOI.org (Crossref)",
        "callNumber": "",
        "rights": "https://creativecommons.org/licenses/by/4.0/",
        "extra": "",
        "tags": [
            {
                "tag": "EHV-1",
                "type": 1
            },
            {
                "tag": "EHV-9",
                "type": 1
            },
            {
                "tag": "co-occurrence",
                "type": 1
            },
            {
                "tag": "diversity",
                "type": 1
            },
            {
                "tag": "latency",
                "type": 1
            },
            {
                "tag": "zebra",
                "type": 1
            }
        ],
        "collections": [
            "UAWY6DNP"
        ],
        "relations": {
            "dc:replaces": "http://zotero.org/groups/5435545/items/2PWXAVQL",
            "owl:sameAs": "http://zotero.org/groups/2719577/items/WNTGIJCX"
        },
        "dateAdded": "2024-03-07T00:47:03Z",
        "dateModified": "2024-05-30T19:18:31Z"
    }
}

jhpoelen commented 1 month ago

Which gives us a possible strategy for dealing with duplicate literature entities:

  1. Aja (or other curator) deduplicates entries using the Zotero merge tool
  2. Jorrit (or some other data archivist) takes a versioned snapshot of the Zotero group
  3. Preston (or some other robot) detects "dc:replaces": "http://zotero.org/groups/5435545/items/2PWXAVQL" and translates this into an action to annotate any existing Zenodo record associated with http://zotero.org/groups/5435545/items/2PWXAVQL as deprecated and replaced by https://www.zotero.org/groups/bat_literature_project/items/YWNCWPYJ.
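
The detection part of step 3 could look roughly like the sketch below. It is an assumption-laden illustration, not Preston's implementation: `replaced_items` is a hypothetical helper, the item is trimmed to the fields it reads, and the code accepts either a single string or a list under `dc:replaces` in case an item has absorbed several duplicates. Annotating the corresponding Zenodo records is left out.

```python
def replaced_items(item):
    """Yield (replaced URI, replacing URI) pairs found in one Zotero item."""
    replacing = item["links"]["alternate"]["href"]
    replaces = item["data"].get("relations", {}).get("dc:replaces", [])
    if isinstance(replaces, str):  # a single replaced item is given as a plain string
        replaces = [replaces]
    for replaced in replaces:
        yield replaced, replacing


# Trimmed copy of the merged record YWNCWPYJ shown above.
item = {
    "links": {
        "alternate": {
            "href": "https://www.zotero.org/groups/bat_literature_project/items/YWNCWPYJ"
        }
    },
    "data": {
        "relations": {
            "dc:replaces": "http://zotero.org/groups/5435545/items/2PWXAVQL",
            "owl:sameAs": "http://zotero.org/groups/2719577/items/WNTGIJCX",
        }
    },
}

for replaced, replacing in replaced_items(item):
    print(f"{replaced} is deprecated and replaced by {replacing}")
```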

jhpoelen commented 1 month ago

@ajacsherman @myrmoteras I've added the proposed deduplication workflow for your review at https://bat-literature.github.io/#deduplication-workflow .

myrmoteras commented 2 weeks ago

@ajacsherman @jhpoelen can we add a deduplication step that compares against what is already in the BLR and coviho communities?

We also need to consider how to do this routinely if we allow others to upload records to the batlit community.

Maybe we should deduplicate against all of Zenodo?
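
A routine check of this kind could be a screening pass that runs before upload. The sketch below assumes the DOIs already present in another community have been collected by some other means (how to obtain them from Zenodo is not covered here); the function name, record layout and sample inputs are hypothetical.

```python
def screen_uploads(incoming, known_dois):
    """Split incoming records into (new, already_present) by DOI."""
    known = {doi.strip().lower() for doi in known_dois}
    new, already_present = [], []
    for record in incoming:
        doi = (record.get("doi") or "").strip().lower()
        # Records without a DOI cannot be matched here and are treated as new.
        if doi and doi in known:
            already_present.append(record)
        else:
            new.append(record)
    return new, already_present


# Hypothetical inputs: one DOI assumed to be known elsewhere, two candidate uploads.
known_dois = ["10.3390/v8090262"]
incoming = [
    {"id": "YWNCWPYJ", "doi": "10.3390/V8090262"},
    {"id": "I8FSKML3", "doi": "10.1038/s41564-018-0227-2"},
]

new, already_present = screen_uploads(incoming, known_dois)
print("new:", [r["id"] for r in new])
print("already present:", [r["id"] for r in already_present])
```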