bio-guoda / preston

a biodiversity dataset tracker
MIT License

use bloom filters to calculate common terms across datasets #113

Closed jhpoelen closed 2 years ago

jhpoelen commented 3 years ago

Currently, Preston supports a "grep" over all content in a biodiversity data graph. The preston grep command digs into archives and compressed content to detect terms matching the provided regular expression.

The current functionality connects each specific matching term to a content id. Given that datasets may contain up to millions of matched terms (e.g., Arctos identifiers), the resulting graphs become large and hard for a modest triple store or graph database to handle.

Because we are interested in how datasets connect, we first want to know which datasets connect, and then proceed to better understand how they connect. Bloom filters offer a space- and time-efficient way to approximate set membership and to estimate the cardinality of set intersections.

In our case, calculating a bloom filter with all matched terms for each dataset enables us to compute the approximate overlap (intersection) between matched terms across datasets.

Implementation sketch:

  1. match all content against provided regular expression
  2. add each matched term to a dataset specific bloomfilter (one bloom filter per content hash)
  3. on non-empty bloom filter, store a serialized copy of the resulting bloom filter in the biodiversity graph on completion of an individual content/dataset scan
  4. for each dataset, report the number of matched terms shared with every other dataset, using all reported bloom filters. Note: this requires knowledge of all available bloom filters and suggests a grid (pairwise) search with potential scalability issues. The assumption is that, for suitable identifier schemes, only a relatively small number of datasets will share identifiers.
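
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Preston's implementation: the `BloomFilter` class, its size `m`, its hash count `k`, the SHA-256 based hashing, and the helper `bloom_for` are all hypothetical. The overlap is estimated with the usual fill-ratio estimate n ≈ -(m/k) ln(1 - X/m), applied to both filters and to their bitwise OR, combined by inclusion-exclusion.

```python
import hashlib
import math
import re


class BloomFilter:
    """Toy Bloom filter: m bits, k positions per term derived from SHA-256."""

    def __init__(self, m=1 << 16, k=4):
        self.m = m      # number of bits (assumed; a power of two)
        self.k = k      # number of hash positions per term (assumed)
        self.bits = 0   # a Python int used as a bit set

    def _positions(self, term):
        digest = hashlib.sha256(term.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step: k distinct positions
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        # May return a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))

    def estimated_size(self, bits=None):
        """Estimate the number of distinct terms from the count of set bits."""
        bits = self.bits if bits is None else bits
        set_bits = bin(bits).count("1")
        if set_bits >= self.m:
            return float("inf")
        return -(self.m / self.k) * math.log(1 - set_bits / self.m)

    def estimated_overlap(self, other):
        """|A ∩ B| ≈ |A| + |B| - |A ∪ B|, with the union taken as bitwise OR."""
        union = self.estimated_size(self.bits | other.bits)
        return self.estimated_size() + other.estimated_size() - union


def bloom_for(content, pattern):
    """Steps 1 and 2: match content against a regex, add each match to a filter."""
    bloom = BloomFilter()
    for term in re.findall(pattern, content):
        bloom.add(term)
    return bloom


# Two made-up "datasets" sharing two matched terms.
a = bloom_for("x1 x2 x3 shared1 shared2", r"[a-z]+[0-9]+")
b = bloom_for("y1 y2 shared1 shared2", r"[a-z]+[0-9]+")
overlap_estimate = a.estimated_overlap(b)
print(round(overlap_estimate, 2))  # an estimate close to 2, not an exact count
```

Both filters must use the same `m`, `k`, and hash functions for the bitwise OR to be meaningful, which is one reason to fix these parameters before scanning any content.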

jhpoelen commented 3 years ago

fyi @mielliott

I've implemented a first pass at using bloom filters to estimate overlapping values.

Pattern:

$ preston ls | preston grep "[a-z]+" | preston bloom | preston diff
...
<urn:uuid:cd758dfd-c54f-4748-8c79-e67b9b93d50b> <http://www.w3.org/ns/prov#wasInformedBy> <urn:uuid:79038204-5eb7-499a-ad9a-72a068ab96c4> <urn:uuid:cd758dfd-c54f-4748-8c79-e67b9b93d50b> .
<hash://sha256/8a1d81d68eb1f17851ec87d98c63d2e5756436cf9e01b13103e86b6b52cf712b> <http://purl.obolibrary.org/obo/RO_0002131> <hash://sha256/b04a895f356a3884c89a9ab76140343ca4be71b495c406a3939aa34d7fcb8290> <urn:uuid:cd758dfd-c54f-4748-8c79-e67b9b93d50b> .
<urn:uuid:a2c81e40-96a4-4dde-bc11-7356868dbbdb> <http://www.w3.org/ns/prov#value> "2"^^<http://www.w3.org/2001/XMLSchema#long> <urn:uuid:cd758dfd-c54f-4748-8c79-e67b9b93d50b> .
<urn:uuid:a2c81e40-96a4-4dde-bc11-7356868dbbdb> <http://www.w3.org/ns/prov#qualifiedGeneration> <urn:uuid:c6f579fd-bf48-4320-a4d2-94b29205033d> <urn:uuid:cd758dfd-c54f-4748-8c79-e67b9b93d50b> .
<urn:uuid:c6f579fd-bf48-4320-a4d2-94b29205033d> <http://www.w3.org/ns/prov#used> <hash://sha256/8a1d81d68eb1f17851ec87d98c63d2e5756436cf9e01b13103e86b6b52cf712b> <urn:uuid:cd758dfd-c54f-4748-8c79-e67b9b93d50b> .
<urn:uuid:c6f579fd-bf48-4320-a4d2-94b29205033d> <http://www.w3.org/ns/prov#used> <bloom:gz:hash://sha256/701f71f7f8385951ed2c66ce4a70f129446d0bd31df957e24a092ee0e80bae48> <urn:uuid:cd758dfd-c54f-4748-8c79-e67b9b93d50b> .
<urn:uuid:c6f579fd-bf48-4320-a4d2-94b29205033d> <http://www.w3.org/ns/prov#used> <hash://sha256/b04a895f356a3884c89a9ab76140343ca4be71b495c406a3939aa34d7fcb8290> <urn:uuid:cd758dfd-c54f-4748-8c79-e67b9b93d50b> .
<urn:uuid:c6f579fd-bf48-4320-a4d2-94b29205033d> <http://www.w3.org/ns/prov#used> <bloom:gz:hash://sha256/bc49ae61d0ed2d6fd291bacce98bdf7061f3e497e9a94207eb4f1a219d57f2d3> <urn:uuid:cd758dfd-c54f-4748-8c79-e67b9b93d50b> .
...

which first lists a biodiversity data graph (preston ls), then uses a regular expression to select values (preston grep [regular expression]), then calculates a bloom filter for each content from these values (preston bloom), and finally calculates the overlap/diff between bloom filters (preston diff).

Preston reports overlap using a qualified generation with estimated number of overlapping values:

<urn:uuid:a2c...> <...#value> "2"^^<http://www.w3.org/2001/XMLSchema#long> <urn:uuid:cd75...> .

as well as an "overlaps with" relation (http://purl.obolibrary.org/obo/RO_0002131) between the two related content hashes:

<hash://sha256/8a...> <http://purl.obolibrary.org/obo/RO_0002131> <hash://sha256/b04a...> 
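
To pull the overlapping pairs and their estimated counts back out of such output, something like the following Python sketch could be used. It is not part of Preston: the `QUAD` regular expression is a simplification rather than a full N-Quads parser, and joining the "overlaps with" statement to its value through the shared graph label (the fourth term) is an assumption based on the sample output above. The two sample lines are copied from that output.

```python
import re

OVERLAPS_WITH = "http://purl.obolibrary.org/obo/RO_0002131"
PROV_VALUE = "http://www.w3.org/ns/prov#value"

# subject, predicate, object (IRI or typed literal), graph label
QUAD = re.compile(
    r'^<([^>]+)> <([^>]+)> (<[^>]+>|"[^"]*"\^\^<[^>]+>) <([^>]+)> \.$'
)


def overlaps(lines):
    """Return (left hash, right hash, estimated shared values) per overlap."""
    pairs, values = [], {}
    for line in lines:
        match = QUAD.match(line.strip())
        if not match:
            continue  # skip anything this simplified pattern does not cover
        subject, predicate, obj, graph = match.groups()
        if predicate == OVERLAPS_WITH:
            pairs.append((subject, obj.strip("<>"), graph))
        elif predicate == PROV_VALUE:
            values[graph] = int(obj.split('"')[1])
    return [(left, right, values.get(graph)) for left, right, graph in pairs]


sample = [
    "<hash://sha256/8a1d81d68eb1f17851ec87d98c63d2e5756436cf9e01b13103e86b6b52cf712b> "
    "<http://purl.obolibrary.org/obo/RO_0002131> "
    "<hash://sha256/b04a895f356a3884c89a9ab76140343ca4be71b495c406a3939aa34d7fcb8290> "
    "<urn:uuid:cd758dfd-c54f-4748-8c79-e67b9b93d50b> .",
    "<urn:uuid:a2c81e40-96a4-4dde-bc11-7356868dbbdb> "
    "<http://www.w3.org/ns/prov#value> "
    '"2"^^<http://www.w3.org/2001/XMLSchema#long> '
    "<urn:uuid:cd758dfd-c54f-4748-8c79-e67b9b93d50b> .",
]

result = overlaps(sample)
for left, right, shared in result:
    print(left, "overlaps with", right, "estimated shared values:", shared)
```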

jhpoelen commented 3 years ago

Needless to say, this implementation does not scale well: although bloom filters can be calculated in parallel, comparing them requires a pairwise comparison of every bloom filter against every other one:

with a provenance graph containing three datasets { A, B, C } we'd compare:

AB, BC, AC

with four { A, B, C, D }:

BA CA CB DA DB DC ...

etc.

Ouch!
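
To make the growth concrete: n datasets need n(n-1)/2 pairwise comparisons. A small Python sketch (the function names are made up for illustration):

```python
from itertools import combinations


def comparisons(datasets):
    """All unordered pairs of datasets that would need a bloom filter diff."""
    return list(combinations(datasets, 2))


def comparison_count(n):
    """Number of unordered pairs among n datasets: n * (n - 1) / 2."""
    return n * (n - 1) // 2


print(comparisons("ABC"))       # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(len(comparisons("ABCD"))) # 6
print(comparison_count(100))    # 4950
print(comparison_count(1000))   # 499500
```

The number of comparisons grows quadratically with the number of datasets, while the bloom filter generation itself grows only linearly.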

jhpoelen commented 3 years ago

Started a test with the Meise Botanic Garden identifier pattern (i.e., 'http://www.botanicalcollections.be/specimen/[a-zA-Z]+[0-9]+V?') across a recent GBIF/iDigBio version at 4pm Pacific on 2021-03-26 using:

preston cat $REMOTE hash://sha256/1fd3e156c6ba1632a27b2bebaea36f76afeac8dfecf530d772988832821304ea\
| tee progress.log\
| ./preston-bloom grep --no-cache $REMOTE 'http://www.botanicalcollections.be/specimen/[a-zA-Z]+[0-9]+V?' \
| ./preston-bloom bloom $REMOTE --no-cache\
| ./preston-bloom process

# then do diff
preston ls\
| ./preston-bloom diff\
| tee progress-diff.log\
| preston process

where $REMOTE is some --remote [url]
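
For a quick check of what the Meise pattern selects, here is a small Python sketch. It uses the regular expression verbatim as given above; the sample text and the helper name `matched_terms` are made up, and the two Meise identifiers in it are the ones mentioned elsewhere in this thread. Note that the dots in the pattern are unescaped, so strictly speaking they match any character; this is harmless here but makes the pattern slightly more permissive than it looks.

```python
import re

# Pattern copied verbatim from the preston grep invocation above.
MEISE_PATTERN = re.compile(
    r"http://www.botanicalcollections.be/specimen/[a-zA-Z]+[0-9]+V?"
)


def matched_terms(text):
    """Distinct matched identifiers: the values that would feed a bloom filter."""
    return set(MEISE_PATTERN.findall(text))


# Made-up content; only the Meise identifiers are taken from this thread.
sample = (
    "id,reference\n"
    "1,http://www.botanicalcollections.be/specimen/BR0000005434497\n"
    "2,http://www.botanicalcollections.be/specimen/BR0000006953164\n"
    "3,http://www.botanicalcollections.be/specimen/BR0000005434497\n"
    "4,https://example.org/specimen/123\n"
)

terms = matched_terms(sample)
print(sorted(terms))
```

Duplicates collapse into a single term, which mirrors how adding the same value to a bloom filter twice leaves the filter unchanged.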

jhpoelen commented 3 years ago

The Meise identifier bloom generation and diff for the latest snapshot of iDigBio/GBIF took about 2 hours to calculate. It detected three datasets using the Meise identifier scheme, with two dataset pairs having overlapping identifiers:

$ preston ls | grep RO
<hash://sha256/5f46373d3a755e9d73ec6473ee4a9935685fff8f6461ae00293598d9048c568d> <http://purl.obolibrary.org/obo/RO_0002131> <hash://sha256/408cbce2e62ec55ea5e7368f146a4af5732441a254d1b04dd569ab059ecbce29> <urn:uuid:ef540c83-10fd-4c48-993d-744329b7a445> .
<hash://sha256/5f46373d3a755e9d73ec6473ee4a9935685fff8f6461ae00293598d9048c568d> <http://purl.obolibrary.org/obo/RO_0002131> <hash://sha256/c57f492c224215e6577ba486e5c13ffaeca91d6c2afccbbbf67e09d5e24c5425> <urn:uuid:0cf46c09-0974-44f1-a354-575a5eb37efa> .

with associated EML descriptors:

<?xml version='1.0' encoding='utf-8'?><eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.0.1" xmlns:md="eml://ecoinformatics.org/methods-2.0.1" xmlns:proj="eml://ecoinformatics.org/project-2.0.1" xmlns:d="eml://ecoinformatics.org/dataset-2.0.1" xmlns:res="eml://ecoinformatics.org/resource-2.0.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/terms/" system="Plazi" scope="system" packageId="A5574A30AD0F2621CE79A4483D39FFA6/eml-1609779870709.xml"><dataset><alternateIdentifier>A5574A30AD0F2621CE79A4483D39FFA6</alternateIdentifier><alternateIdentifier>https://doi.org/10.12705/675.7</alternateIdentifier><alternateIdentifier>a2afe874-9ec7-4101-8f63-da98506a340b</alternateIdentifier><alternateIdentifier>0024-4082</alternateIdentifier><alternateIdentifier>1488314</alternateIdentifier><citation>Thomas Borsch, Hilda Flores-Olvera, Silvia Zumaya, Kai Müller (2018): Pollen characters and DNA sequence data converge on a monophyletic genus Iresine (Amaranthaceae, Caryophyllales) and help to elucidate its species diversity. 
Taxon 67 (5): 944-976, DOI: https://doi.org/10.12705/675.7</citation><title>Pollen characters and DNA sequence data converge on a monophyletic genus Iresine (Amaranthaceae, Caryophyllales) and help to elucidate its species diversity</title><creator><individualName><surName>Thomas Borsch</surName></individualName></creator><creator><individualName><surName>Hilda Flores-Olvera</surName></individualName></creator><creator><individualName><surName>Silvia Zumaya</surName></individualName></creator><creator><individualName><surName>Kai Müller</surName></individualName></creator><pubDate>2018</pubDate><language>en</language><abstract><para>This dataset contains the digitized treatments in Plazi based on the original journal article Thomas Borsch, Hilda Flores-Olvera, Silvia Zumaya, Kai Müller (2018): Pollen characters and DNA sequence data converge on a monophyletic genus Iresine (Amaranthaceae, Caryophyllales) and help to elucidate its species diversity. Taxon 67 (5): 944-976, DOI: https://doi.org/10.12705/675.7</para></abstract><intellectualRights><para>To the extent possible under law, the publisher has waived all copyright and related or neighboring rights to these data, and has released them under<ulink url="https://creativecommons.org/publicdomain/zero/1.0/"><citetitle>CC0 Public Domain Dedication</citetitle></ulink>.
        Users may copy, modify, distribute and use the work, including for commercial purposes.</para><para>No known copyright restrictions apply. See Agosti, D., Egloff, W., 2009. Taxonomic information exchange and copyright: the Plazi approach. BMC Research Notes 2009, 2:53<ulink url="https://doi.org/10.1186/1756-0500-2-53"><citetitle>https://doi.org/10.1186/1756-0500-2-53</citetitle></ulink>for further explanation.</para></intellectualRights><distribution scope="document"><online><url function="information">http://tb.plazi.org/GgServer/summary/A5574A30AD0F2621CE79A4483D39FFA6</url></online></distribution><contact><individualName><givenName>Guido</givenName><surName>Sautter</surName></individualName><electronicMailAddress>gsautter@gmail.com</electronicMailAddress><onlineUrl>http://plazi.org</onlineUrl></contact><associatedParty><organizationName>Plazi</organizationName><address><city>Bern</city><country>Switzerland</country></address><electronicMailAddress>info@plazi.org</electronicMailAddress><onlineUrl>http://plazi.org/</onlineUrl><role>publisher</role></associatedParty><metadataProvider><organizationName>Plazi</organizationName><individualName><surName>admin</surName></individualName></metadataProvider></dataset><additionalMetadata><metadata><gbif><dateStamp>2021-01-04T17:04:30+0000</dateStamp><citation>Thomas Borsch, Hilda Flores-Olvera, Silvia Zumaya, Kai Müller (2018): Pollen characters and DNA sequence data converge on a monophyletic genus Iresine (Amaranthaceae, Caryophyllales) and help to elucidate its species diversity. Taxon 67 (5): 944-976, DOI: https://doi.org/10.12705/675.7</citation></gbif><plaziMods><mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
<mods:titleInfo>
<mods:title>Pollen characters and DNA sequence data converge on a monophyletic genus Iresine (Amaranthaceae, Caryophyllales) and help to elucidate its species diversity</mods:title>
</mods:titleInfo>
<mods:name type="personal">
<mods:role>
<mods:roleTerm>Author</mods:roleTerm>
</mods:role>
<mods:namePart>Thomas Borsch</mods:namePart>
</mods:name>
<mods:name type="personal">
<mods:role>
<mods:roleTerm>Author</mods:roleTerm>
</mods:role>
<mods:namePart>Hilda Flores-Olvera</mods:namePart>
</mods:name>
<mods:name type="personal">
<mods:role>
<mods:roleTerm>Author</mods:roleTerm>
</mods:role>
<mods:namePart>Silvia Zumaya</mods:namePart>
</mods:name>
<mods:name type="personal">
<mods:role>
<mods:roleTerm>Author</mods:roleTerm>
</mods:role>
<mods:namePart>Kai Müller</mods:namePart>
</mods:name>
<mods:typeOfResource>text</mods:typeOfResource>
<mods:relatedItem type="host">
<mods:titleInfo>
<mods:title>Taxon</mods:title>
</mods:titleInfo>
<mods:part>
<mods:date>2018</mods:date>
<mods:detail type="pubDate">
<mods:number>2018-10-31</mods:number>
</mods:detail>
<mods:detail type="volume">
<mods:number>67</mods:number>
</mods:detail>
<mods:detail type="issue">
<mods:number>5</mods:number>
</mods:detail>
<mods:extent unit="page">
<mods:start>944</mods:start>
<mods:end>976</mods:end>
</mods:extent>
</mods:part>
</mods:relatedItem>
<mods:classification>journal article</mods:classification>
<mods:identifier type="DOI">https://doi.org/10.12705/675.7</mods:identifier>
<mods:identifier type="GBIF-Dataset">a2afe874-9ec7-4101-8f63-da98506a340b</mods:identifier>
<mods:identifier type="ISSN">0024-4082</mods:identifier>
<mods:identifier type="Zenodo-Dep">1488314</mods:identifier>
</mods:mods></plaziMods></metadata></additionalMetadata></eml:eml><eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         xmlns:dc="http://purl.org/dc/terms/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.1/eml.xsd"
         packageId="http://www.pnwherbaria.org/data/getdataset.php?File=WTU_Vascular_DwCA.zip" system="http://gbif.org" scope="system"
         xml:lang="eng">
  <dataset>
    <alternateIdentifier>91e11e66-18d0-487f-b1f1-5f83a00ed76b</alternateIdentifier>
    <alternateIdentifier>http://www.pnwherbaria.org/data/getdataset.php?File=WTU_Vascular_DwCA.zip</alternateIdentifier>
    <title xml:lang="eng">Vascular Plant Collection, University of Washington Herbarium</title>
    <pubDate>Sat, 27 February 2021 02:05:27</pubDate>
    <language>eng</language>

    <contact>
      <individualName>
        <givenName>David</givenName>
        <surName>Giblin</surName>
      </individualName>
      <positionName>Collections Manager</positionName>
      <organizationName>University of Washington Herbarium</organizationName>
      <phone>1-206-543-1682</phone>
      <electronicMailAddress>wtu@u.washington.edu</electronicMailAddress>
      <onlineUrl>http://www.burkemuseum.org/herbarium</onlineUrl>
    </contact>

    <creator>
      <individualName>
        <givenName>David</givenName>
        <surName>Giblin</surName>
      </individualName>
      <positionName>Collections Manager</positionName>
      <organizationName>University of Washington Herbarium</organizationName>
      <phone>1-206-543-1682</phone>
      <electronicMailAddress>wtu@u.washington.edu</electronicMailAddress>
      <onlineUrl>http://www.burkemuseum.org/herbarium</onlineUrl>
    </creator>

    <metadataProvider>
      <individualName>
        <givenName>Ben</givenName>
        <surName>Legler</surName>
      </individualName>
      <organizationName>University of Washington Burke Museum</organizationName>
      <positionName>Informatics Specialist</positionName>
      <address>
        <deliveryPoint>Box 355325</deliveryPoint>
        <city>Seattle</city>
        <administrativeArea>WA</administrativeArea>
        <postalCode>98195</postalCode>
        <country>US</country>
      </address>
      <phone>1-206-221-5234</phone>
      <electronicMailAddress>blegler@u.washington.edu</electronicMailAddress>
      <onlineUrl>http://www.burkemuseum.org/herbarium</onlineUrl>
    </metadataProvider>

    <abstract>
      <para>The herbarium's total holdings number over 600,000 specimens of vascular and nonvascular plants, fungi, lichens, and marine algae. The herbarium maintains a regional focus on the Pacific Northwest, covering Washington, Oregon, Idaho, Montana, Alaska, British Columbia, and the Yukon Territory. Other significant collections come from California, the rest of Western North America, and the Pacific Rim. Our oldest specimens date to the late 1800's. Particularly active periods of growth for the herbarium occurred with the incorporation of the herbarium of J. William Thompson in 1943, collections made under the direction of C. Leo Hitchcock in the 1930's - 1950's, and field work since 2002.</para>
    </abstract>

    <keywordSet>

      <keyword>Herbarium</keyword>
      <keyword>specimen</keyword>
      <keyword>WTU</keyword>
      <keyword>Vascular Plants</keyword>
      <keyword>U.S.A.</keyword>
      <keyword>Washington</keyword>
      <keywordThesaurus>n/a</keywordThesaurus>
    </keywordSet>

    <intellectualRights>
      <para>License for specimen data: Public Domain (https://creativecommons.org/publicdomain/zero/1.0/). License for media, including photographs and images of specimens: CC BY-NC-SA 3.0 License (http://creativecommons.org/licenses/by-nc-sa/3.0/us/). Users of the data are encouraged to acknowledge the source of the data if any of these records or media are used for publications, analyses, reports, or on web sites. The provider of the data, and its staff, are not responsible for damages, injury or loss due to the use of these data. Fitness of use must be determined by the user of the data.</para>
    </intellectualRights>

    <distribution scope="document">
      <online>
        <url function="information">http://www.burkemuseum.org/herbarium</url>
      </online>
    </distribution>

    <coverage>
      <taxonomicCoverage>
        <generalTaxonomicCoverage>Vascular Plants</generalTaxonomicCoverage>
        <taxonomicClassification>
          <taxonRankName>unranked</taxonRankName>
          <taxonRankValue>Vascular Plants</taxonRankValue>
          <commonName>Vascular Plants</commonName>
        </taxonomicClassification>
      </taxonomicCoverage>
    </coverage>

  </dataset>
  <additionalMetadata>
    <metadata>
      <gbif>
        <dateStamp>2021-02-13PST05:22:27-08:00</dateStamp>
        <hierarchyLevel>dataset</hierarchyLevel>
        <citation>Specimen data provided by University of Washington Herbarium (Accessed YYYY-MM-DD)</citation>
        <resourceLogoUrl>http://www.pnwherbaria.org/logos/WTU_Vascular_92px.png</resourceLogoUrl>
      </gbif>
    </metadata>
  </additionalMetadata>
</eml:eml><eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         xmlns:dc="http://purl.org/dc/terms/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.1/eml.xsd"
         packageId="b740eaa0-0679-41dc-acb7-990d562dfa37/v1.18" system="http://gbif.org" scope="system"
         xml:lang="eng">

<dataset>
  <alternateIdentifier>b740eaa0-0679-41dc-acb7-990d562dfa37</alternateIdentifier>
  <alternateIdentifier>http://apm-ipt.br.fgov.be:8080/ipt/resource?r=botanical_collection</alternateIdentifier>
  <title xml:lang="eng">Meise Botanic Garden Herbarium (BR)</title>
      <creator>
    <individualName>
      <surName>Meise Botanic Garden</surName>
    </individualName>
    <organizationName>Meise Botanic Garden</organizationName>
    <positionName>Botanic Garden</positionName>
    <address>
        <deliveryPoint>Nieuwelaan 38</deliveryPoint>
        <city>Meise</city>
        <postalCode>1860</postalCode>
        <country>BE</country>
    </address>
    <onlineUrl>https://www.plantentuinmeise.be</onlineUrl>
      </creator>
      <metadataProvider>
    <individualName>
      <surName>Meise Botanic Garden</surName>
    </individualName>
    <organizationName>Meise Botanic Garden</organizationName>
    <positionName>Botanic Garden</positionName>
    <address>
        <deliveryPoint>Nieuwelaan 38</deliveryPoint>
        <city>Meise</city>
        <postalCode>1860</postalCode>
        <country>BE</country>
    </address>
    <onlineUrl>https://www.plantentuinmeise.be</onlineUrl>
      </metadataProvider>
      <associatedParty>
    <individualName>
        <givenName>Ann</givenName>
      <surName>Bogaerts</surName>
    </individualName>
    <organizationName>Meise Botanic Garden</organizationName>
    <positionName>Scientific Manager of the Herbarium</positionName>
    <address>
        <deliveryPoint>Nieuwelaan 38</deliveryPoint>
        <city>Meise</city>
        <postalCode>1860</postalCode>
        <country>BE</country>
    </address>
    <phone>+3222600957</phone>
    <electronicMailAddress>ann.bogaerts@plantentuinmeise.be</electronicMailAddress>
    <onlineUrl>http://www.plantentuinmeise.be</onlineUrl>
          <userId directory="http://orcid.org/">0000-0003-3435-2605</userId>
    <role>curator</role>
      </associatedParty>
      <associatedParty>
    <individualName>
        <givenName>Frederik</givenName>
      <surName>Leliaert</surName>
    </individualName>
    <organizationName>Meise Botanic Garden</organizationName>
    <positionName>Scientific director, Herbarium &amp; Library</positionName>
    <address>
        <deliveryPoint>Nieuwelaan 38</deliveryPoint>
        <city>Meise</city>
        <postalCode>1860</postalCode>
        <country>BE</country>
    </address>
    <electronicMailAddress>frederik.leliaert@plantentuinmeise.be</electronicMailAddress>
    <onlineUrl>http://www.plantentuinmeise.be</onlineUrl>
          <userId directory="http://orcid.org/">0000-0002-4627-7318</userId>
    <role>curator</role>
      </associatedParty>
      <associatedParty>
    <individualName>
        <givenName>Henry</givenName>
      <surName>Engledow</surName>
    </individualName>
    <organizationName>Meise Botanic Garden</organizationName>
    <positionName>Database Manager</positionName>
    <address>
        <deliveryPoint>Nieuwelaan 38</deliveryPoint>
        <city>Meise</city>
        <postalCode>1860</postalCode>
        <country>BE</country>
    </address>
    <electronicMailAddress>henry.engledow@plantentuinmeise.be</electronicMailAddress>
    <onlineUrl>http://www.plantentuinmeise.be</onlineUrl>
          <userId directory="http://orcid.org/">0000-0002-0779-8006</userId>
    <role>editor</role>
      </associatedParty>
      <associatedParty>
    <individualName>
        <givenName>Sofie</givenName>
      <surName>De Smedt</surName>
    </individualName>
    <organizationName>Meise Botanic Garden</organizationName>
    <positionName>Digitization manager</positionName>
    <address>
        <deliveryPoint>Nieuwelaan 38</deliveryPoint>
        <city>Meise</city>
        <postalCode>1860</postalCode>
        <country>BE</country>
    </address>
    <electronicMailAddress>sofie.desmedt@plantentuinmeise.be</electronicMailAddress>
    <onlineUrl>http://www.plantentuinmeise.be</onlineUrl>
          <userId directory="http://orcid.org/">0000-0001-7690-0468</userId>
    <role>curator</role>
      </associatedParty>
  <pubDate>
      2021-03-01
  </pubDate>
  <language>eng</language>
  <abstract>
    <para>Meise Botanic Garden (MeiseBG) has a long history that goes back to 1796. Today, it is an internationally recognized botanic garden in a domain of 92 hectares, and a center of excellence for plant biodiversity research with a rich collection. MeiseBG houses the 15th largest herbarium in the world, holding 4 million preserved specimens, a rich botanical library, a seed bank and a living plant collection with 25,000 different accessions from all around the world. Research focuses on plant, algal and fungal taxonomy, evolution, biodiversity conservation, ecosystems and ethnobotany. 
The preserved collections (including the herbarium, wood, carpological, slide and molecular collections) have a global geographical scope, with a focus on Central Africa, Belgium, and Southwestern Europe, with additionally  important historic collections from Latin America. Highlights are the private collections of famous 19th botanists such as Van Heurck (diatoms), Von Martius (Flora brasiliensis), von Reichenbach (Orchids) and Crépin (wild roses), which form the historic core of the collections. A wide range of taxonomic groups are covered including: vascular plants, lichens, mosses, liverworts, fungi, myxomycetes, macroalgae, and diatoms.
Meise Botanic Garden is dedicated to digitally unlock these  precious and unique botanical collections.
</para>
  </abstract>
      <keywordSet>
            <keyword>Occurrence</keyword>
        <keywordThesaurus>GBIF Dataset Type Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_type.xml</keywordThesaurus>
      </keywordSet>
      <keywordSet>
            <keyword>Specimen</keyword>
        <keywordThesaurus>GBIF Dataset Subtype Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_subtype.xml</keywordThesaurus>
      </keywordSet>
  <intellectualRights>
    <para>This work is licensed under a <ulink url="http://creativecommons.org/licenses/by/4.0/legalcode"><citetitle>Creative Commons Attribution (CC-BY) 4.0 License</citetitle></ulink>.</para>
  </intellectualRights>
  <distribution scope="document">
    <online>
      <url function="information">https://www.botanicalcollections.be/</url>
    </online>
  </distribution>
  <coverage>
      <geographicCoverage>
          <geographicDescription>Worldwide with a focus on Central Africa, Belgium and Southwestern Europe, as well as the Indo‐Pacific marine and the Antarctic regions, with additionally important historic collections from Latin America</geographicDescription>
        <boundingCoordinates>
          <westBoundingCoordinate>-180</westBoundingCoordinate>
          <eastBoundingCoordinate>180</eastBoundingCoordinate>
          <northBoundingCoordinate>90</northBoundingCoordinate>
          <southBoundingCoordinate>-90</southBoundingCoordinate>
        </boundingCoordinates>
      </geographicCoverage>
          <temporalCoverage>
              <rangeOfDates>
                  <beginDate>
                    <calendarDate>1727-01-01</calendarDate>
                  </beginDate>
                <endDate>
                  <calendarDate>2021-03-01</calendarDate>
                </endDate>
              </rangeOfDates>
          </temporalCoverage>
          <taxonomicCoverage>
              <taxonomicClassification>
                  <taxonRankName>kingdom</taxonRankName>
                <taxonRankValue>Plantae</taxonRankValue>
                  <commonName>plants</commonName>
              </taxonomicClassification>
              <taxonomicClassification>
                  <taxonRankName>kingdom</taxonRankName>
                <taxonRankValue>Fungi</taxonRankValue>
                  <commonName>fungi</commonName>
              </taxonomicClassification>
              <taxonomicClassification>
                  <taxonRankName>kingdom</taxonRankName>
                <taxonRankValue>Chromista</taxonRankValue>
              </taxonomicClassification>
              <taxonomicClassification>
                  <taxonRankName>kingdom</taxonRankName>
                <taxonRankValue>Protozoa</taxonRankValue>
              </taxonomicClassification>
          </taxonomicCoverage>
  </coverage>
  <maintenance>
    <description>
      <para>monthly updates</para>
    </description>
    <maintenanceUpdateFrequency>unkown</maintenanceUpdateFrequency>
  </maintenance>

      <contact>
    <individualName>
        <givenName>Mathias</givenName>
      <surName>Dillen</surName>
    </individualName>
    <organizationName>Meise Botanic Garden</organizationName>
    <positionName>Researcher</positionName>
    <address>
        <country>BE</country>
    </address>
    <electronicMailAddress>mathias.dillen@plantentuinmeise.be</electronicMailAddress>
          <userId directory="http://orcid.org/">0000-0002-3973-1252</userId>
      </contact>
      <contact>
    <individualName>
        <givenName>Quentin</givenName>
      <surName>Groom</surName>
    </individualName>
    <organizationName>Meise Botanic Garden</organizationName>
    <positionName>Researcher</positionName>
    <address>
        <country>BE</country>
    </address>
    <electronicMailAddress>quentin.groom@plantentuinmeise.be</electronicMailAddress>
          <userId directory="http://orcid.org/">0000-0002-0596-5376</userId>
      </contact>
  <methods>
        <methodStep>
          <description>
            <para></para>
          </description>
        </methodStep>
  </methods>
  <project >
    <title>Digitale Ontsluiting Erfgoedcollecties</title>
      <personnel>
        <individualName>
            <givenName>Ann</givenName>
          <surName>Bogaerts</surName>
        </individualName>
        <role>curator</role>
      </personnel>
      <personnel>
        <individualName>
            <givenName>Sofie</givenName>
          <surName>De Smedt</surName>
        </individualName>
        <role>pointOfContact</role>
      </personnel>
      <personnel>
        <individualName>
            <givenName>Henry</givenName>
          <surName>Engledow</surName>
        </individualName>
        <role>editor</role>
      </personnel>
      <personnel>
        <individualName>
            <givenName>Quentin</givenName>
          <surName>Groom</surName>
        </individualName>
              <userId directory="http://orcid.org/">0000-0002-0596-5376</userId>
        <role>publisher</role>
      </personnel>
      <personnel>
        <individualName>
            <givenName>Paul</givenName>
          <surName>Van Wambeke</surName>
        </individualName>
        <role>programmer</role>
      </personnel>
      <personnel>
        <individualName>
            <givenName>Marc</givenName>
          <surName>Sosef</surName>
        </individualName>
              <userId directory="http://orcid.org/">0000-0002-6997-5813</userId>
        <role>user</role>
      </personnel>
      <abstract>
        <para>Digitization of the Meise Botanic Garden Herbarium</para>
      </abstract>
      <funding>
        <para>The Flemish Government</para>
      </funding>
  </project>
</dataset>
  <additionalMetadata>
    <metadata>
      <gbif>
          <dateStamp>2018-04-05T09:14:23.573+02:00</dateStamp>
          <hierarchyLevel>dataset</hierarchyLevel>
            <citation>Meise Botanic Garden (2020) Meise Botanic Garden Herbarium (BR). v1. Dataset/Occurrence. http://apm-ipt.br.fgov.be:8080/ipt/resource?r=botanical_collection</citation>
          <resourceLogoUrl>http://apm-ipt.br.fgov.be:8080/ipt/logo.do?r=botanical_collection</resourceLogoUrl>
          <dc:replaces>b740eaa0-0679-41dc-acb7-990d562dfa37/v1.18.xml</dc:replaces>
      </gbif>
    </metadata>
  </additionalMetadata>
</eml:eml>

These include the main Meise catalog, a University of Washington Burke Museum catalog (e.g., http://www.botanicalcollections.be/specimen/BR0000005434497), and a taxonomic publication (e.g., http://www.botanicalcollections.be/specimen/BR0000006953164):

Thomas Borsch, Hilda Flores-Olvera, Silvia Zumaya, Kai Müller (2018): Pollen characters and DNA sequence data converge on a monophyletic genus Iresine (Amaranthaceae, Caryophyllales) and help to elucidate its species diversity. Taxon 67 (5): 944-976, DOI: https://doi.org/10.12705/675.7

fyi @qgroom

jhpoelen commented 3 years ago

Bloom filters and their estimated overlap were also calculated for the Arctos identifier scheme using:

# first traverse graph and calculate bloom filters
preston cat $REMOTE hash://sha256/1fd3e156c6ba1632a27b2bebaea36f76afeac8dfecf530d772988832821304ea\
| tee progress.log\
| ./preston-bloom grep --no-cache $REMOTE 'http[s]{0,1}://arctos.database.museum/guid/[a-zA-Z]+:[a-zA-Z]+:[a-zA-Z0-9().-]+[a-zA-Z0-9]' \
| ./preston-bloom bloom $REMOTE --no-cache\
| ./preston-bloom process

# then do diff
preston ls\
| ./preston-bloom diff\
| tee progress-diff.log\
| preston process

In total, 129 unique bloom filters were calculated between "2021-03-29T22:37:41.370Z" and "2021-03-30T01:21:32.549Z" (a little under 3 hours). The bloom diffs were calculated between "2021-03-30T01:21:32.549Z" and "2021-03-30T05:18Z" (a little under 4 hours), reporting 3794 unique non-zero approximate overlaps between bloom filters (~ 30 connections per bloom filter).
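
The figures above can be re-derived with a little arithmetic. The following Python sketch only recomputes values from the timestamps and counts reported in this comment; the helper `parse_utc` is made up, and the pair count assumes every bloom filter is compared against every other one exactly once.

```python
from datetime import datetime, timezone


def parse_utc(timestamp):
    """Parse the two timestamp shapes that appear in this comment."""
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%MZ"):
        try:
            return datetime.strptime(timestamp, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unsupported timestamp: {timestamp}")


bloom_start = parse_utc("2021-03-29T22:37:41.370Z")
bloom_end = parse_utc("2021-03-30T01:21:32.549Z")
diff_end = parse_utc("2021-03-30T05:18Z")

bloom_duration = bloom_end - bloom_start  # time spent building bloom filters
diff_duration = diff_end - bloom_end      # time spent diffing bloom filters

filters = 129
reported_overlaps = 3794
pairs_compared = filters * (filters - 1) // 2       # assumes one diff per unordered pair
overlaps_per_filter = reported_overlaps / filters   # the "~ 30" quoted above

print(bloom_duration, diff_duration)
print(pairs_compared, round(overlaps_per_filter, 1))
```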

The complete list of content hashes, ordered by descending connectivity (first column: number of datasets with overlapping Arctos identifiers; second column: reference dataset):
103 https://deeplinker.bio/cat/zip:hash://sha256/d498df855ebeb0714254ae27f095cc0d18d9a8a728edf004621515b8298ff43f!/eml.xml
67 https://deeplinker.bio/cat/zip:hash://sha256/bfdf8ab3cff3c974f1b1a1b30cc09b1f03067961944db084d8c46c3f9f4179c1!/eml.xml
65 https://deeplinker.bio/cat/zip:hash://sha256/d0d0a4047c8d884b6ed944fea41f1248125377da6996dee29d7b18fc6b9e5b5b!/eml.xml
64 https://deeplinker.bio/cat/zip:hash://sha256/58135d864837df3d4816e4d0c5a4a1c52f0a2608c74f30dbaaf1cabe135f9822!/eml.xml
63 https://deeplinker.bio/cat/zip:hash://sha256/e45e729a58dff71e68c1d9ea9cacbec347b9649af177b2703a3405e25cb00ad7!/eml.xml
61 https://deeplinker.bio/cat/zip:hash://sha256/bd568b0290aa50087841d8414a2e188c5406469fe3f40b55801ead2e40391208!/eml.xml
59 https://deeplinker.bio/cat/zip:hash://sha256/a8897e1185e09eddc48e5fca5f34fb9c7e92fba1b9a2ce1bbe77b0502ec4e33c!/eml.xml
57 https://deeplinker.bio/cat/zip:hash://sha256/30869702aa9fd66e555b94dd9eefaa653de1bbbc040feff6f40b9b6470769993!/eml.xml
57 https://deeplinker.bio/cat/zip:hash://sha256/1456e549483fd9ac97f028f0231f3c7888a63b72abfcd110446a4296dab25676!/eml.xml
56 https://deeplinker.bio/cat/zip:hash://sha256/a4826574c987ee11e86f44f80671802dda6ba8d6a2339fa28448ed39168961b5!/eml.xml
55 https://deeplinker.bio/cat/zip:hash://sha256/6cdcf277ce801a358bdf133168d0d147c7c65b9c40a8b173cddb855e7fb9fcfd!/eml.xml
54 https://deeplinker.bio/cat/zip:hash://sha256/5d248b7b9cebb64314460208d920d32fcf59e432770784c2f5eadf75e15716f8!/eml.xml
54 https://deeplinker.bio/cat/zip:hash://sha256/25dcbd8bba1aef04c520b76d5f94192bd47085ec75eef5e39e803730c1ba90bf!/eml.xml
54 https://deeplinker.bio/cat/zip:hash://sha256/1bd009d1ebcd061656441957b71bb8c1319b52bb4f712fb1a9f3cd89f880c5f9!/eml.xml
53 https://deeplinker.bio/cat/zip:hash://sha256/fcd629c0c8ebe92f560c6392ddf26975a53bc0343eba557a809186c5188bbbe1!/eml.xml
52 https://deeplinker.bio/cat/zip:hash://sha256/eb37acf410bee39bcb2327ea568fb847812dddf157e36e2296bfa72d7e3416ef!/eml.xml
52 https://deeplinker.bio/cat/zip:hash://sha256/80ed6a98fbe409e6f71f1fbcc50d63fa54ec1497827698868cddaf363c6a41ef!/eml.xml
51 https://deeplinker.bio/cat/zip:hash://sha256/e35753ba294673339b986a1065cef95e8f55679019367ec8f0008cd50ed3263d!/eml.xml
51 https://deeplinker.bio/cat/zip:hash://sha256/8b598e57b2cd3089a27fc3ee1c77966ba5c27e7ffd6dbefb8222b6d7e5f5e092!/eml.xml
51 https://deeplinker.bio/cat/zip:hash://sha256/32f75f27c82606d3f90b384618859dcfb4e30aa27a7541480ac9d592fd1eba68!/eml.xml
51 https://deeplinker.bio/cat/zip:hash://sha256/29f14c34799c03e095bc30b720a29b9a67ae2abfe4757e0e5c0d9b5d4ffea8c0!/eml.xml
50 https://deeplinker.bio/cat/zip:hash://sha256/dd195fa31c3bb298cdaccbcb5fe77ae64e18f193bdc73f238ad01cc642e7d7b3!/eml.xml
50 https://deeplinker.bio/cat/zip:hash://sha256/53ccb18c25f7fd87baaef785ecb78b3344ae2b598b177bb7994ee56eb57c540b!/eml.xml
50 https://deeplinker.bio/cat/zip:hash://sha256/413a19320aa365b9c13a93af20cc4f6f6f7a00129b68e92780c86f5da2792b13!/eml.xml
48 https://deeplinker.bio/cat/zip:hash://sha256/a34c2390e555e6b01388cdc3e2710809b8c8b441047be601814259ca0fa3db76!/eml.xml
47 https://deeplinker.bio/cat/zip:hash://sha256/df0ae8103de11fe57a5fd182de94092727095d00fef892dbc20566ffef0a46ce!/eml.xml
47 https://deeplinker.bio/cat/zip:hash://sha256/c642adb715feee454f863fba91a996c6b25509fa9ef193997f78af9985be40d8!/eml.xml
46 https://deeplinker.bio/cat/zip:hash://sha256/b81c1b18fe4449ffd8dc7e405af9e661f676a67ef7c2454227b5db082e093fb1!/eml.xml
46 https://deeplinker.bio/cat/zip:hash://sha256/34f87a9866b09ed8cec1bd7525c26f4618055b7e2b45f270b24f8dbde8494d7e!/eml.xml
46 https://deeplinker.bio/cat/zip:hash://sha256/21b9da174c7bea66eb2add3f13fe3b7de1c6a6bd019f8ac96a9d3dce6160d566!/eml.xml
46 https://deeplinker.bio/cat/zip:hash://sha256/1cf37420832aa0c3ffcb52ca78b715be999ac3ae1f25acef002f43e54b113dea!/eml.xml
44 https://deeplinker.bio/cat/zip:hash://sha256/b0237a3e2033aab0a5c40b5f07f67f0b470e34a404ccbe12d3bb6bae7bcfeb37!/eml.xml
42 https://deeplinker.bio/cat/zip:hash://sha256/f6b77cde9b3e4955a5b7e1463733c0d60af65f75468a252a133ceef510a9df96!/eml.xml
42 https://deeplinker.bio/cat/zip:hash://sha256/bd08b84262f8d4ae7c39e33bf85e3f48970fe3dc40840abc39a3082640468ab2!/eml.xml
42 https://deeplinker.bio/cat/zip:hash://sha256/4739ba22073cdd56934fae9959ac6ab3cb97ad7cee6184bd5d77ca27344203b5!/eml.xml
42 https://deeplinker.bio/cat/zip:hash://sha256/15899373e87144a9c732252ce6b55f7af67fe4385f1383a3b3fb84c78a6b4bc6!/eml.xml
40 https://deeplinker.bio/cat/zip:hash://sha256/da0ff4112229c2669df394b3843fcb430982a6669f05e6e90072e954ddc35bff!/eml.xml
40 https://deeplinker.bio/cat/zip:hash://sha256/8b58093d41be94e7dece782537a7948e64ecb090d6684bf582967e510609f4c3!/eml.xml
40 https://deeplinker.bio/cat/zip:hash://sha256/86fa07ae2d2a09283f625f0735090e0044ffc1249c5b5007503f5aa63a92259e!/eml.xml
39 https://deeplinker.bio/cat/zip:hash://sha256/c37154ff40d80eef5522fd7ed9190ac8d812adfcf25ea3a88f037205c2fbaa0e!/eml.xml
39 https://deeplinker.bio/cat/zip:hash://sha256/893b38f9fc61b8cc2bd619419194e5076d122eefdf9b48f733e9ae7e1797d49d!/eml.xml
39 https://deeplinker.bio/cat/zip:hash://sha256/6040e5c690fffd88df4a5a3aabad23cda0e2b5432b273d1f11a3246448cab1d1!/eml.xml
38 https://deeplinker.bio/cat/zip:hash://sha256/752c7207c98024c12a8ffb64b0d6423181ba5f5305c0fe53030ae91877571cd4!/eml.xml
38 https://deeplinker.bio/cat/zip:hash://sha256/069ae7fb715ee607ea92582767cdae429e6e705ae82f39cf027010cf7f203d48!/eml.xml
37 https://deeplinker.bio/cat/zip:hash://sha256/b1604fee95df02d5bb2aa5bbfe6692e3a1262ab923bad7d5887aa511722c6e4f!/eml.xml
37 https://deeplinker.bio/cat/zip:hash://sha256/6ded41cc7ec6f0fa1c488299d4999e99e21bc9a412d452e01e71c7b51a3a9dce!/eml.xml
37 https://deeplinker.bio/cat/zip:hash://sha256/28ab5a3dd2bf0def413c135837b107dec39f8f52d8205249470c49a2107561d7!/eml.xml
36 https://deeplinker.bio/cat/zip:hash://sha256/6ca45fc8e75ab43c61354426e135ce28b94d3d958bee9b6ff624e1875496c2b7!/eml.xml
34 https://deeplinker.bio/cat/zip:hash://sha256/cbc0c6b436878eef49a7c4b8551d0dece30d36f63d1fb51f842a78c854da75d9!/eml.xml
34 https://deeplinker.bio/cat/zip:hash://sha256/936254bb17c1a33bc97eac8dc04e3b5d8291e849de80c637f093b6167837c959!/eml.xml
34 https://deeplinker.bio/cat/zip:hash://sha256/6ca565f1654a7131a0e44d0a559ee11927e2abcd52bffa14baa93c928526e1e1!/eml.xml
33 https://deeplinker.bio/cat/zip:hash://sha256/a745aadb3bbe2263f57ac2c0c41a062ed6a0d4ec7d415578840bca616a1a6242!/eml.xml
32 https://deeplinker.bio/cat/zip:hash://sha256/10aeebea4344a92f9f0ccc462ca74476ef8ebabedcf854292fb9c4f689e59598!/eml.xml
31 https://deeplinker.bio/cat/zip:hash://sha256/f030de2d507e6fbb8b008090416f094f9e55441522691ecee5562e3c943d859c!/eml.xml
31 https://deeplinker.bio/cat/zip:hash://sha256/c3aaceb69572024e7baea9c932ce8adafae160e7d3b4bcc6cc2fbab15a1be80e!/eml.xml
31 https://deeplinker.bio/cat/zip:hash://sha256/2dec0c4f6b10f8ebf418b6b574729c7f6cd68267d5cec58e896654b80fc4f6f3!/eml.xml
30 https://deeplinker.bio/cat/zip:hash://sha256/4a26e4d747be5f2542f572146be3c2885f9ba1a4821daaa6a2bc9f34c0e21ca2!/eml.xml
28 https://deeplinker.bio/cat/zip:hash://sha256/befa0aabdf675f5adaaeeaabb5bb88690fc81f30d5b80a9e0eb902c5584118a3!/eml.xml
28 https://deeplinker.bio/cat/zip:hash://sha256/86a8a0e2f8b87f3b110fdf7b16888c79aea0a5eb337eea6df1a9abfc04d30037!/eml.xml
28 https://deeplinker.bio/cat/zip:hash://sha256/5e45fdecd916ba8550ee8443042962ba20418e99c13687208b1542b9d86e3b3c!/eml.xml
28 https://deeplinker.bio/cat/zip:hash://sha256/27ab6ebfd6c6baa8a4a7bc64504227f51482ba2181d2c880cad63cf8119fb60a!/eml.xml
27 https://deeplinker.bio/cat/zip:hash://sha256/ba20fc5d91eb82400bc03f1df4c1d175349f7946ea378167e99c30fdeabbd3c7!/eml.xml
27 https://deeplinker.bio/cat/zip:hash://sha256/9433518cb8c2833ae31e02b70bad217e2f72c57c03bbea1de16a882b89f44b98!/eml.xml
27 https://deeplinker.bio/cat/zip:hash://sha256/822e5d82d068d8cb6008c189f456d5d3d407c8c35a9f572e167289c448e6e24e!/eml.xml
27 https://deeplinker.bio/cat/zip:hash://sha256/238e2e744721f6bd174e668854915518a761dfb8f4c1d6c06f5920e827c58a48!/eml.xml
27 https://deeplinker.bio/cat/zip:hash://sha256/03e12f813aca56dfc4af8bf5e08d2dd09cbd95c76f5ae46c6471f086e81dac82!/eml.xml
26 https://deeplinker.bio/cat/zip:hash://sha256/b8a9c3cb70a7957656da502543fcbd50da71bf51daa29d227b59de32986decc0!/eml.xml
25 https://deeplinker.bio/cat/zip:hash://sha256/dfb38f5998e1a2f9a10c7ddb5519a1bcc53a1f525653bb0a52b6f0cd42a2f89c!/eml.xml
25 https://deeplinker.bio/cat/zip:hash://sha256/2bffe917329982cc53cd26f34344286c26421a0c2ea90db355c0ee02b6b2f55d!/eml.xml
24 https://deeplinker.bio/cat/zip:hash://sha256/b8e7224cd211d9ea636085875e44c2ce0452884c80c9ae0d12af2ec04aa438f0!/eml.xml
24 https://deeplinker.bio/cat/zip:hash://sha256/63297465d9345560cd53935e08b03a0478a2e430334ae5422ae76bde5b84f3ca!/eml.xml
24 https://deeplinker.bio/cat/zip:hash://sha256/3f9a2fcc0ff1c666c77d8011bb15ffe95d2e4a2476093756786bbfd5704a6b7c!/eml.xml
24 https://deeplinker.bio/cat/zip:hash://sha256/3942008b820e50e302e02720464adf1fa475598d110a8400be3d437c1c51e4b9!/eml.xml
24 https://deeplinker.bio/cat/zip:hash://sha256/20f4a60dbf45dafe230bda09cb7cc32ce53cc8902fd557c33843305418247f40!/eml.xml
23 https://deeplinker.bio/cat/zip:hash://sha256/46f94323d70f42c72972d7713f4253c44d9e452cad6ff746066553113fa4f47a!/eml.xml
22 https://deeplinker.bio/cat/zip:hash://sha256/974662af1ffdac63c13ffa5d426d0accddb4bd94aef857f65e982f5373517086!/eml.xml
22 https://deeplinker.bio/cat/zip:hash://sha256/711c7b903326b456f3f342139a06faf3fdaac80dd643d2347a8dbcc8d07d3102!/eml.xml
21 https://deeplinker.bio/cat/zip:hash://sha256/adfef31f401d339dca6f985f80bece874cf1be2ae60822c9fa9c44a5b4cbff27!/eml.xml
21 https://deeplinker.bio/cat/zip:hash://sha256/a16858acd02b8538f456c9c4ad4b3478a16eb3e88f3e48bc2f66038c810e74f9!/eml.xml
20 https://deeplinker.bio/cat/zip:hash://sha256/1a917c68e159bd37510147c4a3fe489085f4871a0257fecd8c6d55f021714cc9!/eml.xml
19 https://deeplinker.bio/cat/zip:hash://sha256/faf3e7ec497f9113a86ee118b1f62c7c281e8795a4e2eca6e0196c07d0a7cbd9!/eml.xml
19 https://deeplinker.bio/cat/zip:hash://sha256/741dc023f9091836097d727b44a91df28a09efc67d675e207c4d3367789a2919!/eml.xml
19 https://deeplinker.bio/cat/zip:hash://sha256/39119adac31024807bbfaa8306bb0c9e3d1b858ece3e8c02b4b3e1cd5cf94083!/eml.xml
18 https://deeplinker.bio/cat/zip:hash://sha256/e0803aea4f1e2c1d82cba712e35d8c6430241c2ff2a83d77cf7da8ccf832c4b8!/eml.xml
18 https://deeplinker.bio/cat/zip:hash://sha256/b16d0cdee319109625a82c361fefcf5815bec7c6ea3c2c64f4d950ce09cbbdab!/eml.xml
18 https://deeplinker.bio/cat/zip:hash://sha256/9e610b91a8b6403215dd51b5c22806e7baea143d98138576c9d46dae97dbd9f9!/eml.xml
18 https://deeplinker.bio/cat/zip:hash://sha256/774feed2f2531ca53a41d1911c69b749891e228b77c7af9eec8c691175d8431d!/eml.xml
18 https://deeplinker.bio/cat/zip:hash://sha256/67b8407b1afa40de25e968cb9309ee267117c29932a26f5cc82cb1f4d0147da4!/eml.xml
17 https://deeplinker.bio/cat/zip:hash://sha256/f9ccc03e1ba576662de182812a8c0c80b43c9009ead632e93e33a902eb1a37ac!/eml.xml
17 https://deeplinker.bio/cat/zip:hash://sha256/93bbc57b475d848f6d342ca25e9c9747a1ff410250a40ebb0096080d3d40885f!/eml.xml
16 https://deeplinker.bio/cat/zip:hash://sha256/4ff313ef69a26994b5a2c8eb3a78af803a718668ebec51b1311ca2c21cdd154e!/eml.xml
16 https://deeplinker.bio/cat/zip:hash://sha256/4d225e33972d6389cea6625ee94b6653a13740efc3d9ad82f3645868fedb8f30!/eml.xml
15 https://deeplinker.bio/cat/zip:hash://sha256/db8299ac524aa941142e0f372c286fde4b0746d3f77e03135beb555fa918d7b6!/eml.xml
15 https://deeplinker.bio/cat/zip:hash://sha256/377a1770a4fa0d240986cacdcb2aa4215b5a6872700d3c6fd4c2291a25832f71!/eml.xml
15 https://deeplinker.bio/cat/zip:hash://sha256/307367a7229b25ca11a9d5f64b0d53949cd52cf450a536ade9c0e423adb9aea6!/eml.xml
14 https://deeplinker.bio/cat/zip:hash://sha256/e970659bfbcfba40353b99b6d218f304ba457edf4beae9ccf84d2351cf4568b0!/eml.xml
14 https://deeplinker.bio/cat/zip:hash://sha256/c4b2a914eca2cbf56304d4557080bcbd96e47a73eee03e9b3ba4d902e31eb754!/eml.xml
14 https://deeplinker.bio/cat/zip:hash://sha256/bd8b2f809ccaf65d5f1fb8c034667ff74c8d1ef28e666e41730a6c54f84d11d2!/eml.xml
14 https://deeplinker.bio/cat/zip:hash://sha256/701f5a1ecfc075d5358d8c3529674e81e15158cc38d60716c862ebbacee73c51!/eml.xml
13 https://deeplinker.bio/cat/zip:hash://sha256/b4d7fde6c56261432c4904ff444d3046094a29f9f8d53f0b099bdfd666c264a4!/eml.xml
13 https://deeplinker.bio/cat/zip:hash://sha256/b49d7a4ab7629b3fb8ef670eb7aca4ba29c0eabdc846b12ee848936b11fb061c!/eml.xml
13 https://deeplinker.bio/cat/zip:hash://sha256/3be1843ed513a38918e6d0ed13540ca922c55d0d2c1e23a7cf55d8a35c1315a8!/eml.xml
12 https://deeplinker.bio/cat/zip:hash://sha256/95d6ab5426eab31ffb9ce1eda0c6bd0e14b1f9f63c0f1c9c8e4c17915f4caf46!/eml.xml
12 https://deeplinker.bio/cat/zip:hash://sha256/68dfb138300a54b0877196cbe65f27298468799a85540fb56558d87e51b83f39!/eml.xml
11 https://deeplinker.bio/cat/zip:hash://sha256/9d2998ac0c3b145a363c2bdae7f4bd0a5be04e653559d7dee0717f4341fa55af!/eml.xml
11 https://deeplinker.bio/cat/zip:hash://sha256/478213f06ab5edad7447cf5e37ce4c7a6c5e9291afb144786547db5bdc89cf7f!/eml.xml
11 https://deeplinker.bio/cat/zip:hash://sha256/2d2f790eed42bb5ba7cd7465703ee9cd9de96bbd4edc7bfe0b6732bb7666f396!/eml.xml
10 https://deeplinker.bio/cat/zip:hash://sha256/d13253d89bda1fe176196e50b52e2489de70260076720b6ca4834d3af659db01!/eml.xml
10 https://deeplinker.bio/cat/zip:hash://sha256/5160bec035b6704245795af5ff0978c76addec823ae1b81d04b78997a1deb7a4!/eml.xml
10 https://deeplinker.bio/cat/zip:hash://sha256/469f33e0f8fc1eaff284ed0691c6b0a652bb2d4cd7b078334c037913b609e7f3!/eml.xml
10 https://deeplinker.bio/cat/zip:hash://sha256/1eef2ad8924bbcab106859633594ec094f5c7b97915988f4824a0b90332eac20!/eml.xml
9 https://deeplinker.bio/cat/zip:hash://sha256/d44feac6a5bd0d36ccbbb2d8675664b0dbe78d20d714acf80a21cbb7eb1d90d6!/eml.xml
9 https://deeplinker.bio/cat/zip:hash://sha256/533843c011aaad4f080d4af474f517833e29a369e1c0ee4f94b429e31296bfb5!/eml.xml
9 https://deeplinker.bio/cat/zip:hash://sha256/4fe351e602564c7ccd9ac15a3f27daed01226c83b6751ce26f3c9eb1597874a4!/eml.xml
8 https://deeplinker.bio/cat/zip:hash://sha256/8c53d9e97dc5ded35257ed956b25e88f6048ff8f2eb5ae4631ee1cbef447bec2!/eml.xml
8 https://deeplinker.bio/cat/zip:hash://sha256/4a05fbb4ef305501a30509504538b6567a6e06a0eed15d92c69cc625398d29c1!/eml.xml
8 https://deeplinker.bio/cat/zip:hash://sha256/18f5fcebbed601c5ff6beaba9eff2617e56f81ff1101efa03e23c8e2b85bdb54!/eml.xml
6 https://deeplinker.bio/cat/zip:hash://sha256/f5925016562e959be8cedf5c3a8bccd859d801e1edf410762c28e020c9c75602!/eml.xml
6 https://deeplinker.bio/cat/zip:hash://sha256/19b349e0c6e1a28c17d9f15d04db16d9798c965a4e1d99d5d5e56418a20ac640!/eml.xml
5 https://deeplinker.bio/cat/zip:hash://sha256/e794d4aa09e5df42a84ae3e536e9f0311680cdafb7884080a3d44b28d7e59cfe!/eml.xml
5 https://deeplinker.bio/cat/zip:hash://sha256/d5ebbfc9e681bfa4fa45a128fb270cad6b6992f3f2a80926ef53d3978c87f8f5!/eml.xml
5 https://deeplinker.bio/cat/zip:hash://sha256/a7a140ab15522117851f32a2dcab05c30052aafda37f58a4df922a8124dcc5c8!/eml.xml
5 https://deeplinker.bio/cat/zip:hash://sha256/24f78e8c56473d4a346fa4575cdb3daf02a2439e972aa3dbf5c9629cb1d15199!/eml.xml
4 https://deeplinker.bio/cat/zip:hash://sha256/cc339030bf6fdd1105872c3362ce5cb7fed8193ddd0258aaafab6bc031ae58ad!/eml.xml
4 https://deeplinker.bio/cat/zip:hash://sha256/51f687e7c87216f367d1fa2b13a5dc13adf1a09feb55591870910da5a23d0fb0!/eml.xml
4 https://deeplinker.bio/cat/zip:hash://sha256/484816893ce3c303ebe1aae83e76d519b5fd1a08e96d838a7946f516c306ce7b!/eml.xml
3 https://deeplinker.bio/cat/zip:hash://sha256/7dd0449778aad8a4806fd7be6ed9384522e84f20567fa01ee8915617b4922e51!/eml.xml
3 https://deeplinker.bio/cat/zip:hash://sha256/561bfe9db45e22b44161449482c0c7b4db6325791c2d0f05b640136f8b8a48bc!/eml.xml
3 https://deeplinker.bio/cat/zip:hash://sha256/544e12d30f8bd789c4b675e31cee8dcf9120b2d5d244fe7da16cac33ee1ec552!/eml.xml
qgroom commented 3 years ago

Now I have learned what a Bloom filter is 👍 I still don't really understand all the potential uses for this and it would be interesting to discuss possibilities.

jhpoelen commented 3 years ago

@qgroom one of the ways to use this is to periodically scan across iDigBio/GBIF (and potentially DataONE, BHL, etc.) and report all datasets that use Meise identifiers. In other words, you can generate an inverted reference list at the specimen id level ("what dataset cited this record?").

I can imagine tons of other potential uses and I'd be happy to discuss whenever you have time.
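
As a rough illustration of the inverted reference list idea, the sketch below maps each matched identifier to the datasets that mention it. The pattern for Meise identifiers is not given in this thread, so the Arctos regular expression from the commands above is used as a stand-in; the dataset names and their contents are made up.

```python
import re
from collections import defaultdict

# Same regular expression as in the preston grep commands in this thread.
ARCTOS_GUID = re.compile(
    r"http[s]{0,1}://arctos.database.museum/guid/"
    r"[a-zA-Z]+:[a-zA-Z]+:[a-zA-Z0-9().-]+[a-zA-Z0-9]"
)


def inverted_reference_list(datasets):
    """Map each matched identifier to the set of datasets that mention it."""
    index = defaultdict(set)
    for dataset_id, text in datasets.items():
        for identifier in ARCTOS_GUID.findall(text):
            index[identifier].add(dataset_id)
    return dict(index)


# Hypothetical dataset contents, keyed by a made-up dataset name.
datasets = {
    "dataset-a": (
        "occurrence 1 cites http://arctos.database.museum/guid/MVZ:Bird:1234; "
        "occurrence 2 cites https://arctos.database.museum/guid/MVZ:Bird:5678"
    ),
    "dataset-b": "derived from http://arctos.database.museum/guid/MVZ:Bird:1234.",
}

index = inverted_reference_list(datasets)
for identifier, cited_by in sorted(index.items()):
    print(identifier, "<-", sorted(cited_by))
```

Note that `http` and `https` variants of the same identifier are treated as different strings here; normalizing them would be a separate design decision.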

jhpoelen commented 3 years ago

An alternative to the bloom filter implementation was made using Theta Sketches (see https://datasketches.apache.org/docs/Theta). Initial tests show a speed improvement of an order of magnitude as well as a reduction in sketch size. More tests are needed to compare against the previous tests with the bloom filter implementation.

jhpoelen commented 3 years ago

Theta sketches, an alternative to bloom filters, were implemented and used to estimate common terms across arctos datasets:

REMOTE="--remote file:///home/preston/preston-archive/data"

# first traverse graph and calculate theta sketches
preston cat $REMOTE hash://sha256/1fd3e156c6ba1632a27b2bebaea36f76afeac8dfecf530d772988832821304ea\
| tee progress.log\
| ./preston-bloom grep --no-cache $REMOTE 'http[s]{0,1}://arctos.database.museum/guid/[a-zA-Z]+:[a-zA-Z]+:[a-zA-Z0-9().-]+[a-zA-Z0-9]' \
| ./preston-bloom sketch --sketch-type=theta $REMOTE --no-cache\
| ./preston-bloom process

# then do diff
preston ls\
| ./preston-bloom diff\
| tee progress-diff.log\
| preston process

The calculation of the theta sketches started around 2021-04-09T00:55:34.866Z and ended before 2021-04-09T04:02:50.864Z, taking about 3 hours. The diff calculation completed only minutes later, about two orders of magnitude faster than the equivalent bloom filter intersections, which took a little under 4 hours to complete (see the timestamps reported above).

However, the dataset connectivity results were different:

columns: estimated number of datasets with overlapping arctos identifiers, reference dataset
80 https://deeplinker.bio/cat/zip:hash://sha256/d498df855ebeb0714254ae27f095cc0d18d9a8a728edf004621515b8298ff43f!/eml.xml
19 https://deeplinker.bio/cat/zip:hash://sha256/80ed6a98fbe409e6f71f1fbcc50d63fa54ec1497827698868cddaf363c6a41ef!/eml.xml
16 https://deeplinker.bio/cat/zip:hash://sha256/6ca565f1654a7131a0e44d0a559ee11927e2abcd52bffa14baa93c928526e1e1!/eml.xml
11 https://deeplinker.bio/cat/zip:hash://sha256/a4826574c987ee11e86f44f80671802dda6ba8d6a2339fa28448ed39168961b5!/eml.xml
11 https://deeplinker.bio/cat/zip:hash://sha256/6cdcf277ce801a358bdf133168d0d147c7c65b9c40a8b173cddb855e7fb9fcfd!/eml.xml
11 https://deeplinker.bio/cat/zip:hash://sha256/32f75f27c82606d3f90b384618859dcfb4e30aa27a7541480ac9d592fd1eba68!/eml.xml
11 https://deeplinker.bio/cat/zip:hash://sha256/1456e549483fd9ac97f028f0231f3c7888a63b72abfcd110446a4296dab25676!/eml.xml
10 https://deeplinker.bio/cat/zip:hash://sha256/b0237a3e2033aab0a5c40b5f07f67f0b470e34a404ccbe12d3bb6bae7bcfeb37!/eml.xml
8 https://deeplinker.bio/cat/zip:hash://sha256/e45e729a58dff71e68c1d9ea9cacbec347b9649af177b2703a3405e25cb00ad7!/eml.xml
7 https://deeplinker.bio/cat/zip:hash://sha256/a8897e1185e09eddc48e5fca5f34fb9c7e92fba1b9a2ce1bbe77b0502ec4e33c!/eml.xml
7 https://deeplinker.bio/cat/zip:hash://sha256/10aeebea4344a92f9f0ccc462ca74476ef8ebabedcf854292fb9c4f689e59598!/eml.xml
6 https://deeplinker.bio/cat/zip:hash://sha256/b16d0cdee319109625a82c361fefcf5815bec7c6ea3c2c64f4d950ce09cbbdab!/eml.xml
6 https://deeplinker.bio/cat/zip:hash://sha256/9433518cb8c2833ae31e02b70bad217e2f72c57c03bbea1de16a882b89f44b98!/eml.xml
6 https://deeplinker.bio/cat/zip:hash://sha256/8b598e57b2cd3089a27fc3ee1c77966ba5c27e7ffd6dbefb8222b6d7e5f5e092!/eml.xml
6 https://deeplinker.bio/cat/zip:hash://sha256/3942008b820e50e302e02720464adf1fa475598d110a8400be3d437c1c51e4b9!/eml.xml
6 https://deeplinker.bio/cat/zip:hash://sha256/1bd009d1ebcd061656441957b71bb8c1319b52bb4f712fb1a9f3cd89f880c5f9!/eml.xml
5 https://deeplinker.bio/cat/zip:hash://sha256/b81c1b18fe4449ffd8dc7e405af9e661f676a67ef7c2454227b5db082e093fb1!/eml.xml
5 https://deeplinker.bio/cat/zip:hash://sha256/8b58093d41be94e7dece782537a7948e64ecb090d6684bf582967e510609f4c3!/eml.xml
5 https://deeplinker.bio/cat/zip:hash://sha256/39119adac31024807bbfaa8306bb0c9e3d1b858ece3e8c02b4b3e1cd5cf94083!/eml.xml
4 https://deeplinker.bio/cat/zip:hash://sha256/fcd629c0c8ebe92f560c6392ddf26975a53bc0343eba557a809186c5188bbbe1!/eml.xml
4 https://deeplinker.bio/cat/zip:hash://sha256/f030de2d507e6fbb8b008090416f094f9e55441522691ecee5562e3c943d859c!/eml.xml
4 https://deeplinker.bio/cat/zip:hash://sha256/eb37acf410bee39bcb2327ea568fb847812dddf157e36e2296bfa72d7e3416ef!/eml.xml
4 https://deeplinker.bio/cat/zip:hash://sha256/d0d0a4047c8d884b6ed944fea41f1248125377da6996dee29d7b18fc6b9e5b5b!/eml.xml
4 https://deeplinker.bio/cat/zip:hash://sha256/30869702aa9fd66e555b94dd9eefaa653de1bbbc040feff6f40b9b6470769993!/eml.xml
4 https://deeplinker.bio/cat/zip:hash://sha256/28ab5a3dd2bf0def413c135837b107dec39f8f52d8205249470c49a2107561d7!/eml.xml
3 https://deeplinker.bio/cat/zip:hash://sha256/e35753ba294673339b986a1065cef95e8f55679019367ec8f0008cd50ed3263d!/eml.xml
3 https://deeplinker.bio/cat/zip:hash://sha256/df0ae8103de11fe57a5fd182de94092727095d00fef892dbc20566ffef0a46ce!/eml.xml
3 https://deeplinker.bio/cat/zip:hash://sha256/86a8a0e2f8b87f3b110fdf7b16888c79aea0a5eb337eea6df1a9abfc04d30037!/eml.xml
3 https://deeplinker.bio/cat/zip:hash://sha256/752c7207c98024c12a8ffb64b0d6423181ba5f5305c0fe53030ae91877571cd4!/eml.xml
3 https://deeplinker.bio/cat/zip:hash://sha256/6ded41cc7ec6f0fa1c488299d4999e99e21bc9a412d452e01e71c7b51a3a9dce!/eml.xml
3 https://deeplinker.bio/cat/zip:hash://sha256/5160bec035b6704245795af5ff0978c76addec823ae1b81d04b78997a1deb7a4!/eml.xml
3 https://deeplinker.bio/cat/zip:hash://sha256/27ab6ebfd6c6baa8a4a7bc64504227f51482ba2181d2c880cad63cf8119fb60a!/eml.xml
3 https://deeplinker.bio/cat/zip:hash://sha256/069ae7fb715ee607ea92582767cdae429e6e705ae82f39cf027010cf7f203d48!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/f9ccc03e1ba576662de182812a8c0c80b43c9009ead632e93e33a902eb1a37ac!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/db8299ac524aa941142e0f372c286fde4b0746d3f77e03135beb555fa918d7b6!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/da0ff4112229c2669df394b3843fcb430982a6669f05e6e90072e954ddc35bff!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/d5ebbfc9e681bfa4fa45a128fb270cad6b6992f3f2a80926ef53d3978c87f8f5!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/bd8b2f809ccaf65d5f1fb8c034667ff74c8d1ef28e666e41730a6c54f84d11d2!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/bd568b0290aa50087841d8414a2e188c5406469fe3f40b55801ead2e40391208!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/b8e7224cd211d9ea636085875e44c2ce0452884c80c9ae0d12af2ec04aa438f0!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/a16858acd02b8538f456c9c4ad4b3478a16eb3e88f3e48bc2f66038c810e74f9!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/893b38f9fc61b8cc2bd619419194e5076d122eefdf9b48f733e9ae7e1797d49d!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/86fa07ae2d2a09283f625f0735090e0044ffc1249c5b5007503f5aa63a92259e!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/741dc023f9091836097d727b44a91df28a09efc67d675e207c4d3367789a2919!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/711c7b903326b456f3f342139a06faf3fdaac80dd643d2347a8dbcc8d07d3102!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/63297465d9345560cd53935e08b03a0478a2e430334ae5422ae76bde5b84f3ca!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/4739ba22073cdd56934fae9959ac6ab3cb97ad7cee6184bd5d77ca27344203b5!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/46f94323d70f42c72972d7713f4253c44d9e452cad6ff746066553113fa4f47a!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/413a19320aa365b9c13a93af20cc4f6f6f7a00129b68e92780c86f5da2792b13!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/34f87a9866b09ed8cec1bd7525c26f4618055b7e2b45f270b24f8dbde8494d7e!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/2dec0c4f6b10f8ebf418b6b574729c7f6cd68267d5cec58e896654b80fc4f6f3!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/29f14c34799c03e095bc30b720a29b9a67ae2abfe4757e0e5c0d9b5d4ffea8c0!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/238e2e744721f6bd174e668854915518a761dfb8f4c1d6c06f5920e827c58a48!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/21b9da174c7bea66eb2add3f13fe3b7de1c6a6bd019f8ac96a9d3dce6160d566!/eml.xml
2 https://deeplinker.bio/cat/zip:hash://sha256/15899373e87144a9c732252ce6b55f7af67fe4385f1383a3b3fb84c78a6b4bc6!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/faf3e7ec497f9113a86ee118b1f62c7c281e8795a4e2eca6e0196c07d0a7cbd9!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/dfb38f5998e1a2f9a10c7ddb5519a1bcc53a1f525653bb0a52b6f0cd42a2f89c!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/dd195fa31c3bb298cdaccbcb5fe77ae64e18f193bdc73f238ad01cc642e7d7b3!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/d44feac6a5bd0d36ccbbb2d8675664b0dbe78d20d714acf80a21cbb7eb1d90d6!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/d13253d89bda1fe176196e50b52e2489de70260076720b6ca4834d3af659db01!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/cc339030bf6fdd1105872c3362ce5cb7fed8193ddd0258aaafab6bc031ae58ad!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/cbc0c6b436878eef49a7c4b8551d0dece30d36f63d1fb51f842a78c854da75d9!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/c642adb715feee454f863fba91a996c6b25509fa9ef193997f78af9985be40d8!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/c3aaceb69572024e7baea9c932ce8adafae160e7d3b4bcc6cc2fbab15a1be80e!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/c37154ff40d80eef5522fd7ed9190ac8d812adfcf25ea3a88f037205c2fbaa0e!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/befa0aabdf675f5adaaeeaabb5bb88690fc81f30d5b80a9e0eb902c5584118a3!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/bd08b84262f8d4ae7c39e33bf85e3f48970fe3dc40840abc39a3082640468ab2!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/ba20fc5d91eb82400bc03f1df4c1d175349f7946ea378167e99c30fdeabbd3c7!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/b8a9c3cb70a7957656da502543fcbd50da71bf51daa29d227b59de32986decc0!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/b1604fee95df02d5bb2aa5bbfe6692e3a1262ab923bad7d5887aa511722c6e4f!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/adfef31f401d339dca6f985f80bece874cf1be2ae60822c9fa9c44a5b4cbff27!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/a34c2390e555e6b01388cdc3e2710809b8c8b441047be601814259ca0fa3db76!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/9e610b91a8b6403215dd51b5c22806e7baea143d98138576c9d46dae97dbd9f9!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/9d2998ac0c3b145a363c2bdae7f4bd0a5be04e653559d7dee0717f4341fa55af!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/974662af1ffdac63c13ffa5d426d0accddb4bd94aef857f65e982f5373517086!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/95d6ab5426eab31ffb9ce1eda0c6bd0e14b1f9f63c0f1c9c8e4c17915f4caf46!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/93bbc57b475d848f6d342ca25e9c9747a1ff410250a40ebb0096080d3d40885f!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/936254bb17c1a33bc97eac8dc04e3b5d8291e849de80c637f093b6167837c959!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/8c53d9e97dc5ded35257ed956b25e88f6048ff8f2eb5ae4631ee1cbef447bec2!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/822e5d82d068d8cb6008c189f456d5d3d407c8c35a9f572e167289c448e6e24e!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/774feed2f2531ca53a41d1911c69b749891e228b77c7af9eec8c691175d8431d!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/701f5a1ecfc075d5358d8c3529674e81e15158cc38d60716c862ebbacee73c51!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/6ca45fc8e75ab43c61354426e135ce28b94d3d958bee9b6ff624e1875496c2b7!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/67b8407b1afa40de25e968cb9309ee267117c29932a26f5cc82cb1f4d0147da4!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/6040e5c690fffd88df4a5a3aabad23cda0e2b5432b273d1f11a3246448cab1d1!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/5e45fdecd916ba8550ee8443042962ba20418e99c13687208b1542b9d86e3b3c!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/5d248b7b9cebb64314460208d920d32fcf59e432770784c2f5eadf75e15716f8!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/58135d864837df3d4816e4d0c5a4a1c52f0a2608c74f30dbaaf1cabe135f9822!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/53ccb18c25f7fd87baaef785ecb78b3344ae2b598b177bb7994ee56eb57c540b!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/51f687e7c87216f367d1fa2b13a5dc13adf1a09feb55591870910da5a23d0fb0!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/4ff313ef69a26994b5a2c8eb3a78af803a718668ebec51b1311ca2c21cdd154e!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/4d225e33972d6389cea6625ee94b6653a13740efc3d9ad82f3645868fedb8f30!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/4a26e4d747be5f2542f572146be3c2885f9ba1a4821daaa6a2bc9f34c0e21ca2!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/478213f06ab5edad7447cf5e37ce4c7a6c5e9291afb144786547db5bdc89cf7f!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/3f9a2fcc0ff1c666c77d8011bb15ffe95d2e4a2476093756786bbfd5704a6b7c!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/377a1770a4fa0d240986cacdcb2aa4215b5a6872700d3c6fd4c2291a25832f71!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/307367a7229b25ca11a9d5f64b0d53949cd52cf450a536ade9c0e423adb9aea6!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/2bffe917329982cc53cd26f34344286c26421a0c2ea90db355c0ee02b6b2f55d!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/25dcbd8bba1aef04c520b76d5f94192bd47085ec75eef5e39e803730c1ba90bf!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/24f78e8c56473d4a346fa4575cdb3daf02a2439e972aa3dbf5c9629cb1d15199!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/20f4a60dbf45dafe230bda09cb7cc32ce53cc8902fd557c33843305418247f40!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/1cf37420832aa0c3ffcb52ca78b715be999ac3ae1f25acef002f43e54b113dea!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/1a917c68e159bd37510147c4a3fe489085f4871a0257fecd8c6d55f021714cc9!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/18f5fcebbed601c5ff6beaba9eff2617e56f81ff1101efa03e23c8e2b85bdb54!/eml.xml
1 https://deeplinker.bio/cat/zip:hash://sha256/03e12f813aca56dfc4af8bf5e08d2dd09cbd95c76f5ae46c6471f086e81dac82!/eml.xml
jhpoelen commented 3 years ago

The bloom filter intersect detected an overlap between:

hash://sha256/30869702aa9fd66e555b94dd9eefaa653de1bbbc040feff6f40b9b6470769993 (MVZ Bird Collection (Arctos))

and

hash://sha256/544e12d30f8bd789c4b675e31cee8dcf9120b2d5d244fe7da16cac33ee1ec552 (NatureServe Network Species Occurrence Data)

but the associated theta sketches -

<theta:hash://sha256/a6e5b8c04c87552d91fce9983bd571eb9bf2651462f0708bf49783709e12cf5f> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/30869702aa9fd66e555b94dd9eefaa653de1bbbc040feff6f40b9b6470769993>

<theta:hash://sha256/45e09dd8f88939469307c0dcd0e83cfbb8fd476e57e6abcb067fbe8648991b58> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/544e12d30f8bd789c4b675e31cee8dcf9120b2d5d244fe7da16cac33ee1ec552>

reported an empty intersection.

jhpoelen commented 3 years ago

This makes sense, because the spread between the upper and lower bounds of the estimate for the "big" set (the birds, with est. 192717 distinct arctos ids) is much bigger than the size of the "small" set (the NatureServe set, with est. 3 distinct arctos ids), so an overlap of only a few identifiers can go undetected.
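
To make this concrete, here is a simplified sketch in the spirit of a theta sketch: keep only the k smallest hash values and use the cut-off (theta) to scale counts back up. It is not the Apache DataSketches implementation used by preston; the value k=4096, the helper names, and the `MVZ:Bird:<n>` identifiers are assumptions. Only the set sizes (192717 and 3) come from this thread.

```python
import hashlib

MAX_HASH = 2 ** 64


def hash64(term):
    # First 8 bytes of SHA-256 as an unsigned 64-bit hash value.
    return int.from_bytes(hashlib.sha256(term.encode("utf-8")).digest()[:8], "big")


def build_sketch(terms, k=4096):
    """Return (theta, retained): the k smallest hashes and the cut-off above them."""
    hashes = sorted({hash64(t) for t in terms})
    if len(hashes) <= k:
        return MAX_HASH, set(hashes)   # exact mode: nothing was sampled away
    return hashes[k], set(hashes[:k])  # theta is the (k+1)-th smallest hash


def estimate(sketch):
    # Retained count divided by the sampled fraction theta / MAX_HASH.
    theta, retained = sketch
    return len(retained) * MAX_HASH / theta


def intersect(a, b):
    # Only hashes below the smaller theta are comparable across both sketches.
    theta = min(a[0], b[0])
    return theta, {h for h in a[1] & b[1] if h < theta}


def guid(i):
    # Hypothetical identifiers shaped like the Arctos pattern used above.
    return f"http://arctos.database.museum/guid/MVZ:Bird:{i}"


big = build_sketch(guid(i) for i in range(192717))
small = build_sketch(guid(i) for i in (10, 20, 30))  # all three are also in big
shared = intersect(big, small)

print(f"big:    {estimate(big):.0f}")
print(f"small:  {estimate(small):.0f}")
print(f"shared: {estimate(shared):.0f}")
```

With these assumed parameters the big sketch retains only about 2% of its hashes, so each of the three shared identifiers has a small chance of surviving. The intersection estimate is therefore either 0 or a multiple of roughly 47, and never the true value of 3.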

jhpoelen commented 3 years ago

Theta sketches are designed to work with large sets, and care should be taken when dealing with small sketches (few identifiers per sketch). For biodiversity data research, I imagine an entire PhD thesis could be devoted to picking suitable sketch families or configurations (bloom/theta) to answer specific questions. For more information, see e.g. https://datasketches.apache.org/docs/Theta/ThetaPSampling.html .

jhpoelen commented 2 years ago

related to #129

jhpoelen commented 2 years ago

The included examples show that you can use bloom filters and theta sketches to calculate common terms across datasets.