bio-guoda / preston

a biodiversity dataset tracker
MIT License

create graph showing preserved specimen image count over time #168

Closed jhpoelen closed 2 years ago

jhpoelen commented 2 years ago

as suggested by @qgroom

[image: img_20220419_211240]

jhpoelen commented 2 years ago

sketch of workflow:

preston ls --remote https://deeplinker.bio\
 | preston dwc-stream\
 | (insert magic jq selectors)\
 | (do name resolution)\
 | group by prov hash
jhpoelen commented 2 years ago

hey @qgroom -

I am looking to generate your graph of image counts over time (still interested?).

And now I am trying to compile the list of image URIs to use:

So far, I have:

- http://rs.tdwg.org/ac/terms/accessURI
- http://rs.tdwg.org/ac/terms/thumbnailAccessURI
- http://rs.tdwg.org/ac/terms/goodQualityAccessURI

and digging for some more.

Can you think of some other URIs that are used to publish image URLs?

jhpoelen commented 2 years ago

Also, perhaps:

in http://rs.gbif.org/extension/gbif/1.0/images.xml, where http://purl.org/dc/terms/type = StillImage

and

in http://rs.gbif.org/extension/gbif/1.0/multimedia.xml, also http://purl.org/dc/terms/identifier where http://purl.org/dc/terms/type = StillImage
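
Taken together, a rough first-pass filter over dwc-stream style JSON lines might look like the sketch below. The sample records and file names are invented for illustration; a real run would stream `preston dwc-stream` output instead of a toy file.

```shell
# Toy stand-in for dwc-stream output: one JSON record per line.
# The property URIs are the candidate image terms discussed above.
cat <<'EOF' > records.jsonl
{"http://rs.tdwg.org/ac/terms/accessURI":"https://example.org/img1.jpg","http://purl.org/dc/terms/type":"StillImage"}
{"http://purl.org/dc/terms/identifier":"https://example.org/img2.jpg","http://purl.org/dc/terms/type":"StillImage"}
{"http://purl.org/dc/terms/identifier":"https://example.org/a.mp3","http://purl.org/dc/terms/type":"Sound"}
EOF

# Keep records typed as StillImage that carry one of the candidate image URI terms.
grep '"http://purl.org/dc/terms/type":"StillImage"' records.jsonl \
  | grep -E 'accessURI|thumbnailAccessURI|goodQualityAccessURI|dc/terms/identifier' \
  > still-images.jsonl

wc -l < still-images.jsonl   # two of the three toy records qualify
```

A crude grep like this can over-match in principle (a term could appear inside a value), so a real implementation would use jq selectors on the exact keys instead.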

jhpoelen commented 2 years ago

@qgroom making small steps towards creating your graph and attempting to document methods at:

https://github.com/bio-guoda/preston-specimen-images

jhpoelen commented 2 years ago

Status update: on a single thread, with a spinning external hard disk connected to an 11-year-old laptop, I was able to detect 115M image records in about 13 hours. I have yet to filter to preserved specimens only, assign timestamps to the tracked DwC-As, and perform taxonomic name alignment. Small steps . . .

Note that with an SSD and parallel processing, I expect much faster processing speeds. The metrics documented here are pretty much a worst-case scenario performance-wise.

$ ./find-image-records.sh | pv -l | gzip > image-records.json.gz
...

[Screenshot from 2022-06-09 08-19-53]

qgroom commented 2 years ago

Hi Jorrit, thanks for pushing forward on this, and apologies for my slow response! I had SPNHC followed by COVID, so I'm getting way behind on emails, but I'm still very interested in the results! I'm guessing the number of images will drop considerably when you select only preserved specimens. Do you need taxonomic name alignment, aren't they already aligned to the GBIF Taxonomic Backbone?

jhpoelen commented 2 years ago

hey @qgroom - good to hear from you. Sounds like SPNHC was quite something.

I'm guessing the number of images will drop considerably when you select only preserved specimens.

Yes, preserved specimens are in the relative minority, especially compared with human observations from eBird/iNaturalist and friends, which account for about half of the total number of GBIF occurrences.

Do you need taxonomic name alignment, aren't they already aligned to the GBIF Taxonomic Backbone?

Preston is tracking the "raw" datasets as registered with GBIF/iDigBio. So, this does not include the interpreted data products that GBIF provides. I believe using the raw data is important to retain the provenance of the knowledge and allow for re-interpretation of taxonomic (or other) aspects of the provided datasets.

In other words, provided data is not yet aligned with any particular taxonomy: names are as provided. Nomer or similar tools can be used to perform taxonomic name alignment.

What, if any, taxonomic backbone would you like to align with? Which version?

jhpoelen commented 2 years ago

Here's some questions that come up for me as I am mulling over your earlier comments.

Would you individually count multiple images for a single specimen?

Or, would you consider a specimen with one or more images to be sufficient?

What if the image is a label image only?

qgroom commented 2 years ago

What, if any, taxonomic backbone would you like to align with? Which version?

I don't really mind. Given a choice, I would just like to be able to separate the results into some broad classifications, such as insects, plants, birds and mammals. However, even that is not 100% necessary. I think it would be more interesting for people if they were split up, but if it is too difficult it is not critical information for the paper.

Would you individually count multiple images for a single specimen?

No, I want the number of imaged specimens. I don't care if they have multiple images.

Or, would you consider a specimen with one or more images to be sufficient?

Yes

What if the image is a label image only?

That's good enough for me

The point is to demonstrate the growth in specimen digitization and how this is becoming an important corpus of images suitable for machine learning.

jhpoelen commented 2 years ago

@qgroom thanks for clarifying, I think I have enough information. Now the "only" thing left is to run the analysis, do name alignment, and collect results . . . hopefully more later soon. Please bug me if I am taking too long.

qgroom commented 2 years ago

Hi Jorrit, do you have any time to look into this? I want to submit the image infrastructure paper soon and it would be great if we could include this.

jhpoelen commented 2 years ago

am still catching up from my two weeks of travel.

When would you like to have it?

And, how much time can you or colleagues spend on this? From my last visit, it seems that you, Matt and Pieter (e.g., @matdillen @PietrH) probably have the expertise to generate the imagined graph independently from publicly available data tracked by Preston. It would also be a way to independently verify the data tracking method and related tools (e.g., preston, nomer). However, I do realize the learning curve might be a bit challenging.

Am trying to strike a balance between doing the work and delegating where I can. Please advise. I can try to see what I can do.

qgroom commented 2 years ago

am still catching up from my two weeks of travel.

Yes, I can imagine. I hope you had a good trip.

When would you like to have it?

ASAP, the paper is nearly ready to go.

how much time can you or colleagues spend on this?

Not much, but some.

probably have the expertise to generate the imagined graph independently...

Yes, I think we could figure it out, but I doubt we could do it in time for this paper.

I think it is worth doing, but perhaps we should leave it out of the paper for now

jhpoelen commented 2 years ago

@qgroom thanks for clarifying. I'll try to get something together this week. If I can't make that, we might have to discuss alternatives.

qgroom commented 2 years ago

Thanks! Like I say, if it proves difficult to do quickly, I still think it is worth doing, perhaps to add to the paper when it comes back for revisions. I can imagine many people will be interested in the result. Though I am very curious how it compares with my imagination.

jhpoelen commented 2 years ago

I am working on this today - will give you update at the end of my day at the latest.

jhpoelen commented 2 years ago

@qgroom oef. I am amazed by how hard it is to make things simple.

I've completed a first (big) step: a script to select all archives seen at a specific time that contain still images as well as preserved specimens. Am running the script now. See https://github.com/jhpoelen/specimen-image-index .

A useful by-catch of this exercise may be an exhaustive list of specimen image archives (aka a specimen image index).

Just started the process to generate this list. As you might expect, this is a non-trivial exercise made somewhat doable using preston and friends.

Next step - link the image information to the taxonomic / specimen information, resolve related taxonomic information and put into bucket "insects" "plants" and "mammals".

Apologies for the delay. This is a labor- and compute-intensive task.

jhpoelen commented 2 years ago

For what it is worth - the make.sh in https://github.com/jhpoelen/specimen-image-index is built to run in a parallel workflow. Currently running on >8 cores on some undisclosed server somewhere in Europe.

jhpoelen commented 2 years ago

. . . hours later . . . finished a first pass at counting images across time.

Am running on the server as we speak.

The idea is to:

  1. take 10 (random) preston snapshots
  2. include only dwc-a with still images / preserved specimen
  3. lookup names associated with images
  4. align names against catalogue of life
  5. select only records with Mammalia, Insecta or Plantae
  6. append short prov id and snapshot date

The resulting table should look something like:

| name | provIdShort | provDate |
| --- | --- | --- |
| Insecta | 3df3 | 2022-07-22 |
| Insecta | 3df3 | 2022-07-22 |
| ... | ... | ... |

where each row represents a named specimen record with one or more images. The names are either "Insecta", "Mammalia", or "Plantae".
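
Counting rows in a table shaped like that is then a plain sort/uniq exercise. A minimal sketch with invented rows (the file name and values are illustrative only):

```shell
# Toy version of the name/provIdShort/provDate table (tab-separated),
# one row per imaged specimen record.
printf 'Insecta\t3df3\t2022-07-22\nInsecta\t3df3\t2022-07-22\nPlantae\t3df3\t2022-07-22\n' > names.tsv

# Collapse duplicate rows into per-name, per-snapshot counts.
sort names.tsv | uniq -c
```

For these toy rows, `uniq -c` reports 2 Insecta records and 1 Plantae record for the 2022-07-22 snapshot.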

qgroom commented 2 years ago

Very cool! I think I can follow how the code works, and I can see it is quite intensive. My go-to method would have been to load it all into a database, but that would have taken even more effort, I imagine.

take 10 (random) preston snapshots

How do you ensure you get a good spread over time? Or do I misunderstand this?

jhpoelen commented 2 years ago

How do you ensure you get a good spread over time? Or do I misunderstand this?

@qgroom great question! The random sample was a bit of a . . . random sample. If you'd like to cherry pick dates, you can do that too. What did you have in mind?

The snapshot sampling frequency has been monthly since about 2019-03. Prior to that, weekly snapshots were taken starting 2018-09, but the logs have a slightly different format, making it a little more difficult to process with the methods implemented.

Btw - make.sh was able to process about 4-5 snapshots overnight without too much optimization. However, I forgot to install "miller" https://miller.readthedocs.io/ on the server, so the last part of the processing (join/align) is still missing. I have installed it now, and am hoping to get some results later in the day.

We now have a neat list of tracked content with "traits" like "has still image" or "has preserved specimen". In other words, we've built a time series of digitized collections of preserved specimens with images as seen by iDigBio / GBIF / BioCaSE. That is something, right? If you'd like, I can share some preliminary results on that.

Also, yes, I agree that make.sh looks a little involved. I'd be curious to see alternate approaches that you or your colleagues can come up with . . .

qgroom commented 2 years ago

The snapshot sampling frequency has been monthly since about 2019-03. Prior to that, weekly snapshots were taken starting 2018-09, but the logs have a slightly different format, making it a little more difficult to process with the methods implemented.

Perhaps we will end up with a reasonable time series by random chance.

Btw - make.sh was able to process about 4-5 snapshots overnight without too much optimization. However, I forgot to install "miller" https://miller.readthedocs.io/ on the server, so the last part of the processing (join/align) is still missing. I have installed it now, and am hoping to get some results later in the day.

I'm sitting on the edge of my chair :pray:

We now have a neat list of tracked content with "traits" like "has still image" or "has preserved specimen". In other words, we've built a time series of digitized collections of preserved specimens with images as seen by iDigBio / GBIF / BioCaSE. That is something, right? If you'd like, I can share some preliminary results on that.

That is indeed something! Yes, I'd love to see some results

Also, yes, I agree that make.sh looks a little involved. I'd be curious to see alternate approaches that you or your colleagues can come up with . . .

It didn't look that bad. In fact, with example code like this it would be possible to adapt it for other purposes as well. Most of my coding is done by copying examples 😉

jhpoelen commented 2 years ago

Ok, will keep you posted on results. Apologies for the suspense . . . :wink:

Also, as far as copy/pasting goes . . . I am a big fan of copy/pasting also.

Would it make sense to turn this into a little copy-paste-compatible mini-tutorial on how to process all the data using the streaming approaches that I used? If so, what format would you like this to be in? An extended README? Or something more involved?

qgroom commented 2 years ago

Would it make sense to turn this into a little copy-paste-compatible mini-tutorial on how to process all the data using the streaming approaches that I used? If so, what format would you like this to be in? An extended README? Or something more involved?

Certainly! An extended README would be fine.

I'm wondering whether we could make this a text box within the paper. It might be a good advert for the method.

jhpoelen commented 2 years ago

The first result just rolled in:

$ cat plantae_insecta_or_mammalia_image_251f.tsv | sort | uniq -c
 587806 Insecta 251f    2021-07-01
  19321 Mammalia    251f    2021-07-01
11324522 Plantae    251f    2021-07-01

Which would be: 11M plants / 588k insects / 19k mammals . . . preserved specimen records with still images, observed via Preston's 2021-07-01 snapshot.

Seems a bit low, probably due to name alignment issues.

qgroom commented 2 years ago

We could run the results past @timrobertson100. He will probably know if we're in the right ball park

jhpoelen commented 2 years ago

For the 2021-07-01 snapshot, 13M preserved specimen records with still images were extracted.

$ cat content-name-image_251f.tsv | wc -l
13874481

Apparently, our methods picked up only 705 datasets that contained still images:

$ cat content-with-still-images_251f.tsv | wc -l
705

and 1158 datasets with multimedia references in their schema.

$ cat content-with-multimedia_251f.tsv | wc -l
1158

So, the name alignment might not be a major factor . . . Can I walk you through the extraction process via video chat? This way, I can review the method I use to detect the preserved specimens and still images.

On the bright side, the method is applied consistently, so a subset might highlight a representative relative change in image availability for preserved specimens.

qgroom commented 2 years ago

I suppose it would be good to calculate the total for all taxa, so that if there was loss due to name alignment we still have that total too. Is that doable without having to extract again?

jhpoelen commented 2 years ago

I've shared the intermediate results for the 2021-07-01 snapshot via temporary download link at: https://send.tresorit.com/a#ix7d4A9Tj1Ka4RUPXPRGDw

Curious to hear your thoughts.

timrobertson100 commented 2 years ago

We could run the results past @timrobertson100. He will probably know if we're in the right ball park

Using GBIF.org with filters for preserved specimen records having image you'd get 38M plant records, 638k mammal records and 3.4M insect records.

Please shout if you need anything beyond that.

jhpoelen commented 2 years ago

@timrobertson100 thanks for looking this up!

Some questions:

  1. Would this include a non-blank scientific name for the associated record?
  2. How do you normalize the stillimage records?
  3. Prior to parsing using the gbif name parser, do you do any name scrubbing?

According to my records for the 2021-07-01 snapshot (over 1 year ago), there are a bunch (>900k) of records with still images but blank scientific names, which might account for some of the differences.

Here are the top 10 most frequently occurring scientificNames in the tracked dataset.

$ zcat content-name-image_251f.tsv.gz | cut -f2 | sort | uniq -c |sort -nr | head 
 951474 
  12516 Demospongiae
  11204 Plantae
  10587 Carex
  10307 Fungi
   8927 Polystichum acrostichoides
   8032 Asplenium platyneuron
   8012 Equisetum arvense
   7976 Dicranum scoparium
   7969 Cyperus strigosus
jhpoelen commented 2 years ago

Also, please note that I counted only a single record when one or more images were present. So, if one specimen has 10 images, only a single row is included. @timrobertson100 do the results from the GBIF gallery count images or records with images?

jhpoelen commented 2 years ago

And, it is a bit suspicious that no mice can be found in the collection:

$ zcat content-name-image_251f.tsv.gz | grep "Mus musculus" | wc -l
0

whereas https://www.gbif.org/occurrence/gallery?basis_of_record=PRESERVED_SPECIMEN&media_type=StillImage&taxon_key=7429082&occurrence_status=present has a bunch of images. See attached screenshot.

[Screenshot from 2022-08-18 11-31-43]

jhpoelen commented 2 years ago

For the mouse example, I tracked down a dataset with some mice in it, and confirmed that these images are picked up using the developed image indexing method.

$ preston track "https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip"
...
$ cat content-name-image_cf58.tsv | grep "Mus"
262604  Mus musculus    cf58    2022-08-18
262604  Mus musculus    cf58    2022-08-18
262604  Mus musculus    cf58    2022-08-18
288952  Mus musculus    cf58    2022-08-18
288952  Mus musculus    cf58    2022-08-18
288952  Mus musculus    cf58    2022-08-18

So, it might just be a dataset availability issue. These issues should come out once more snapshots are processed, and more should be available soon. Thanks for being patient as I keep detailed notes on this. I am assuming that the problem lies in my method, so I am carefully tracing the evidence.

jhpoelen commented 2 years ago

And, going back to 2021-07-01, I can confirm that the same NEON dataset did include a mouse record . . . but no images were available for it as far as I can tell.

associated coreIds

$ preston cat --remote https://linker.bio hash://sha256/251fa349c051bbda370decb7e5e58960d702add59f6e131ebf7c960d0f93b417\
 | grep "https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip"\
 | preston dwc-stream --remote https://linker.bio\
 | grep "Mus mus"\
 | jq '.["http://rs.tdwg.org/dwc/text/id"]'
"262604"
"395539"

associated occurrenceIds

$ preston cat --remote https://linker.bio hash://sha256/251fa349c051bbda370decb7e5e58960d702add59f6e131ebf7c960d0f93b417\
 | grep "https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip"\
 | preston dwc-stream --remote https://linker.bio\
 | grep "Mus mus"\
 | jq '.["http://rs.tdwg.org/dwc/terms/occurrenceID"]'
"NEON01IAG"
"NEON01IPX"

But no images were present for these specimen records at that time:

$ preston cat --remote https://linker.bio hash://sha256/251fa349c051bbda370decb7e5e58960d702add59f6e131ebf7c960d0f93b417\
 | grep "https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip"\
 | preston dwc-stream --remote https://linker.bio\
 | grep "Mus mus"\
 | grep -E "262604|395539"\
 | grep StillImage\
 | wc -l
0

@timrobertson100 any way to confirm that at 2021-07-01 dataset https://www.gbif.org/dataset/f28219f3-49cd-4b55-99ad-39f3ced87ddd NEON Biorepository Mammal Collection (Vouchers [Standard Sampling]) had the mouse records, but not yet had the associated mouse images as suggested above?

matdillen commented 2 years ago

I wonder if there's a way to get a hold of the metadata of the images GBIF caches?

timrobertson100 commented 2 years ago

Would this include a non-blank scientific name for the associated record?

Yes, e.g. these 981k Plant records which match to higher ranks in the backbone - some may have names that the backbone doesn't deal with though. We'd also assemble names from e.g. genus and specificEpithet

How do you normalize the stillimage records?

As far as I can read, the same places as you do - dwc:associatedMedia, the Audubon Core, SimpleImage and MultimediaExtension and also the ABCD records. You can see which extensions are in play if you use GBIF.org occurrence search.


Prior to parsing using the gbif name parser, do you do any name scrubbing?

We use the species/match API primarily with a little cleaning beforehand (e.g. for Abies alba)

@timrobertson100 do the results from the GBIF gallery count images or records with images?

Records, not images

@timrobertson100 any way to confirm that at 2022-07-01 dataset https://www.gbif.org/dataset/f28219f3-49cd-4b55-99ad-39f3ced87ddd NEON Biorepository Mammal Collection (Vouchers [Standard Sampling]) had the mouse records, but not yet had the associated mouse images as suggested above?

Assuming that is the date you took a copy, the closest ingestion we have before that was on 2022-06-23 20:41. The archive we crawled is here (temp location). If you look in the multimedia file you'll find record ID 262604 has this image URL which is a mouse.

I hope this helps.

timrobertson100 commented 2 years ago

I wonder if there's a way to get a hold of the metadata of the images GBIF caches?

I might misunderstand something, but the full metadata should be in the related image extension included in the monthly DwC-A format snapshot we produce. Or do I misunderstand please?

matdillen commented 2 years ago

I wonder if there's a way to get a hold of the metadata of the images GBIF caches?

I might misunderstand something, but the full metadata should be in the related image extension included in the monthly DwC-A format snapshot we produce. Or do I misunderstand please?

Sorry I meant the image file metadata, which may include timestamps. The derivatives produced by the zoom tool exclude these, but there may have been more in the originals.

jhpoelen commented 2 years ago

@timrobertson100 thanks for providing a historic DwC-A.

Using your (custom?) link:

$ preston track https://download.gbif.org/tim/f28219f3-49cd-4b55-99ad-39f3ced87ddd.139.dwca
...
<https://download.gbif.org/tim/f28219f3-49cd-4b55-99ad-39f3ced87ddd.139.dwca> <http://purl.org/pav/hasVersion> <hash://sha256/d069703fc281c5a0d5b12992aa88da78fb709fed3c40202ce54c8e92a05001b1> <urn:uuid:24027b6a-52f1-4bb1-a3f8-8fe5b156e78e> .
...
$ cat content-name-image_c3a2.tsv | grep Mus
262604  Mus musculus    c3a2    2022-08-18
262604  Mus musculus    c3a2    2022-08-18
262604  Mus musculus    c3a2    2022-08-18
288952  Mus musculus    c3a2    2022-08-18
288952  Mus musculus    c3a2    2022-08-18
288952  Mus musculus    c3a2    2022-08-18

confirming your observation that the mouse images exist.

However, the file you presented was different from the one I observed over a year ago, on 2021-07-02T02:55:07.136Z, as documented in the provenance log below.

The dwca I saw had content id: hash://sha256/3d4cbfb3b72738421a3acad94809795dee4d7bb38e12d0ede1a35ae20554e56d

whereas the file you shared via https://download.gbif.org/tim/f28219f3-49cd-4b55-99ad-39f3ced87ddd.139.dwca had a content id of: hash://sha256/d069703fc281c5a0d5b12992aa88da78fb709fed3c40202ce54c8e92a05001b1

I am now trying to figure out why the two files are different.

$ preston cat --remote https://linker.bio 'line:hash://sha256/251fa349c051bbda370decb7e5e58960d702add59f6e131ebf7c960d0f93b417!/L234648-L234656'
<f28219f3-49cd-4b55-99ad-39f3ced87ddd> <http://www.w3.org/ns/prov#hadMember> <https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip> <urn:uuid:15bb8480-eda7-4080-9716-02b43cc07be8> .
<https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip> <http://purl.org/dc/elements/1.1/format> "application/dwca" <urn:uuid:15bb8480-eda7-4080-9716-02b43cc07be8> .
<hash://sha256/3d4cbfb3b72738421a3acad94809795dee4d7bb38e12d0ede1a35ae20554e56d> <http://www.w3.org/ns/prov#wasGeneratedBy> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
<hash://sha256/3d4cbfb3b72738421a3acad94809795dee4d7bb38e12d0ede1a35ae20554e56d> <http://www.w3.org/ns/prov#qualifiedGeneration> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
<urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <http://www.w3.org/ns/prov#generatedAtTime> "2021-07-02T02:55:07.136Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
<urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
<urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <http://www.w3.org/ns/prov#wasInformedBy> <urn:uuid:15bb8480-eda7-4080-9716-02b43cc07be8> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
<urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <http://www.w3.org/ns/prov#used> <https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
<https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip> <http://purl.org/pav/hasVersion> <hash://sha256/3d4cbfb3b72738421a3acad94809795dee4d7bb38e12d0ede1a35ae20554e56d> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
jhpoelen commented 2 years ago

Oops. After checking the eml of the file you shared via

$ preston cat 'zip:hash://sha256/d069703fc281c5a0d5b12992aa88da78fb709fed3c40202ce54c8e92a05001b1!/eml.xml'\
 | xmllint -format -\
 | grep pubDate
    <pubDate>2022-06-23</pubDate>

I realized I made a (now corrected) typo in my earlier comment https://github.com/bio-guoda/preston/issues/168#issuecomment-1219729828 : I was referring to the date 2021-07-01, not 2022-07-01.

@timrobertson100 Apologies. Any chance you can retrieve your cached dwca from a year earlier?

jhpoelen commented 2 years ago

hey @qgroom I took some time to create some copy-paste examples for you to enjoy. I hope you do! And . . . curious to hear your thoughts on https://github.com/jhpoelen/specimen-image-index#specimen-image-index .

timrobertson100 commented 2 years ago

Sorry I meant the image file metadata, which may include timestamps. The derivatives produced by the zoom tool exclude these, but there may have been more in the originals.

Sorry @matdillen, we don't have that. Only the URLs I'm afraid, and the likes of Audubon Core.

timrobertson100 commented 2 years ago

Any chance you can retrieve your cached dwca from a year earlier?

@jhpoelen I've checked the archives, and confirm the 2 mouse records in the archives don't have images. I've added a few more archives on the temp location and the ingestion history will give you the dates.

timrobertson100 commented 2 years ago

I'm not sure it is helpful at this stage, but the way I'd recommend the original chart be made @qgroom is to download a series of the monthly DwC-As, which date back to May 2018, and then script a join to count records, i.e. the equivalent of:

SELECT count(DISTINCT gbifID)
FROM interpreted JOIN multimedia USING (gbifID)
WHERE basisOfRecord = 'PRESERVED_SPECIMEN' AND kingdomKey = ...
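
For what it's worth, the same join-and-count can be approximated over raw tab-separated files with awk alone. The sketch below uses invented toy tables; real inputs would be the interpreted and multimedia files of a monthly DwC-A, keyed on gbifID.

```shell
# Toy "interpreted" table: gbifID <tab> basisOfRecord
printf '1\tPRESERVED_SPECIMEN\n2\tHUMAN_OBSERVATION\n3\tPRESERVED_SPECIMEN\n' > interpreted.tsv
# Toy "multimedia" table: gbifID <tab> image URL; record 1 has two images
printf '1\thttps://example.org/a.jpg\n1\thttps://example.org/b.jpg\n3\thttps://example.org/c.jpg\n' > multimedia.tsv

# count(distinct gbifID) over the join, keeping preserved specimens only:
# pass 1 remembers qualifying gbifIDs, pass 2 counts each joined ID once.
awk -F'\t' '
  NR == FNR { if ($2 == "PRESERVED_SPECIMEN") keep[$1] = 1; next }
  ($1 in keep) && !($1 in seen) { seen[$1] = 1; n++ }
  END { print n }
' interpreted.tsv multimedia.tsv   # prints 2 for the toy data
```

Counting distinct IDs (rather than joined rows) matches the "records, not images" convention discussed above.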
jhpoelen commented 2 years ago

@timrobertson100 thanks for confirming that

2021-07-01 dataset https://www.gbif.org/dataset/f28219f3-49cd-4b55-99ad-39f3ced87ddd NEON Biorepository Mammal Collection (Vouchers [Standard Sampling]) had the mouse records, but not yet had the associated mouse images as suggested above?

Also, I leave it up to @qgroom to use whatever method is best suited to produce the desired results.

jhpoelen commented 2 years ago

@qgroom I managed to extract the following measurements using the make.sh script as described in https://github.com/jhpoelen/specimen-image-index .

Temporary download link to the full results zip with intermediate results (>400MB zipfile): https://send.tresorit.com/a#5wAllb_q0khty6N1SOplaA .

The table below was built using:

cat plantae_insecta_or_mammalia_image_*\
 | sort\
 | uniq -c\
 | mlr --implicit-csv-header --tsvlite reorder -f 4,3,2,1\
 | mlr --tsvlite sort -f 3\
 | mlr --itsvlite --omd cat
| date | prov id | count | name |
| --- | --- | --- | --- |
| 2019-03-01 | 2611 | 375999 | Insecta |
| 2019-03-01 | 2611 | 10887 | Mammalia |
| 2019-03-01 | 2611 | 6275820 | Plantae |
| 2019-06-01 | 9a41 | 167034 | Insecta |
| 2019-06-01 | 9a41 | 10930 | Mammalia |
| 2019-06-01 | 9a41 | 6661441 | Plantae |
| 2019-07-01 | b986 | 245668 | Insecta |
| 2019-07-01 | b986 | 10647 | Mammalia |
| 2019-07-01 | b986 | 6652128 | Plantae |
| 2020-08-01 | 8ff4 | 470966 | Insecta |
| 2020-08-01 | 8ff4 | 12012 | Mammalia |
| 2020-08-01 | 8ff4 | 7731709 | Plantae |
| 2021-07-01 | 251f | 587806 | Insecta |
| 2021-07-01 | 251f | 19321 | Mammalia |
| 2021-07-01 | 251f | 11324522 | Plantae |
| 2022-07-01 | da74 | 908204 | Insecta |
| 2022-07-01 | da74 | 18470 | Mammalia |
| 2022-07-01 | da74 | 13644513 | Plantae |

To give an idea of how many imaged preserved specimens had their names matched . . . here is the total number of detected preserved specimens with at least one image per snapshot:

cat content-name-image_*\
 | cut -f3,4\
 | sort\
 | uniq -c\
 | mlr --implicit-csv-header --tsvlite reorder -f 2,1\
 | mlr --tsvlite sort -f 2\
 | mlr --itsvlite --omd cat
| date | total preserved specimen with image | prov id |
| --- | --- | --- |
| 2019-03-01 | 8883424 | 2611 |
| 2019-06-01 | 8903058 | 9a41 |
| 2019-07-01 | 9027607 | b986 |
| 2020-08-01 | 10141281 | 8ff4 |
| 2021-07-01 | 13874481 | 251f |
| 2022-07-01 | 16522128 | da74 |

This can be integrated into the first table if needed, to get a sense of the relative contributions of mammals, insects, and plants to the total of detected preserved specimens as documented in make.sh.

| date | name | count |
| --- | --- | --- |
| 2019-03-01 | Insecta | 375999 |
| 2019-03-01 | Mammalia | 10887 |
| 2019-03-01 | Plantae | 6275820 |
| 2019-03-01 | any | 8883424 |
| 2019-06-01 | Insecta | 167034 |
| 2019-06-01 | Mammalia | 10930 |
| 2019-06-01 | Plantae | 6661441 |
| 2019-06-01 | any | 8903058 |
| 2019-07-01 | Insecta | 245668 |
| 2019-07-01 | Mammalia | 10647 |
| 2019-07-01 | Plantae | 6652128 |
| 2019-07-01 | any | 9027607 |
| 2020-08-01 | Insecta | 470966 |
| 2020-08-01 | Mammalia | 12012 |
| 2020-08-01 | Plantae | 7731709 |
| 2020-08-01 | any | 10141281 |
| 2021-07-01 | Insecta | 587806 |
| 2021-07-01 | Mammalia | 19321 |
| 2021-07-01 | Plantae | 11324522 |
| 2021-07-01 | any | 13874481 |
| 2022-07-01 | Insecta | 908204 |
| 2022-07-01 | Mammalia | 18470 |
| 2022-07-01 | Plantae | 13644513 |
| 2022-07-01 | any | 16522128 |
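
As a sketch of that integration, the "any" rows can serve as denominators. Assuming two small tab-separated files derived from the tables above (the file names are invented for illustration), an awk join yields per-kingdom fractions of the imaged-specimen total:

```shell
# Per-snapshot totals (date <tab> total), taken from the "any" rows above.
printf '2019-03-01\t8883424\n2021-07-01\t13874481\n' > totals.tsv
# Per-name counts (date <tab> name <tab> count) for one kingdom.
printf '2019-03-01\tPlantae\t6275820\n2021-07-01\tPlantae\t11324522\n' > plantae.tsv

# Join on date and express each count as a fraction of the snapshot total.
awk -F'\t' '
  NR == FNR { total[$1] = $2; next }
  { printf "%s\t%s\t%.2f\n", $1, $2, $3 / total[$1] }
' totals.tsv plantae.tsv
```

For these numbers, the Plantae share of imaged preserved specimen records grows from roughly 0.71 (2019-03-01) to 0.82 (2021-07-01).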

I realize that the snapshot sampling is a bit funny, but we can rerun with more specific dates if needed.

Note also that the number of preserved specimens with images is not guaranteed to go up . . . consistent with the fact that datasets are known to disappear or otherwise change. However, overall, it appears I was able to recreate your initial sketch. Lots of stuff to think about. Please let me know if you'd like to chat about this today.

jhpoelen commented 2 years ago

And . . . what's going on with those mammals? A bit camera shy? Or perhaps they write down their DwC-As differently than their botanist or entomologist colleagues.

jhpoelen commented 2 years ago

Some more summary data:

cat content-with-still-images_*.tsv\
 | cut -f2,3\
 | sort\
 | uniq -c\
 | mlr --implicit-csv-header --tsvlite reorder -f 4,3,2,1\
 | mlr --tsvlite sort -f 2\
 | mlr --itsvlite --omd cat
| date | datasets with still images | prov id |
| --- | --- | --- |
| 2019-03-01 | 387 | 2611 |
| 2019-03-31 | 545 | 5a39 |
| 2019-06-01 | 503 | 9a41 |
| 2019-07-01 | 525 | b986 |
| 2020-08-01 | 608 | 8ff4 |
| 2020-11-01 | 585 | d98e |
| 2021-07-01 | 705 | 251f |
| 2021-11-01 | 587 | 83b4 |
| 2022-07-01 | 899 | da74 |

I would have expected this number to be a bit higher, because I suspect that most Plazi treatments have still images in them, and Plazi treatments make up most of the datasets registered in GBIF. If I remember correctly, there are over 10k Plazi datasets registered with GBIF.

cat content-with-still-images-and-specimen_*.tsv\
 | cut -f2,3\
 | sort\
 | uniq -c\
 | mlr --implicit-csv-header --tsvlite reorder -f 4,3,2,1\
 | mlr --tsvlite sort -f 2\
 | mlr --itsvlite --omd cat
| date | datasets with images and preserved specimen | prov id |
| --- | --- | --- |
| 2019-03-01 | 371 | 2611 |
| 2019-06-01 | 374 | 9a41 |
| 2019-07-01 | 397 | b986 |
| 2020-08-01 | 458 | 8ff4 |
| 2022-07-01 | 645 | da74 |

This shows that, in addition to an increased number of registered images for preserved specimens, the number of datasets with preserved specimens went up as well. It remains to be seen which collections drove most of the additions of preserved specimen images. Perhaps a neat follow-up question to consider.

jhpoelen commented 2 years ago

@timrobertson100 re: GBIF monthly snapshots - do you have similar snapshots of the source data that these snapshots were derived from?

Having access to your source data would help me verify the results above without adding the (useful) magic that GBIF applies to process the original data (e.g., compiling taxonomic names from various sources, scrubbing and parsing of taxonomic names, taxonomic name matching against some specific version of the GBIF backbone taxonomy, caching of no longer available datasets).