Closed: jhpoelen closed this issue 2 years ago.
sketch of workflow:
```
preston ls --remote https://deeplinker.bio\
| preston dwc-stream\
| (insert magic jq selectors)\
| (do name resolution)\
| group by prov hash
```
hey @qgroom -
I am looking to generate your graph of # images / time (still interested?).
And now, I am trying to compile the magic list of image URIs to use.
So far, I have:
- http://rs.tdwg.org/ac/terms/accessURI
- http://rs.tdwg.org/ac/terms/thumbnailAccessURI
- http://rs.tdwg.org/ac/terms/goodQualityAccessURI
and digging for some more.
Can you think of any other URIs that are used to publish image URLs?
Also, perhaps:
in http://rs.gbif.org/extension/gbif/1.0/images.xml where http://purl.org/dc/terms/type = StillImage
and
in http://rs.gbif.org/extension/gbif/1.0/multimedia.xml, also http://purl.org/dc/terms/identifier where http://purl.org/dc/terms/type = StillImage
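For what it's worth, that kind of selection could be sketched with plain shell over the newline-delimited JSON that `preston dwc-stream` emits. The records and URLs below are made-up stand-ins, and a proper jq selector would be more robust than grep on real data:

```shell
# Hypothetical sketch: keep records typed as StillImage and pull out
# accessURI values. records.jsonl stands in for `preston dwc-stream` output.
cat <<'EOF' > records.jsonl
{"http://purl.org/dc/terms/type":"StillImage","http://rs.tdwg.org/ac/terms/accessURI":"https://example.org/img/1.jpg"}
{"http://purl.org/dc/terms/type":"Sound","http://rs.tdwg.org/ac/terms/accessURI":"https://example.org/audio/1.mp3"}
EOF
grep '"http://purl.org/dc/terms/type":"StillImage"' records.jsonl \
  | grep -o '"http://rs.tdwg.org/ac/terms/accessURI":"[^"]*"' \
  | cut -d'"' -f4
# prints: https://example.org/img/1.jpg
```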
@qgroom making small steps towards creating your graph and attempting to document methods at:
Status update - on a single thread, with a spinning external hard disk connected to an 11-year-old laptop, I was able to detect 115M image records in about 13 hours. I have yet to filter to preserved specimens only, assign timestamps to the tracked dwca, and do taxonomic name alignment. Small steps . . .
Note that with an SSD and parallel processing, I expect much faster processing speeds. The metrics documented here are pretty much a worst-case scenario performance-wise.
```
$ ./find-image-records.sh | pv -l | gzip > image-records.json.gz
...
```
Hi Jorrit, thanks for pushing forward on this, and apologies for my slow response! I had SPNHC followed by COVID, so I'm getting way behind on emails, but I'm still very interested in the results! I'm guessing the number of images will drop considerably when you select only preserved specimens. Do you need taxonomic name alignment? Aren't they already aligned to the GBIF Taxonomic Backbone?
hey @qgroom - good to hear from you. Sounds like SPNHC was quite something.
> I'm guessing the number of images will drop considerably when you select only preserved specimens.
Yes, preserved specimens are in the relative minority, especially compared with human observations from eBird/iNaturalist and friends, which account for about half of the total number of GBIF occurrences.
> Do you need taxonomic name alignment, aren't they already aligned to the GBIF Taxonomic Backbone?
Preston tracks the "raw" datasets as registered with GBIF/iDigBio, so this does not include the interpreted data products that GBIF provides. I believe using the raw data is important to retain the provenance of the knowledge and to allow for re-interpretation of taxonomic (or other) aspects of the provided datasets.
In other words, provided data is not yet aligned with any particular taxonomy: names are as provided. Nomer or similar tools can be used to perform taxonomic name alignment.
What, if any, taxonomic backbone would you like to align with? Which version?
Here are some questions that come up for me as I am mulling over your earlier comments.
Would you individually count multiple images for a single specimen?
Or, would you consider a specimen with one or more images to be sufficient?
What if the image is a label image only?
> What, if any, taxonomic backbone would you like to align with? Which version?
I don't really mind. Given a choice, I would just like to be able to separate the results into some broad classifications, such as insects, plants, birds and mammals. However, even that is not 100% necessary. I think it would be more interesting for people if they were split up, but if it is too difficult it is not critical information for the paper.
> Would you individually count multiple images for a single specimen?
No, I want the number of imaged specimens. I don't care if they have multiple images.
> Or, would you consider a specimen with one or more images to be sufficient?
Yes
> What if the image is a label image only?
That's good enough for me
The point is to demonstrate the growth in specimen digitization and how this is becoming an important corpus of images suitable for machine learning.
@qgroom thanks for clarifying, I think I have enough information. Now the "only" thing left is to run the analysis, do name alignment, and collect results . . . hopefully more later soon. Please bug me if I am taking too long.
Hi Jorrit, do you have any time to look into this? I want to submit the image infrastructure paper soon and it would be great if we could include this.
am still catching up from my two weeks of travel.
When would you like to have it?
And, how much time can you or colleagues spend on this? From my last visit, it seems that you, Matt and Pieter (e.g., @matdillen @PietrH) probably have the expertise to generate the imagined graph independently from publicly available data tracked by Preston. It would also be a way to independently verify the data tracking method and related tools (e.g., preston, nomer). However, I do realize the learning curve might be a bit challenging.
Am trying to strike a balance between doing the work and delegating where I can. Please advise. I can try to see what I can do.
> am still catching up from my two weeks of travel.
Yes, I can imagine. I hope you had a good trip.
> When would you like to have it?
ASAP. The paper is nearly ready to go.
> how much time can you or colleagues spend on this?
Not much, but some.
> probably have the expertise to generate the imagined graph independently...
Yes, I think we could figure it out, but I doubt we could do it in time for this paper.
I think it is worth doing, but perhaps we should leave it out of the paper for now
@qgroom thanks for clarifying. I'll try to get something together this week. If I can't make that, we might have to discuss alternatives.
Thanks! Like I say, if it proves difficult to do quickly I still think it is worth doing, perhaps to add to the paper when it comes back for revisions. I can imagine many people will be interested in the result. Though I am very curious how it compares with my imagination
I am working on this today - will give you update at the end of my day at the latest.
@qgroom oef. I am amazed by how hard it is to make things simple.
I've completed a first (big) step - a script to select all archives seen at a specific time that contain still images as well as preserved specimens. Am running the script now. See https://github.com/jhpoelen/specimen-image-index .
A useful by-catch of this exercise may be an exhaustive list of specimen image archives (aka a specimen image index).
Just started the process to generate this list. As you might expect, this is a non-trivial exercise made somewhat doable using preston and friends.
Next step - link the image information to the taxonomic / specimen information, resolve related taxonomic information, and put it into the buckets "insects", "plants", and "mammals".
Apologies for the delay. This is a labor and computationally intense task.
For what it is worth - the make.sh in https://github.com/jhpoelen/specimen-image-index is built to run in a parallel workflow. Currently running on > 8 cores on some undisclosed server somewhere in Europe.
. . . hours later . . . finished a first pass at counting images across time.
Am running on the server as we speak.
The idea is that the resulting table should look something like:
name | provIdShort | provDate |
---|---|---|
Insecta | 3df3 | 2022-07-22 |
Insecta | 3df3 | 2022-07-22 |
... | ... | ... |
where each row represents a named specimen record with one or more images. The names are either "Insecta", "Mammalia", or "Plantae".
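A minimal sketch of the per-specimen counting implied here, assuming each row also carries a record id so that per-image duplicate rows can be collapsed before counting (the sample rows below are made up):

```shell
# Collapse rows that repeat once per image into one row per imaged specimen
# record (sort -u), then count per name. Columns: id, name, provIdShort, provDate.
printf '262604\tInsecta\t3df3\t2022-07-22\n262604\tInsecta\t3df3\t2022-07-22\n288952\tPlantae\t3df3\t2022-07-22\n' > rows.tsv
sort -u rows.tsv | cut -f2 | sort | uniq -c
# one line per name; each specimen counted once regardless of image count
```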
Very cool! I think I can follow how the code works, and I can see it is quite intensive. My go-to method would have been to load it all into a database, but that would have taken even more effort, I imagine.
> take 10 (random) preston snapshots
How do you ensure you get a good spread over time? Or do I misunderstand this?
> How do you ensure you get a good spread over time? Or do I misunderstand this?
@qgroom great question! The random sample was a bit of a . . . random sample. If you'd like to cherry-pick dates, you can do that too. What did you have in mind?
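The random sampling step could be sketched with `shuf`, assuming the available snapshots are listed one per line (snapshots.txt below is a made-up stand-in for the tracked snapshot list):

```shell
# Draw a random sample of snapshots, in the spirit of
# "take 10 (random) preston snapshots".
printf 'snap-%s\n' 2019-03-01 2019-06-01 2019-07-01 2020-08-01 2021-07-01 > snapshots.txt
shuf -n 2 snapshots.txt    # sample 2 of the 5 lines here; -n 10 in the real run
```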
The snapshot sampling frequency has been monthly since about 2019-03. Prior to that, weekly snapshots were taken starting 2018-09, but those logs have a slightly different format, making them a little more difficult to process with the methods implemented.
Btw - make.sh was able to process about 4-5 snapshots overnight without much optimization. However, I forgot to install "miller" https://miller.readthedocs.io/ on the server, so the last part of processing (join/align) is still missing. I have installed it now, and am hoping to get some results later in the day.
We now have a neat list of tracked content with "traits" like "has still image" or "has preserved specimen". In other words, we've built a time series of digitized collections of preserved specimens with images as seen by iDigBio / GBIF / BioCaSE. That is something, right? If you'd like, I can share some preliminary results on that.
Also, yes, I agree that make.sh looks a little involved. I'd be curious to see an alternate approach that you or your colleagues can come up with . . .
> The snapshot sampling frequency has been monthly since about 2019-03. Prior to that, weekly snapshots were taken starting 2018-09, but those logs have a slightly different format, making them a little more difficult to process with the methods implemented.
Perhaps we will end up with a reasonable time series by random chance.
> Btw - make.sh was able to process about 4-5 snapshots overnight without much optimization. However, I forgot to install "miller" https://miller.readthedocs.io/ on the server, so the last part of processing (join/align) is still missing. I have installed it now, and am hoping to get some results later in the day.
I'm sitting on the edge of my chair :pray:
> We now have a neat list of tracked content with "traits" like "has still image" or "has preserved specimen". In other words, we've built a time series of digitized collections of preserved specimens with images as seen by iDigBio / GBIF / BioCaSE. That is something, right? If you'd like, I can share some preliminary results on that.
That is indeed something! Yes, I'd love to see some results
> Also, yes, I agree that make.sh looks a little involved. I'd be curious to see an alternate approach that you or your colleagues can come up with . . .
It didn't look that bad. In fact, with example code like this it would be possible to adapt it for other purposes as well. Most of my coding is done by copying examples 😉
Ok, will keep you posted on results. Apologies for the suspense . . . :wink:
Also, as far as copy/pasting goes . . . I am a big fan of copy/pasting also.
Would it make sense to turn this into a little copy-paste-compatible mini-tutorial on how to process all the data using the streaming approaches that I used? If so, what format would you like this to be? Extended readme? Or something more involved?
> Would it make sense to turn this into a little copy-paste-compatible mini-tutorial on how to process all the data using the streaming approaches that I used? If so, what format would you like this to be? Extended readme? Or something more involved?
Certainly! An extended readme would be fine.
I wonder whether we could make this a boxed text within the paper. It might be a good advert for the method.
The first result just rolled in:
```
$ cat plantae_insecta_or_mammalia_image_251f.tsv | sort | uniq -c
  587806 Insecta 251f 2021-07-01
   19321 Mammalia 251f 2021-07-01
11324522 Plantae 251f 2021-07-01
```
Which would be: 11M plants / 588k insects / 19k mammals . . . preserved specimen records with still images, observed via Preston's 2021-07-01 snapshot.
Seems a bit low, probably due to name alignment issues.
We could run the results past @timrobertson100. He will probably know if we're in the right ball park
For the 2021-07-01 snapshot 13M preserved specimen records with still images were extracted.
```
$ cat content-name-image_251f.tsv | wc -l
13874481
```
Apparently, our methods picked up only 705 datasets that contained still images:
```
$ cat content-with-still-images_251f.tsv | wc -l
705
```
and 1158 datasets with multimedia references in their schema.
```
$ cat content-with-multimedia_251f.tsv | wc -l
1158
```
So, name alignment might not be a major factor . . . Can I walk you through the extraction process via video chat? That way, we can review the method I use to detect the preserved specimens and still images.
On the bright side, the method is applied consistently, so a subset might still highlight a representative relative change in image availability for preserved specimens.
I suppose it would be good to calculate the total for all taxa, so that if there was loss due to name alignment we still have that total too. Is that doable without having to extract again?
I've shared the intermediate results for the 2021-07-01 snapshot via temporary download link at: https://send.tresorit.com/a#ix7d4A9Tj1Ka4RUPXPRGDw
Curious to hear your thoughts.
> We could run the results past @timrobertson100. He will probably know if we're in the right ball park
Using GBIF.org with filters for preserved specimen records having images, you'd get 38M plant records, 638k mammal records and 3.4M insect records.
Please shout if you need anything beyond that.
@timrobertson100 thanks for looking this up!
Some questions:
According to my records for the 2021-07-01 snapshot (over 1 year ago), there's a bunch (>900k) of records with blank scientific names with still images, which might account for some of the differences.
Here's the top 10 most frequently occurring scientificNames in the tracked dataset.
```
$ zcat content-name-image_251f.tsv.gz | cut -f2 | sort | uniq -c | sort -nr | head
 951474
  12516 Demospongiae
  11204 Plantae
  10587 Carex
  10307 Fungi
   8927 Polystichum acrostichoides
   8032 Asplenium platyneuron
   8012 Equisetum arvense
   7976 Dicranum scoparium
   7969 Cyperus strigosus
```
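A minimal sketch of how blank-name rows like the unnamed 951474 above can be counted, using made-up sample data with the scientificName in column 2:

```shell
# Count rows whose scientificName column (field 2) is blank.
# names.tsv is made-up sample data; row 2 has an empty name.
printf '1\tCarex\n2\t\n3\tFungi\n' > names.tsv
awk -F'\t' '$2 == "" { n++ } END { print n+0 }' names.tsv
# prints: 1
```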
Also, please note that I included a single row per record if one or more images were present. So, if one specimen has 10 images, only a single row is included. @timrobertson100 do the results from the GBIF gallery count images or records with images?
And, a bit suspicious that no mice can be found in the collection:
```
$ zcat content-name-image_251f.tsv.gz | grep "Mus musculus" | wc -l
0
```
whereas https://www.gbif.org/occurrence/gallery?basis_of_record=PRESERVED_SPECIMEN&media_type=StillImage&taxon_key=7429082&occurrence_status=present has a bunch of images. See attached screenshot.
For the mouse example, I tracked down a dataset with some mice in it, and confirmed that these images are picked up using the developed image indexing method.
```
$ preston track "https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip"
...
$ cat content-name-image_cf58.tsv | grep "Mus"
262604 Mus musculus cf58 2022-08-18
262604 Mus musculus cf58 2022-08-18
262604 Mus musculus cf58 2022-08-18
288952 Mus musculus cf58 2022-08-18
288952 Mus musculus cf58 2022-08-18
288952 Mus musculus cf58 2022-08-18
```
So, it might just be a dataset availability issue. These issues should come out once more snapshots are processed. And more should be available soon. Thanks for being patient as I am keeping detailed notes on this. I am assuming that the problem lies in my method, so I am carefully tracing the evidence.
And, going back to 2021-07-01, I can confirm that the same NEON dataset did include a mouse record . . . but no images were available for it as far as I can tell.
associated coreIds
```
$ preston cat --remote https://linker.bio hash://sha256/251fa349c051bbda370decb7e5e58960d702add59f6e131ebf7c960d0f93b417\
| grep "https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip"\
| preston dwc-stream --remote https://linker.bio\
| grep "Mus mus"\
| jq '.["http://rs.tdwg.org/dwc/text/id"]'
"262604"
"395539"
```
associated occurrenceIds
```
$ preston cat --remote https://linker.bio hash://sha256/251fa349c051bbda370decb7e5e58960d702add59f6e131ebf7c960d0f93b417\
| grep "https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip"\
| preston dwc-stream --remote https://linker.bio\
| grep "Mus mus"\
| jq '.["http://rs.tdwg.org/dwc/terms/occurrenceID"]'
"NEON01IAG"
"NEON01IPX"
```
But no images were present for these specimen records at that time:
```
$ preston cat --remote https://linker.bio hash://sha256/251fa349c051bbda370decb7e5e58960d702add59f6e131ebf7c960d0f93b417\
| grep "https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip"\
| preston dwc-stream --remote https://linker.bio\
| grep "Mus mus"\
| grep -E "262604|395539"\
| grep StillImage\
| wc -l
0
```
@timrobertson100 any way to confirm that at 2021-07-01 dataset https://www.gbif.org/dataset/f28219f3-49cd-4b55-99ad-39f3ced87ddd NEON Biorepository Mammal Collection (Vouchers [Standard Sampling]) had the mouse records, but not yet had the associated mouse images as suggested above?
I wonder if there's a way to get a hold of the metadata of the images GBIF caches?
Would this include a non-blank scientific name for the associated record?
Yes, e.g. these 981k Plant records which match to higher ranks in the backbone - some may have names that the backbone doesn't deal with though. We'd also assemble names from e.g. genus and specificEpithet.
> How do you normalize the stillimage records?
As far as I can read, the same places as you do - dwc:associatedMedia, the Audubon Core, SimpleImage and MultimediaExtension, and also the ABCD records. You can see which extensions are in play if you use GBIF.org occurrence search.
> Prior to parsing using the gbif name parser, do you do any name scrubbing?
We use the species/match API primarily, with a little cleaning beforehand (e.g. for Abies alba).
> @timrobertson100 do the results from the GBIF gallery count images or records with images?
Records, not images
> @timrobertson100 any way to confirm that at 2022-07-01 dataset https://www.gbif.org/dataset/f28219f3-49cd-4b55-99ad-39f3ced87ddd NEON Biorepository Mammal Collection (Vouchers [Standard Sampling]) had the mouse records, but not yet had the associated mouse images as suggested above?
Assuming that is the date you took a copy, the closest ingestion we have before that was on 2022-06-23 20:41. The archive we crawled is here (temp location). If you look in the multimedia file, you'll find record ID 262604 has this image URL, which is a mouse.
I hope this helps.
> I wonder if there's a way to get a hold of the metadata of the images GBIF caches?
I might misunderstand something, but the full metadata should be in the related image extension included in the monthly DwC-A format snapshot we produce. Or do I misunderstand?
> I wonder if there's a way to get a hold of the metadata of the images GBIF caches?

> I might misunderstand something, but the full metadata should be in the related image extension included in the monthly DwC-A format snapshot we produce. Or do I misunderstand?
Sorry I meant the image file metadata, which may include timestamps. The derivatives produced by the zoom tool exclude these, but there may have been more in the originals.
@timrobertson100 thanks for providing a historic dwca. Using your (custom?) link:
```
$ preston track https://download.gbif.org/tim/f28219f3-49cd-4b55-99ad-39f3ced87ddd.139.dwca
...
<https://download.gbif.org/tim/f28219f3-49cd-4b55-99ad-39f3ced87ddd.139.dwca> <http://purl.org/pav/hasVersion> <hash://sha256/d069703fc281c5a0d5b12992aa88da78fb709fed3c40202ce54c8e92a05001b1> <urn:uuid:24027b6a-52f1-4bb1-a3f8-8fe5b156e78e> .
...
$ cat content-name-image_c3a2.tsv | grep Mus
262604 Mus musculus c3a2 2022-08-18
262604 Mus musculus c3a2 2022-08-18
262604 Mus musculus c3a2 2022-08-18
288952 Mus musculus c3a2 2022-08-18
288952 Mus musculus c3a2 2022-08-18
288952 Mus musculus c3a2 2022-08-18
```
confirming your observation that the mouse images exist.
However, the file you presented was different from the one I observed over a year ago on 2021-07-02T02:55:07.136Z, as documented in the provenance log below.
The dwca I saw had content id hash://sha256/3d4cbfb3b72738421a3acad94809795dee4d7bb38e12d0ede1a35ae20554e56d whereas the file you shared via https://download.gbif.org/tim/f28219f3-49cd-4b55-99ad-39f3ced87ddd.139.dwca had content id hash://sha256/d069703fc281c5a0d5b12992aa88da78fb709fed3c40202ce54c8e92a05001b1 .
I am now trying to figure out why the two files are different.
```
$ preston cat --remote https://linker.bio 'line:hash://sha256/251fa349c051bbda370decb7e5e58960d702add59f6e131ebf7c960d0f93b417!/L234648-L234656'
<f28219f3-49cd-4b55-99ad-39f3ced87ddd> <http://www.w3.org/ns/prov#hadMember> <https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip> <urn:uuid:15bb8480-eda7-4080-9716-02b43cc07be8> .
<https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip> <http://purl.org/dc/elements/1.1/format> "application/dwca" <urn:uuid:15bb8480-eda7-4080-9716-02b43cc07be8> .
<hash://sha256/3d4cbfb3b72738421a3acad94809795dee4d7bb38e12d0ede1a35ae20554e56d> <http://www.w3.org/ns/prov#wasGeneratedBy> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
<hash://sha256/3d4cbfb3b72738421a3acad94809795dee4d7bb38e12d0ede1a35ae20554e56d> <http://www.w3.org/ns/prov#qualifiedGeneration> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
<urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <http://www.w3.org/ns/prov#generatedAtTime> "2021-07-02T02:55:07.136Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
<urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
<urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <http://www.w3.org/ns/prov#wasInformedBy> <urn:uuid:15bb8480-eda7-4080-9716-02b43cc07be8> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
<urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <http://www.w3.org/ns/prov#used> <https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
<https://biorepo.neonscience.org/portal/content/dwca/NEON-MAMC-VSS_DwC-A.zip> <http://purl.org/pav/hasVersion> <hash://sha256/3d4cbfb3b72738421a3acad94809795dee4d7bb38e12d0ede1a35ae20554e56d> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
```
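As an aside, the generation timestamp can be pulled out of an N-Quads provenance fragment like this with plain grep; the quad below is copied into a file just for illustration:

```shell
# Extract the prov:generatedAtTime literal from an N-Quads fragment.
cat <<'EOF' > prov.nq
<urn:uuid:181f177a-937f-4367-b824-50d8882862c6> <http://www.w3.org/ns/prov#generatedAtTime> "2021-07-02T02:55:07.136Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:181f177a-937f-4367-b824-50d8882862c6> .
EOF
grep 'prov#generatedAtTime' prov.nq | grep -o '"[^"]*"' | head -1 | tr -d '"'
# prints: 2021-07-02T02:55:07.136Z
```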
Oops. After checking the eml of the file you shared via
```
$ preston cat 'zip:hash://sha256/d069703fc281c5a0d5b12992aa88da78fb709fed3c40202ce54c8e92a05001b1!/eml.xml'\
| xmllint -format -\
| grep pubDate
<pubDate>2022-06-23</pubDate>
```
I realized I made a (now corrected) typo in my earlier comment https://github.com/bio-guoda/preston/issues/168#issuecomment-1219729828 : I was referring to the date 2021-07-01, not 2022-07-01.
@timrobertson100 Apologies. Any chance you can retrieve your cached dwca from a year earlier?
hey @qgroom I took some time to create some copy-paste examples for you to enjoy. I hope you do! And . . . curious to hear your thoughts on https://github.com/jhpoelen/specimen-image-index#specimen-image-index .
> Sorry I meant the image file metadata, which may include timestamps. The derivatives produced by the zoom tool exclude these, but there may have been more in the originals.
Sorry @matdillen, we don't have that. Only the URLs I'm afraid, and the likes of Audubon Core.
> Any chance you can retrieve your cached dwca from a year earlier?
@jhpoelen I've checked the archives, and confirm the 2 mouse records in the archives don't have images. I've added a few more archives on the temp location and the ingestion history will give you the dates.
I'm not sure it is helpful at this stage, but the way I'd recommend the original chart be made, @qgroom, is to download a series of the monthly DwC-As, which date back to May 2018, and then script the join to count, i.e. the equivalent of:
```
SELECT count(DISTINCT gbifID)
FROM interpreted JOIN multimedia
WHERE basisOfRecord = 'PRESERVED_SPECIMEN' AND kingdomKey = ...
```
@timrobertson100 thanks for confirming that
> 2021-07-01 dataset https://www.gbif.org/dataset/f28219f3-49cd-4b55-99ad-39f3ced87ddd NEON Biorepository Mammal Collection (Vouchers [Standard Sampling]) had the mouse records, but not yet had the associated mouse images as suggested above?
Also, I leave it up to @qgroom to use whatever method suited to produce the desired results.
@qgroom I managed to extract the following measurements using the make.sh script as described in https://github.com/jhpoelen/specimen-image-index .
A temporary download link to the full results (a >400MB zipfile including intermediate results): https://send.tresorit.com/a#5wAllb_q0khty6N1SOplaA .
the table below was built using:
```
cat plantae_insecta_or_mammalia_image_*\
| sort\
| uniq -c\
| mlr --implicit-csv-header --tsvlite reorder -f 4,3,2,1\
| mlr --tsvlite sort -f 3\
| mlr --itsvlite --omd cat
```
date | prov id | count | name |
---|---|---|---|
2019-03-01 | 2611 | 375999 | Insecta |
2019-03-01 | 2611 | 10887 | Mammalia |
2019-03-01 | 2611 | 6275820 | Plantae |
2019-06-01 | 9a41 | 167034 | Insecta |
2019-06-01 | 9a41 | 10930 | Mammalia |
2019-06-01 | 9a41 | 6661441 | Plantae |
2019-07-01 | b986 | 245668 | Insecta |
2019-07-01 | b986 | 10647 | Mammalia |
2019-07-01 | b986 | 6652128 | Plantae |
2020-08-01 | 8ff4 | 470966 | Insecta |
2020-08-01 | 8ff4 | 12012 | Mammalia |
2020-08-01 | 8ff4 | 7731709 | Plantae |
2021-07-01 | 251f | 587806 | Insecta |
2021-07-01 | 251f | 19321 | Mammalia |
2021-07-01 | 251f | 11324522 | Plantae |
2022-07-01 | da74 | 908204 | Insecta |
2022-07-01 | da74 | 18470 | Mammalia |
2022-07-01 | da74 | 13644513 | Plantae |
To give an idea of how many imaged preserved specimens had their names matched . . . the total number of detected preserved specimens with at least one image per snapshot:
```
cat content-name-image_*\
| cut -f3,4\
| sort\
| uniq -c\
| mlr --implicit-csv-header --tsvlite reorder -f 2,1\
| mlr --tsvlite sort -f 2\
| mlr --itsvlite --omd cat
```
date | total preserved specimen with image | prov id |
---|---|---|
2019-03-01 | 8883424 | 2611 |
2019-06-01 | 8903058 | 9a41 |
2019-07-01 | 9027607 | b986 |
2020-08-01 | 10141281 | 8ff4 |
2021-07-01 | 13874481 | 251f |
2022-07-01 | 16522128 | da74 |
This can be integrated into the first table if needed, to get a sense of the relative contributions of mammals, insects, and plants toward the total of detected preserved specimens as documented in make.sh.
date | name | count |
---|---|---|
2019-03-01 | Insecta | 375999 |
2019-03-01 | Mammalia | 10887 |
2019-03-01 | Plantae | 6275820 |
2019-03-01 | any | 8883424 |
2019-06-01 | Insecta | 167034 |
2019-06-01 | Mammalia | 10930 |
2019-06-01 | Plantae | 6661441 |
2019-06-01 | any | 8903058 |
2019-07-01 | Insecta | 245668 |
2019-07-01 | Mammalia | 10647 |
2019-07-01 | Plantae | 6652128 |
2019-07-01 | any | 9027607 |
2020-08-01 | Insecta | 470966 |
2020-08-01 | Mammalia | 12012 |
2020-08-01 | Plantae | 7731709 |
2020-08-01 | any | 10141281 |
2021-07-01 | Insecta | 587806 |
2021-07-01 | Mammalia | 19321 |
2021-07-01 | Plantae | 11324522 |
2021-07-01 | any | 13874481 |
2022-07-01 | Insecta | 908204 |
2022-07-01 | Mammalia | 18470 |
2022-07-01 | Plantae | 13644513 |
2022-07-01 | any | 16522128 |
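As a quick sanity check on the table above, the share of imaged preserved-specimen records that the three named groups cover in the 2022-07-01 snapshot works out to roughly:

```shell
# Back-of-the-envelope check using the 2022-07-01 numbers from the table above.
awk 'BEGIN { named = 908204 + 18470 + 13644513; total = 16522128;
             printf "%.2f\n", named / total }'
# prints: 0.88
```

In other words, name alignment to the three groups covers most, but not all, of the detected records.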
I realize that the snapshot sampling is a bit funny, but we can rerun with more specific dates if needed.
Note also that the number of preserved specimens with images is not guaranteed to go up . . . consistent with the fact that datasets are known to disappear or otherwise change. However, overall, it appears I was able to recreate your initial sketch. Lots of stuff to think about. Please let me know if you'd like to chat about this today.
And . . . what's going on with those mammals? A bit camera shy? Or perhaps they write their dwc-a differently than their botanist or entomologist colleagues.
Some more summary data:
```
cat content-with-still-images_*.tsv | cut -f2,3 | sort | uniq -c | mlr --implicit-csv-header --tsvlite reorder -f 4,3,2,1 | mlr --tsvlite sort -f 2 | mlr --itsvlite --omd cat
```
date | datasets with still images | prov id |
---|---|---|
2019-03-01 | 387 | 2611 |
2019-03-31 | 545 | 5a39 |
2019-06-01 | 503 | 9a41 |
2019-07-01 | 525 | b986 |
2020-08-01 | 608 | 8ff4 |
2020-11-01 | 585 | d98e |
2021-07-01 | 705 | 251f |
2021-11-01 | 587 | 83b4 |
2022-07-01 | 899 | da74 |
I would have expected this number to be a bit higher, because I suspect that most Plazi treatments have still images in them, and Plazi treatments make up most of the datasets registered in GBIF. If I remember correctly, there are over 10k Plazi datasets registered with GBIF.
```
cat content-with-still-images-and-specimen_*.tsv | cut -f2,3 | sort | uniq -c | mlr --implicit-csv-header --tsvlite reorder -f 4,3,2,1 | mlr --tsvlite sort -f 2 | mlr --itsvlite --omd cat
```
date | datasets with images and preserved specimen | prov id |
---|---|---|
2019-03-01 | 371 | 2611 |
2019-06-01 | 374 | 9a41 |
2019-07-01 | 397 | b986 |
2020-08-01 | 458 | 8ff4 |
2022-07-01 | 645 | da74 |
This shows that, in addition to an increased number of registered images for preserved specimens, the number of datasets with preserved specimens went up also. It remains to be seen which collections drove most of the additions of preserved specimen images. Perhaps a neat follow-up question to consider.
@timrobertson100 re: GBIF monthly snapshots - do you have similar snapshots of the source data that these snapshots were derived from?
Having access to your source data would help me verify the results above without adding the (useful) magic that GBIF applies to process the original data (e.g., compiling taxonomic names from various sources, scrubbing and parsing of taxonomic names, taxonomic name matching against some specific version of the GBIF backbone taxonomy, caching of no longer available datasets) to the mix, as suggested by @qgroom.