Closed nielsklazenga closed 1 month ago
Waiting on devops to allow access to EMR studio to manipulate UUID AVRO
12/06/2024 Associated Freshdesk ticket: https://support.ehelp.edu.au/a/tickets/189866
New WELT data resource was created: dr26642
24/06/2024
Current status
Databox - Avros split and data updated
Production
To do:
18/07/2024
WELT Institution and Collection codes not mapping
Tasks
18/07/2024
Original dr2153 had multiple collections associated. Each of these needs Institution Code and Provider Map
Collection Summary - in test to date
CollectionCode | CollectionID | Collection Name | Recs in dr2153 | Institution | Inst ID | Institution Code | Provider Map | |
---|---|---|---|---|---|---|---|---|
AK | co249 | Auckland War Mem. Mus. Herbarium | 302043 | Auckland War Memorial Museum | in113 | AK | Yes | |
PDD | co215 | New Zealand Fungarium Te Kohinga Hekaheka o Aotearoa | 105756 | Landcare Research NZ Ltd | in92 | PDD | Yes | |
NZFRI | co255 | National Forestry Herbarium | 28934 | Scion | in119 | NZFRI | Yes | |
CANU | co250 | Univ of Canterbury Herbarium | 17599 | University of Canterbury | in114 | CANU | Yes | |
MPN | co254 | Dame Ella Campbell Herbarium | 16374 | Massey University | in118 | MPN | Yes | |
LINC | co253 | Lincoln University Herbarium | 9295 | Lincoln University | in115 | LINC | Yes | |
UNITEC | co218 | Unitec Inst. of Tech. Herbarium | 6752 | Unitec Institute of Technology | in99 | UNITEC | Yes | |
NaN | | | 43 | | | | | |
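As a minimal sketch of how the "codes not mapping" problem could be checked, the snippet below looks up each record's institution and collection code in a provider map built from the table above. The dictionary layout and the `unmapped_codes` helper are hypothetical illustrations, not the collectory's actual API; the field names `institutionCode` and `collectionCode` are assumed to follow Darwin Core.

```python
# Hypothetical provider map: (institutionCode, collectionCode) -> collection ID.
# Values are copied from the Collection Summary table above.
PROVIDER_MAP = {
    ("AK", "AK"): "co249",
    ("PDD", "PDD"): "co215",
    ("NZFRI", "NZFRI"): "co255",
    ("CANU", "CANU"): "co250",
    ("MPN", "MPN"): "co254",
    ("LINC", "LINC"): "co253",
    ("UNITEC", "UNITEC"): "co218",
}


def unmapped_codes(records, provider_map=PROVIDER_MAP):
    """Return the sorted code pairs that have no provider map entry."""
    missing = set()
    for rec in records:
        # Treat a missing or empty code as an empty string so it is reported too.
        key = (rec.get("institutionCode") or "", rec.get("collectionCode") or "")
        if key not in provider_map:
            missing.add(key)
    return sorted(missing)
```

Running this over an extract of dr2153 would list any code combinations, including blank ones, that still need a provider map entry.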
18/07/2024
Next steps:
18/07/2024
Dataresources - in test to date
Inst ID | Inst Code | Coll. Code | Coll. ID | Coll. Name | DR | DR Name | In IPT | In GBIF | GBIF Link | IPT |
---|---|---|---|---|---|---|---|---|---|---|
in113 | AK | AK | co249 | Auckland War Mem. Mus. Herbarium | dr22714 | Auckland Museum Botany Collection | Yes | Yes | https://www.gbif.org/dataset/83ae84cf-88e4-4b5c-80b2-271a15a3e0fc | ? |
in92 | CHR | CHR | co214 | Allan Herbarium | dr22800 | Allan Herbarium | Yes | Yes | https://www.gbif.org/dataset/df582950-3b58-11dc-8c19-b8a03c50a862 | https://ipt.landcareresearch.co.nz/archive.do?r=allan_herbarium |
in92 | PDD | PDD | co215 | New Zealand Fungarium Te Kohinga Hekaheka.. | dr22783 | New Zealand Fungal and Plant Disease Collection | Yes | Yes | https://www.gbif.org/dataset/ee27b1b0-3b55-11dc-8c18-b8a03c50a862 | https://ipt.landcareresearch.co.nz/archive.do?r=new_zealand_national_fungal_herbarium_pdd |
in119 | NZFRI | NZFRI | co255 | National Forestry Herbarium | dr22784 | | | | | |
in114 | CANU | CANU | co250 | Univ of Canterbury Herbarium | dr22785 | | | | | |
in118 | MPN | MPN | co254 | Dame Ella Campbell Herbarium | dr22786 | | | | | |
in115 | LINC | LINC | co253 | Lincoln University Herbarium | dr22787 | | | | | |
in97 | UNITEC | UNITEC | co218 | Unitec Inst. of Tech. Herbarium | dr22788 | | | | | |
in82 | NMNZ | WELT | co216 | WELT Herbarium at Museum of NZ Te Papa... | dr22717 | WELT Herbarium at Museum of New Zealand Te Papa | Yes | Yes | https://www.gbif.org/dataset/cafff6a5-1fa4-4a90-a2b3-f3db78b93d02 | https://ipt.tepapa.govt.nz/ipt/archive.do?r=weltspecimens |
22/07/2024
Databox
Prod
22/07/2024 - status update
dr2153 is made up of 9 collections. Only 3 of these are currently in IPT that I can find:
Databox
All 3 of these have data in databox. However, dr2153 has not been updated to hold only CHR records, as that would mean the data for all the other collections would be missing from ALA. We need to find out the source of those, but I have no idea where to go for that. I've searched what I can in IPT and GBIF, but will try again. Provider maps are all fine.
Prod
Peggy has said that for WELT we don't need to worry about the AVRO updates, as neither dr2153 nor dr26642 will be pushed to GBIF. This was clarified in a support request that highlighted the duplication in GBIF. All that got sorted.
What I would like to do is load:
What we need to find out:
13/08/2024 Discussed with Mahmoud. No real idea what to do from here as there is no information re where the data for collections is coming from, apart from those available in IPT.
Slack message: https://atlaslivingaustralia.slack.com/archives/G0106GABXC3/p1716384230610869
Slack message says: So, DRs are:
The above is not quite correct. What we have is:
Production
None of these datasets are shared with GBIF from ALA.
13/08/2024 Dataresource status update
Code | DR | Name | IPT | Data |
---|---|---|---|---|
NZVH | dr2153 | NZ Virtual Herbarium | Original NZVH | |
AK | dr22714 | Auckland Museum Botany Collection | https://ipt2.aucklandmuseum.com:8443/ipt/archive.do?r=botany | 309,831 |
CHR | dr22800 | Allan Herbarium | https://ipt.landcareresearch.co.nz/archive.do?r=allan_herbarium | 344,846 |
PDD | dr22783 | NZ Fungal & Plant Disease Collection | https://ipt.landcareresearch.co.nz/archive.do?r=new_zealand_national_fungal_herbarium_pdd | 113,202 |
WELT | dr22717 | WELT | https://ipt.tepapa.govt.nz/ipt/archive.do?r=weltspecimens | 251,745 |
Unknown IPT data source
Code | DR | Name | IPT | Data |
---|---|---|---|---|
NZFRI | dr22784 | National Forestry Herbarium | | |
CANU | dr22785 | Univ of Canterbury Herbarium | | |
MPN | dr22786 | Dame Ella Campbell Herbarium | | |
LINC | dr22787 | Lincoln University Herbarium | | |
UNITEC | dr22788 | Unitec Inst. of Tech. Herbarium | | |

Code | DR | Name | IPT | Data |
---|---|---|---|---|
NZVH | dr2153 | NZ Virtual Herbarium | Original NZVH | |
AK | dr26650 | Auckland Museum Botany Collection | https://ipt2.aucklandmuseum.com:8443/ipt/archive.do?r=botany | 309,831 |
CHR | dr27654 | Allan Herbarium | https://ipt.landcareresearch.co.nz/archive.do?r=allan_herbarium | 344,846 |
PDD | dr26651 | NZ Fungal & Plant Disease Collection | https://ipt.landcareresearch.co.nz/archive.do?r=new_zealand_national_fungal_herbarium_pdd | 113,202 |
WELT | dr26642 | WELT | https://ipt.tepapa.govt.nz/ipt/archive.do?r=weltspecimens | 251,745 |
14/08/2024
As per the tables above, datasets for AK, CHR, PDD and WELT have been extracted via IPT:
The data source for NZFRI, CANU, MPN, LINC and UNITEC is yet to be determined.
Note: Data loads for IPT datasets are triggered via the Load Dataset DAG only, NOT preingestion. There are character encodings in the files that are problematic for preingestion. GBIF has no problem with them, and the team has advised that, given the data source is IPT, the preingestion process is not required.
19/08/2024
Record counts
27/08/2024 Mtg NK and MS
To Do:
Note: This issue to be closed and new issues raised when datasets for NZFRI, CANU, MPN, LINC, UNITEC are available
dr27654 (CHR) had not been added to the AVH Hub. I have done that now and am re-ingesting the data, so we will see tomorrow if it is there.
Hi Niels, How does a dataset get added to AVH Hub? Is it something that will need to be done for each of the datasets?
thanks Rose
In reply to: Niels Klazenga, 27 August 2024 16:28. Subject: Re: [AtlasOfLivingAustralia/data-management] Split NZVH data resource (dr2153) into individual data resources for each institution (Issue #1002)
@rosemaryjoconnor , you do that here: https://collections.ala.org.au/dataHub/show/dh9. And yes, that needs to be done for all data resources that are part of AVH. I must have done the other ones already before I went on leave.
Reload of CHR data failed in pre-ingestion, btw.
Okay, this time it worked. We'll see tomorrow.
19/09/2024
@nielsklazenga - Databox update - can you take a look and let me know if it is what you want/expect.
I don't have metadata for the collections, so the collectory DRs look very sparse. If you would like that populated in databox, any suggestions on where to find the info would be great.
Databox
I've run the code to split out the collections below from dr2153 in databox. I have not deleted any records from dr2153 in databox, as I need them there for testing. If I had been deleting, the split would leave only NZFRI in dr2153; this can be changed, of course. There are also 43 records with no collectionCode (file attached). All round, the process is straightforward.
Code | DR | Name |
---|---|---|
NZFRI | dr2153 | National Forestry Herbarium |
CANU | dr22785 | Univ of Canterbury Herbarium |
MPN | dr22786 | Dame Ella Campbell Herbarium |
LINC | dr22787 | Lincoln University Herbarium |
UNITEC | dr22788 | Unitec Inst. of Tech. Herbarium |
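For illustration only, the split described above could be sketched roughly as follows. This is not the code that was actually run: the `split_by_collection` helper and the sample rows are hypothetical, and the column names `occurrenceID` and `collectionCode` are assumed to match the extract. The code-to-DR mapping is taken from the table above.

```python
import csv
import io
from collections import defaultdict

# Target DR per collectionCode, taken from the table above.
# NZFRI stays in dr2153 under the split as described.
TARGET_DR = {
    "NZFRI": "dr2153",
    "CANU": "dr22785",
    "MPN": "dr22786",
    "LINC": "dr22787",
    "UNITEC": "dr22788",
}


def split_by_collection(rows, target_dr=TARGET_DR):
    """Group rows by target DR; blank or unknown codes go to `unassigned`."""
    by_dr = defaultdict(list)
    unassigned = []
    for row in rows:
        code = (row.get("collectionCode") or "").strip()
        dr = target_dr.get(code)
        if dr is None:
            unassigned.append(row)
        else:
            by_dr[dr].append(row)
    return dict(by_dr), unassigned


# Made-up sample rows standing in for a dr2153 extract.
sample = io.StringIO(
    "occurrenceID,collectionCode\n"
    "a1,CANU\n"
    "a2,NZFRI\n"
    "a3,\n"
)
by_dr, unassigned = split_by_collection(csv.DictReader(sample))
```

Rows that end up in `unassigned` correspond to the records with no collectionCode, which would need to be reviewed by hand rather than assigned to a DR.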
Unique identifiers and UUID Avro updates
Based on discussions while you were away, UUID Avro updates would only be needed if the source of the above is not IPT and the data would need to be pushed to GBIF. However, I will finalise the code for it this week or early next week and thoroughly test the process in databox, just in case we need it. That will mean deleting one of the collections from dr2153 to test, but that's ok. It's a little more complex than just changing the DR on the uniqueKey in the Avro, as I have to extract the correct Avro records based on occurrenceIDs for each specific collection. I've done this previously for WELT; this time it will be set up to work for all of the collections above in one go.
I can incorporate some of Mahmoud's code to back up the current Avros on S3 and do the upload to S3 directly.
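A rough sketch of the re-keying step described above, under stated assumptions: UUID records are shown as plain dicts rather than read from Avro files, and the uniqueKey is assumed to have the form `<DR>|<occurrenceID>`. The `reassign_unique_keys` helper and the field layout are hypothetical and only illustrate selecting records by occurrenceID and swapping the DR prefix.

```python
def reassign_unique_keys(uuid_records, occurrence_ids, old_dr, new_dr):
    """Split records into (moved, kept).

    `moved` holds copies of the records whose occurrenceID belongs to the
    collection, with the DR prefix of uniqueKey changed to `new_dr`.
    `kept` holds every other record unchanged.
    """
    prefix = old_dr + "|"  # assumed key layout: "<DR>|<occurrenceID>"
    moved, kept = [], []
    for rec in uuid_records:
        key = rec["uniqueKey"]
        occ_id = key[len(prefix):] if key.startswith(prefix) else None
        if occ_id is not None and occ_id in occurrence_ids:
            # Copy the record so the input list is left untouched.
            moved.append({**rec, "uniqueKey": f"{new_dr}|{occ_id}"})
        else:
            kept.append(rec)
    return moved, kept
```

Keeping the original records untouched and returning copies makes it easier to compare before and after, which fits with backing up the current Avros before any upload.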
Attached: CSV file of records with no collectionCode: nan-collectionCode.csv
Hi @rosemaryjoconnor, there is no need to do anything with the UUIDs, as we do not deliver this data to GBIF and there is a good chance that when new datasets are delivered the catalogNumbers will have changed anyway. I was just going to set up a new DR when a collection provides us with a new dataset and remove that collection's data from dr2153 and eventually delete dr2153 (or leave it empty).
@nielsklazenga great Niels. I'll just leave it all as-is then. So no need to set up any DRs at this stage. If you do want any of the code just let me know, it will be in github anyway.
24/09/2024
Databox
New data for NZFRI is available:
Production
26/09/2024
Prod
Problem
Solution