AtlasOfLivingAustralia / data-management

Data management issue tracking

Split NZVH data resource (dr2153) into individual data resources for each institution #1002

Closed nielsklazenga closed 1 month ago

nielsklazenga commented 11 months ago

Problem

Solution

peggynewman commented 10 months ago

Dependent on https://github.com/AtlasOfLivingAustralia/preingestion/issues/228

peggynewman commented 7 months ago

Waiting on DevOps to grant access to EMR Studio so that the UUID Avro files can be manipulated.

rosemaryjoconnor commented 4 months ago

12/06/2024 Associated Freshdesk ticket: https://support.ehelp.edu.au/a/tickets/189866

New WELT data resource was created: dr26642

rosemaryjoconnor commented 4 months ago

24/06/2024

Current status

Databox - Avros split and data updated

Production

To do:

rosemaryjoconnor commented 3 months ago

18/07/2024

  1. Allan Herbarium - dr2153 now loading in production with IPT data
  2. WELT Herbarium - dr26642 now loading in production with IPT data

WELT Institution and Collection codes not mapping

Tasks

rosemaryjoconnor commented 3 months ago

18/07/2024

The original dr2153 had multiple collections associated with it. Each of these needs an Institution Code and a Provider Map.

Collection Summary - in test to date

| CollectionCode | CollectionID | Collection Name | Recs in dr2153 | Institution | Inst ID | Institution Code | Provider Map |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AK | co249 | Auckland War Mem. Mus. Herbarium | 302043 | Auckland War Memorial Museum | in113 | AK | Yes |
| PDD | co215 | New Zealand Fungarium Te Kohinga Hekaheka o Aotearoa | 105756 | Landcare Research NZ Ltd | in92 | PDD | Yes |
| NZFRI | co255 | National Forestry Herbarium | 28934 | Scion | in119 | NZFRI | Yes |
| CANU | co250 | Univ of Canterbury Herbarium | 17599 | University of Canterbury | in114 | CANU | Yes |
| MPN | co254 | Dame Ella Campbell Herbarium | 16374 | Massey University | in118 | MPN | Yes |
| LINC | co253 | Lincoln University Herbarium | 9295 | Lincoln University | in115 | LINC | Yes |
| UNITEC | co218 | Unitec Inst. of Tech. Herbarium | 6752 | Unitec Institute of Technology | in99 | UNITEC | Yes |
| NaN | | | 43 | | | | |

rosemaryjoconnor commented 3 months ago

18/07/2024

Next steps:

rosemaryjoconnor commented 3 months ago

18/07/2024

Dataresources - in test to date

| Inst ID | Inst Code | Coll. Code | Coll. ID | Coll. Name | DR | DR Name | In IPT | In GBIF | GBIF Link | IPT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| in113 | AK | AK | co249 | Auckland War Mem. Mus. Herbarium | dr22714 | Auckland Museum Botany Collection | Yes | Yes | https://www.gbif.org/dataset/83ae84cf-88e4-4b5c-80b2-271a15a3e0fc | ? |
| in92 | CHR | CHR | co214 | Allan Herbarium | dr22800 | Allan Herbarium | Yes | Yes | https://www.gbif.org/dataset/df582950-3b58-11dc-8c19-b8a03c50a862 | https://ipt.landcareresearch.co.nz/archive.do?r=new_zealand_national_fungal_herbarium_pdd |
| in92 | PDD | PDD | co215 | New Zealand Fungarium Te Kohinga Hekaheka.. | dr22783 | New Zealand Fungal and Plant Disease Collection | Yes | Yes | https://www.gbif.org/dataset/ee27b1b0-3b55-11dc-8c18-b8a03c50a862 | https://ipt.landcareresearch.co.nz/archive.do?r=new_zealand_national_fungal_herbarium_pdd |
| in119 | NZFRI | NZFRI | co255 | National Forestry Herbarium | dr22784 | | | | | |
| in114 | CANU | CANU | co250 | Univ of Canterbury Herbarium | dr22785 | | | | | |
| in118 | MPN | MPN | co254 | Dame Ella Campbell Herbarium | dr22786 | | | | | |
| in115 | LINC | LINC | co253 | Lincoln University Herbarium | dr22787 | | | | | |
| in97 | UNITEC | UNITEC | co218 | Unitec Inst. of Tech. Herbarium | dr22788 | | | | | |
| in82 | NMNZ | WELT | co216 | WELT Herbarium at Museum of NZ Te Papa... | dr22717 | WELT Herbarium at Museum of New Zealand Te Papa | Yes | Yes | https://www.gbif.org/dataset/cafff6a5-1fa4-4a90-a2b3-f3db78b93d02 | https://ipt.tepapa.govt.nz/ipt/archive.do?r=weltspecimens |

rosemaryjoconnor commented 3 months ago

22/07/2024

Databox

Prod

rosemaryjoconnor commented 3 months ago

22/07/2024 - status update

dr2153 is made up of 9 collections. Only 3 of these are currently in IPT that I can find:

Databox

All 3 of these have data in databox. However, dr2153 has not been updated to hold only CHR records, as that would mean the data for all the other collections would be missing from ALA. We need to find out the source of those, but I have no idea where to go for that. I've searched what I can in IPT and GBIF, but will try again. Provider maps are all fine.

Prod

Peggy has said that for WELT we don't need to worry about the AVRO updates, as neither dr2153 nor dr26642 will be pushed to GBIF. This was clarified in the support request that highlighted the duplication in GBIF. All that got sorted.

What I would like to do is load:

What we need to find out:

rosemaryjoconnor commented 2 months ago

13/08/2024 Discussed with Mahmoud. No real idea what to do from here as there is no information re where the data for collections is coming from, apart from those available in IPT.

Slack message: https://atlaslivingaustralia.slack.com/archives/G0106GABXC3/p1716384230610869

The Slack message says: So, DRs are:

The above is not quite correct. What we have is:

Production

None of these datasets are shared with GBIF from ALA.

rosemaryjoconnor commented 2 months ago

13/08/2024 Dataresource status update

Databox/Test

| Code | DR | Name | IPT | Data |
| --- | --- | --- | --- | --- |
| NZVH | dr2153 | NZ Virtual Herbarium | Original NZVH | |
| AK | dr22714 | Auckland Museum Botany Collection | https://ipt2.aucklandmuseum.com:8443/ipt/archive.do?r=botany | 309,831 |
| CHR | dr22800 | Allan Herbarium | https://ipt.landcareresearch.co.nz/archive.do?r=allan_herbarium | 344,846 |
| PDD | dr22783 | NZ Fungal & Plant Disease Collection | https://ipt.landcareresearch.co.nz/archive.do?r=new_zealand_national_fungal_herbarium_pdd | 113,202 |
| WELT | dr22717 | WELT | https://ipt.tepapa.govt.nz/ipt/archive.do?r=weltspecimens | 251,745 |

Unknown IPT data source

| Code | DR | Name | IPT | Data |
| --- | --- | --- | --- | --- |
| NZFRI | dr22784 | National Forestry Herbarium | | |
| CANU | dr22785 | Univ of Canterbury Herbarium | | |
| MPN | dr22786 | Dame Ella Campbell Herbarium | | |
| LINC | dr22787 | Lincoln University Herbarium | | |
| UNITEC | dr22788 | Unitec Inst. of Tech. Herbarium | | |

Production

| Code | DR | Name | IPT | Data |
| --- | --- | --- | --- | --- |
| NZVH | dr2153 | NZ Virtual Herbarium | Original NZVH | |
| AK | dr26650 | Auckland Museum Botany Collection | https://ipt2.aucklandmuseum.com:8443/ipt/archive.do?r=botany | 309,831 |
| CHR | dr27654 | Allan Herbarium | https://ipt.landcareresearch.co.nz/archive.do?r=allan_herbarium | 344,846 |
| PDD | dr26651 | NZ Fungal & Plant Disease Collection | https://ipt.landcareresearch.co.nz/archive.do?r=new_zealand_national_fungal_herbarium_pdd | 113,202 |
| WELT | dr26642 | WELT | https://ipt.tepapa.govt.nz/ipt/archive.do?r=weltspecimens | 251,745 |

rosemaryjoconnor commented 2 months ago

14/08/2024

Dataset status

As per the tables above, datasets for AK, CHR, PDD and WELT have been extracted via IPT:

Data source for NZFRI, CANU, MPN, LINC and UNITEC is yet to be determined.

Note: Data loads for IPT datasets are triggered via the Load Dataset DAG only, NOT preingestion. There are character encodings in the files that are problematic with preingestion. GBIF has no problem with them, and the team has advised that, given the data source is IPT, the preingestion process is not required.

Dataresource dr2153 - NZ Herbarium

rosemaryjoconnor commented 2 months ago

19/08/2024

Record counts

rosemaryjoconnor commented 2 months ago

27/08/2024 Mtg NK and MS

To Do:

Note: This issue to be closed and new issues raised when datasets for NZFRI, CANU, MPN, LINC, UNITEC are available

nielsklazenga commented 2 months ago

dr27654 (CHR) had not been added to the AVH Hub. I have done that now and am re-ingesting the data, so we will see tomorrow if it is there.

rosemaryjoconnor commented 2 months ago

Hi Niels, How does a dataset get added to AVH Hub? Is it something that will need to be done for each of the datasets?

thanks Rose



nielsklazenga commented 2 months ago

@rosemaryjoconnor , you do that here: https://collections.ala.org.au/dataHub/show/dh9. And yes, that needs to be done for all data resources that are part of AVH. I must have done the other ones already before I went on leave.

Reload of CHR data failed in pre-ingestion, btw.

nielsklazenga commented 2 months ago

Okay, this time it worked. We'll see tomorrow.

rosemaryjoconnor commented 1 month ago

19/09/2024

@nielsklazenga - Databox update - can you take a look and let me know if it is what you want/expect.

I don't have metadata for the collections, so the collectory DRs look very sparse. If you would like that populated in databox, any suggestions on where to find the info would be great.

Databox

I've run the code to split out the collections below from dr2153 in databox. I have not deleted any records from dr2153 in databox, as I need them there for testing. If I had been deleting, the split would leave only NZFRI in dr2153; this can be changed, of course. There are also 43 records with no collectionCode (file attached). All round, the process is straightforward.

| Code | DR | Name |
| --- | --- | --- |
| NZFRI | dr2153 | National Forestry Herbarium |
| CANU | dr22785 | Univ of Canterbury Herbarium |
| MPN | dr22786 | Dame Ella Campbell Herbarium |
| LINC | dr22787 | Lincoln University Herbarium |
| UNITEC | dr22788 | Unitec Inst. of Tech. Herbarium |
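
For anyone following along, the split described above can be sketched roughly as follows. This is not the code that was actually run in databox; it is a minimal illustration that assumes the occurrence records are available as plain dictionaries with a `collectionCode` field. The function name `split_by_collection` and the `TARGET_DR` mapping are hypothetical, with the DR ids copied from the table above.

```python
from collections import defaultdict

# Hypothetical mapping of collectionCode -> target data resource,
# copied from the table above (NZFRI stays in dr2153).
TARGET_DR = {
    "NZFRI": "dr2153",
    "CANU": "dr22785",
    "MPN": "dr22786",
    "LINC": "dr22787",
    "UNITEC": "dr22788",
}


def split_by_collection(records, target_dr=TARGET_DR):
    """Group records by target DR.

    Returns (groups, unassigned), where `unassigned` holds records whose
    collectionCode is missing, NaN, or not present in the mapping.
    """
    groups = defaultdict(list)
    unassigned = []
    for rec in records:
        code = rec.get("collectionCode")
        # A missing value may arrive as None or as a float NaN (e.g. from a CSV export).
        code = code.strip() if isinstance(code, str) else ""
        dr = target_dr.get(code)
        if dr is None:
            unassigned.append(rec)
        else:
            groups[dr].append(rec)
    return dict(groups), unassigned
```

Records that end up in `unassigned` correspond to the kind of rows listed in the attached CSV of records with no collectionCode.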

Unique identifiers and UUID Avro updates

Based on discussions while you were away, UUID avro updates would only be needed if the source of the above is not IPT and the data would need to be pushed to GBIF. However, I will finalise the code for it this week or early next week and thoroughly test the process in databox, just in case we need it. That will mean deleting one of the collections from dr2153 to test, but that's ok. It's a little more complex than just changing the DR on the uniqueKey in the avro, as I have to extract the correct avro records based on occurrenceIDs for each specific collection. I've done this previously for WELT; this time it will just be set up to work for all of the collections above in one go.
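
As a rough illustration of the re-keying step described above (not the actual code used for WELT), the sketch below works on UUID records that have already been read into Python dictionaries. It assumes each record has a `uniqueKey` of the form `<dr>|<occurrenceID>` and a `uuid` field; that key layout, the separator, and the function name `rekey_uuid_records` are assumptions made for the example, not confirmed details of the avro files.

```python
def rekey_uuid_records(uuid_records, occurrence_ids, old_dr, new_dr, sep="|"):
    """Move the UUID records of one collection from old_dr to new_dr.

    Assumes uniqueKey looks like "<dr><sep><occurrenceID>" (an assumption).
    Returns (moved, kept): `moved` are copies with the new DR prefix,
    `kept` are the untouched records that stay with the old DR.
    """
    prefix = old_dr + sep
    moved, kept = [], []
    for rec in uuid_records:
        key = rec["uniqueKey"]
        occ_id = key[len(prefix):] if key.startswith(prefix) else None
        if occ_id is not None and occ_id in occurrence_ids:
            # Copy the record so the uuid is preserved and the input is not mutated.
            moved.append({**rec, "uniqueKey": new_dr + sep + occ_id})
        else:
            kept.append(rec)
    return moved, kept
```

The set of `occurrence_ids` would come from the records of a single collection, so one call per collection and target DR covers all of the collections above.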

I can incorporate some of Mahmoud's code to back up the current Avros on S3 and do the upload to S3 directly.

Attached: CSV file of records with no collectionCode: nan-collectionCode.csv

nielsklazenga commented 1 month ago

Hi @rosemaryjoconnor , there is no need to do anything with the UUIDs, as we do not deliver this data to GBIF and there is a good chance that when new datasets are delivered the catalogNumbers will have changed anyway. I was just going to set up a new DR when a collection provides us with a new dataset and remove that collection's data from dr2153 and eventually delete dr2153 (or leave it empty).

rosemaryjoconnor commented 1 month ago

@nielsklazenga great Niels. I'll just leave it all as-is then. So no need to set up any DRs at this stage. If you do want any of the code just let me know, it will be in github anyway.

rosemaryjoconnor commented 1 month ago

24/09/2024

Databox

New data for NZFRI is available:

Production

rosemaryjoconnor commented 1 month ago

26/09/2024

Prod