Migrate OpenData datasets to Zenodo

eliagbayani commented 3 months ago

Steps:

1. Retrieve opendata.eol.org datasets using the CKAN API.
2. Decide which fields from the CKAN API correspond to which fields in the Zenodo API.
3. Create Zenodo datasets using the Zenodo API.
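
The field-assignment step could be sketched roughly as below. This is only an illustration, not the script used for the migration. It assumes the CKAN dataset dict has the usual `title`, `notes`, `tags`, and `organization` keys, and that the target is a Zenodo-style metadata dict with `upload_type`, `title`, `description`, and `keywords`. The function name and the particular mapping choices are hypothetical.

```python
def ckan_to_zenodo_metadata(package, resource):
    """Map one CKAN dataset/resource pair to a Zenodo-style metadata dict.

    The mapping below is an assumption made for illustration only.
    """
    org_title = (package.get("organization") or {}).get("title", "")
    dataset_title = package.get("title", "")
    resource_name = resource.get("name") or dataset_title

    # Carry the CKAN tags over, then record the organization and dataset
    # as keywords so the original grouping is not lost.
    keywords = [tag["name"] for tag in package.get("tags", [])]
    if org_title:
        keywords.append(org_title)
    if org_title and dataset_title:
        keywords.append(f"{org_title}: {dataset_title}")
    if resource.get("format"):
        keywords.append(f"format: {resource['format']}")

    return {
        "upload_type": "dataset",
        "title": f"{dataset_title}: {resource_name}",
        "description": package.get("notes") or dataset_title,
        "keywords": keywords,
    }


# Hypothetical input, shaped like a CKAN dataset and one of its resources.
example_package = {
    "title": "Water Body Checklists",
    "notes": "Checklists by water body.",
    "tags": [{"name": "fish"}],
    "organization": {"title": "EOL Content Partners"},
}
example_resource = {"name": "Baltic Sea", "format": "ZIP"}
example_metadata = ckan_to_zenodo_metadata(example_package, example_resource)
```

The resulting dict would then be sent as the metadata of a new Zenodo deposition in step 3.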

eliagbayani commented 3 months ago

Hi Jen @jhammock, attached is a list of private datasets from our five organizations in opendata.eol.org. I will exclude these datasets from the migration to Zenodo, unless you want me to include some from the list. Thanks. private_datasets.txt

jhammock commented 3 months ago

Ah yes, we'll need to decide what to do about those. I expect the files used in resource connectors should go into the new docker container. I'll review the "old resources"; possibly those can go into Zenodo as well, but I'll check them individually.

eliagbayani commented 3 months ago

@jhammock @KatjaSchulz All broken URLs in opendata.eol.org are now once more accessible. That is, those URLs written in this long format (previously broken) are now accessible:

In the actual OpenData resource record, the URL is now transformed in this format (shorter):

Nonetheless, both URL formats are accessible, so we won't get this type of alert anymore.

I needed this done before I migrate anything to Zenodo. Admittedly, fixing the broken long URLs was an accident when I made the shorter URLs work :-) Thanks.

jhammock commented 3 months ago

Thanks for the update, Eli! I'll appreciate not having that to worry about until we're migrated :)

KatjaSchulz commented 3 months ago

Wonderful! Thanks Eli.

eliagbayani commented 3 months ago

Hi Jen, @jhammock

After running a bulk migration, plus manual migrations for a number of records that initially failed, all public datasets/resources from two organizations are now in Zenodo under the EOL Community:

Aggregate Datasets
EOL Dynamic Hierarchy Data Sets

For review.
I used keywords (Subjects) to provide navigation to our resources, since Zenodo has no equivalent of the organization -> dataset -> resource levels.

Thanks.

eliagbayani commented 3 months ago

@jhammock @KatjaSchulz Tip: If you know the complete title of your record in Zenodo and want to search for it, paste this in the search box: title:("Your Complete Title")

You can also search by Subject: subject("EOL Content Partners: Water Body Checklists")

More search tips here: https://help.zenodo.org/guides/search/

eliagbayani commented 2 months ago

@jhammock @KatjaSchulz Update: I generated an HTML page that will, for now, help us navigate the individual (public) resources in Zenodo. The page is organized using OpenData's original sections: organizations -> datasets -> resources. Zenodo doesn't have these types of sections. opendata_zenodo.html.zip Please unzip to get the HTML page. Thanks.

eliagbayani commented 2 months ago

@jhammock @KatjaSchulz All public datasets are now in Zenodo. I have not yet moved the private datasets from opendata.eol.org to Zenodo. Do we need to do that? If we do move them, they will take the 'restricted' option in Zenodo. Restricted means the record is publicly accessible, but its files are available only to users with access. Thanks.

jhammock commented 2 months ago

I think that status ought to suit most if not all such cases. @KatjaSchulz , we should both check, I suppose. If there's something we don't want to even announce that we have, we can move it offline for now.

eliagbayani commented 2 months ago

1st private record (restricted), e.g. WoRMS internal: World Register of Marine Species. The 'Restricted' status works as intended: if you're not logged in, you will not be able to download the file. Will continue with the others.

eliagbayani commented 2 months ago

Status: From: Aug 28

Requested private resources now migrated to Zenodo (n=9).
One requested item no longer seems to exist in opendata.eol.org. [Dataset test 2019: dataset-test-2019]

jhammock commented 2 months ago

No concerns about the test dataset. It may not be the one currently in use, and we can always make up another.

jhammock commented 2 months ago

@eliagbayani I'm trying to orient myself to the zenodo interface. Can you explain this to me?

https://zenodo.org/records/13253933/files/13253933.dat?download=1

It's listed under "Files" at https://zenodo.org/records/13253933

eliagbayani commented 2 months ago

@jhammock The .dat file is a temporary placeholder I used when the main file was not available during the migration. In this case the main file is: https://eol.org/data/full_provider_ids.csv.gz I assume this file was inaccessible at the time of migration, even after a number of tries, so the script fell back to the .dat file in order for the record to be published.
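
A minimal sketch of that fallback, assuming the download is retried a fixed number of times before giving up. The function name, the `fetch` callable, the retry count, and the placeholder's name and content are all hypothetical, not taken from the migration script.

```python
def choose_upload(url, fetch, max_tries=3, placeholder_name="placeholder.dat"):
    """Return (filename, content, used_placeholder) for a record's file.

    `fetch` is any callable that takes a URL and returns bytes, raising
    OSError when the download fails.
    """
    for _ in range(max_tries):
        try:
            content = fetch(url)
        except OSError:
            continue  # count a network error as one failed try
        if content:
            filename = url.rstrip("/").rsplit("/", 1)[-1]
            return filename, content, False
    # Every try failed: use a small placeholder so the record can still be
    # published. Storing the source URL in it is an assumption.
    return placeholder_name, url.encode("utf-8"), True


# Hypothetical fetcher that always fails, to exercise the fallback path.
def always_down(url):
    raise OSError("host unreachable")


fallback_result = choose_upload(
    "https://eol.org/data/full_provider_ids.csv.gz", always_down
)
```

In practice `fetch` could wrap `urllib.request.urlopen(url).read()`, whose `URLError` is a subclass of `OSError`.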

jhammock commented 2 months ago

So the plan is for the intended files to replace the temp file ultimately, wherever it appears? Is manual editing needed?

eliagbayani commented 2 months ago

Yes, this one needs manual editing:
1. Click [New Version].
2. Upload the desired file, then click [Upload files].
3. Enter the Publication date.
4. Finally, click [Publish].

eliagbayani commented 2 months ago

@jhammock, here is the New Version that you initiated but did not complete: https://zenodo.org/uploads/13741713 Just in case you are looking for it.

jhammock commented 2 months ago

I can't remember starting that process so I discarded it. Just checking:

The plan is to have zenodo host the files?
My uploading it now will not interfere with your ability to update it automatically later? I believe this file is updated on a regular schedule. I think you're dealing with connector-based resources first, but in due course I presume the aggregated data files will be equipped for updates also.

eliagbayani commented 2 months ago

@jhammock case 1 - Yes, eventually Zenodo can host the files. And yes, your uploading it now will not interfere with my ability to update it automatically later. If case 1 is met, we don't need a .dat file anymore.

case 2 - Or we provide just the URL, e.g. https://eol.org/data/full_provider_ids.csv.gz, as metadata in the Zenodo record. If case 2 is met, we need to have a .dat file or any other file (I chose .dat) uploaded in order to publish the Zenodo record.

jhammock commented 2 months ago

OK, I can see advantages to both cases, but if zenodo policy permits, I think I crave the redundancy of them hosting a copy of all files we list there. We'd presumably also have one of everything, eventually in your new docker instance, @eliagbayani . @KatjaSchulz do you concur?

eliagbayani commented 2 months ago

Yes I vote for redundancy as well. Thanks.

jhammock commented 2 months ago

Okay, I am getting familiar with zenodo metadata edits. I gather a new version of a resource is only required when the files associated with the record are changed. I have created v2 of the identifier map. I have also messed with some of the metadata, in several subsequent edits, and learned that this can be done while preserving the same version-specific doi. Yay!

@KatjaSchulz you should definitely review this one because I named you as the creator. You may prefer to name an institution, which is an option, or to name several creators. I am implicated also for the moment, in the contributor category, as a "contact person". We should probably hash out a policy about this kind of metadata in the zenodo context; the aggregate datasets will probably be case by case, but for the resource files we should be able to do something consistent- or a few different consistent things over different kinds of resources.

KatjaSchulz commented 1 month ago

Thanks Eli, this will be very useful.

eliagbayani commented 1 month ago

@jhammock @KatjaSchulz Attached is a list of records where files are saved elsewhere (n=56). If I'm not mistaken, all should have a .dat file as their uploaded file, except for one:
[title] => identifier map: current version
[URL] => https://eol.org/data/full_provider_ids.csv.gz
[Zenodo] => https://zenodo.org/records/13253933

Where its latest version is now: EOL full taxon identifier map https://zenodo.org/records/13751009

Jen, a question: do you want me to create and run a script that checks whether each URL is valid and, if so, uploads the actual file to its respective Zenodo record? A new version of the record (Version 2) will of course be created to hold the uploaded file. If the URL is broken, I won't change anything.

Or do you want these records handled manually by you and Katja? Thanks. FilesSavedElsewhere.txt https://github.com/user-attachments/files/17046039/FilesSavedElsewhere.txt
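
The URL-checking part of such a script might look roughly like this. It is a sketch under assumptions: the record dicts use hypothetical `title`, `url`, and `zenodo` keys mirroring the attached list, a HEAD request is assumed to be enough to tell whether a file is still available, and the actual upload of a new version is left out.

```python
import urllib.request


def url_is_reachable(url, timeout=30):
    """Return True if a HEAD request to `url` succeeds."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers strings that are not URLs at all.
        return False


def plan_file_uploads(records, is_reachable=url_is_reachable):
    """Split records into those to get a new version and those left alone."""
    to_update, untouched = [], []
    for record in records:
        if is_reachable(record["url"]):
            to_update.append(record)
        else:
            untouched.append(record)  # broken URL: change nothing
    return to_update, untouched
```

Passing the reachability check in as a parameter makes it easy to dry-run the plan against the list before any record is touched.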

jhammock commented 1 month ago

Thanks, Eli!

Give us a moment to go through this list; at a glance a couple of these may just be odd ducks to be archived, or otherwise treated differently. I expect most of them will want that script, on a regular schedule.

More soon!

Jen

KatjaSchulz commented 1 month ago

Hi Eli,

Jen & I just did a deep-dive on Zenodo and came up with a list of things we would like to change. Here are the things we hope you can do through the API:

Agents
1. For records that have Hosting institution: Anne Thessen under Contributors, remove the Contributors record, remove the "script (Zenodo API)" Creator and add the following as the new Creator:
  - Person
  - Name: Anne Thessen [important: do not link to any identifiers]
  - Affiliations: Encyclopedia of Life
  - Role: Data Manager
2. For all other records that have "script (Zenodo API)" as the Creator, remove this Creator and add the following as the new Creator:
  - Organization
  - Name: Encyclopedia of Life
  - Role: Hosting Institution
3. Remove all remaining Contributors with Role: Hosting Institution.
Keywords & subjects
1. For all data sets with keyword "EOL Content Partners: National Checklists 2019" or "EOL Content Partners: Water Body Checklists 2019" add keyword "deprecated"
2. Remove all keywords with the prefix "format:", e.g., "format: ZIP", "format: TAR", "format: XML", etc.
Notes: It looks like the Notes field in Zenodo currently contains a combination of the OpenData resource and organization description. We would like to handle this in a different way:
1. Please move the content that's currently in the Zenodo Notes field to the Description field instead. If there is already content in the Description field, append the content from the Notes field.
2. Please entirely remove this text from all Notes, i.e., do not include it in the text appended to the Description: "This is where EOL hosts source datasets (archives, dumps, etc.) from EOL content partners (especially partners without a web presence of their own). This organization will also include the content partner utility files EOL connectors use to generate a particular content partners resource EOL archive or XML. For questions or suggestions please visit the EOL Services forum at http://discuss.eol.org/c/eol-services ####--- EOL DwCA resource last updated: .... ---####"
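
As a way to pin down the rules above, here is a rough sketch of them applied to a single record. It works on a simplified metadata dict, not on Zenodo's actual API payload: the `type` and `role` keys, the function name, and passing the text to strip as a `boilerplate` argument are assumptions for illustration.

```python
SCRIPT_CREATOR = "script (Zenodo API)"
HOSTING = "Hosting Institution"
DEPRECATED_TRIGGERS = {
    "EOL Content Partners: National Checklists 2019",
    "EOL Content Partners: Water Body Checklists 2019",
}


def apply_cleanup(metadata, boilerplate=""):
    """Return a cleaned copy of a simplified record-metadata dict."""
    creators = [dict(c) for c in metadata.get("creators", [])]
    contributors = [dict(c) for c in metadata.get("contributors", [])]

    def is_anne_host(c):
        return c.get("name") == "Anne Thessen" and c.get("role") == HOSTING

    has_script_creator = any(c.get("name") == SCRIPT_CREATOR for c in creators)
    if any(is_anne_host(c) for c in contributors):
        # Agents, rule 1: Anne Thessen becomes the Creator.
        contributors = [c for c in contributors if not is_anne_host(c)]
        creators = [c for c in creators if c.get("name") != SCRIPT_CREATOR]
        creators.append({"type": "Person", "name": "Anne Thessen",
                         "affiliation": "Encyclopedia of Life",
                         "role": "Data Manager"})
    elif has_script_creator:
        # Agents, rule 2: EOL replaces the script as Creator.
        creators = [c for c in creators if c.get("name") != SCRIPT_CREATOR]
        creators.append({"type": "Organization",
                         "name": "Encyclopedia of Life", "role": HOSTING})
    # Agents, rule 3: drop remaining Hosting Institution contributors.
    contributors = [c for c in contributors if c.get("role") != HOSTING]

    # Keywords: drop "format:" keywords, flag the 2019 checklists.
    keywords = [k for k in metadata.get("keywords", [])
                if not k.startswith("format:")]
    if DEPRECATED_TRIGGERS & set(keywords) and "deprecated" not in keywords:
        keywords.append("deprecated")

    # Notes: strip the boilerplate, then append what is left to Description.
    notes = metadata.get("notes", "")
    if boilerplate:
        notes = notes.replace(boilerplate, "")
    notes = notes.strip()
    description = metadata.get("description", "").strip()
    if notes:
        description = f"{description}\n\n{notes}" if description else notes

    cleaned = dict(metadata)
    cleaned.update(creators=creators, contributors=contributors,
                   keywords=keywords, description=description, notes="")
    return cleaned
```

Because the boilerplate ends with a variable "last updated" stamp, a real script would probably need a pattern match for that part instead of an exact string.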

Let us know if you have any questions.

eliagbayani commented 1 month ago

Hi Jen, @jhammock These are the 7 records under the EOL computer vision pipelines

I think I set these records initially to 'Restricted'. I'm not sure if my recent bulk updates have accidentally set these to 'Public'. Or have you set these to 'Public'? If not I'll just set them back to 'Restricted'. Thanks.

eliagbayani commented 1 month ago

@KatjaSchulz @jhammock The script finished doing the bulk updates. Zenodo. As I mentioned before, Zenodo's 'write API' seems to lag behind what the interface can do. For one, the API cannot set a Creator to be of type 'Organization'; it always defaults to 'Personal'. Also, the API cannot set the 'role' of a Creator, but it CAN set the 'role' of a Contributor.

Another API limitation is that it cannot assign identifiers (e.g. ORCID) to Creators and Contributors.

Anyway, the rest of the requirements were met fine.

Also, I removed all Contributors with my name, 'Eli Agbayani'. These are just remnants of the old CKAN framework. But I set others like 'Jen Hammock' or 'Sarah Miller' as 'Contact Person'. Please tell me if we need to change this. And, as proposed, I set 'Anne Thessen' as 'Data Manager'. Thanks.

eliagbayani commented 1 month ago

Note to Eli (to do): This has been the case from the start. Find a way to use the apostrophe in API commands. It causes the API call to fail even when it is escaped. For now I replace the apostrophe with 2 underscores "__".

So instead of: I'm going.
It is saved as: I__m going.

It seems like there should be an easy solution and the API should be able to handle it, but I haven't found the solution yet.
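
One possible angle, offered only as a guess since the cause isn't identified here: if the failure comes from quoting when the JSON body is placed inside a shell command, building the body with a JSON encoder avoids it, because an apostrophe needs no escaping in JSON. The sketch below assumes a request body of the form `{"metadata": {...}}`; the function name is hypothetical.

```python
import json
import shlex


def build_metadata_body(title, description):
    """Serialize record metadata to a JSON string; apostrophes pass through."""
    payload = {"metadata": {"title": title, "description": description}}
    return json.dumps(payload)


body = build_metadata_body("I'm going.", "It's a test.")
# The apostrophe survives a round trip without any substitution.
round_tripped = json.loads(body)["metadata"]["title"]

# If the body has to pass through a shell command line, shlex.quote
# produces a form that a POSIX shell reads back as the same string.
shell_safe_body = shlex.quote(body)
```

If this holds up, the "__" substitution would no longer be needed, but that is untested against the Zenodo API.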

KatjaSchulz commented 1 month ago

Thanks Eli, It's unfortunate that the Zenodo API has these limitations, but none of them are a huge deal. Stay tuned for some more bulk tag updates. Hopefully those will be easy.

jhammock commented 1 month ago

> Hi Jen, @jhammock These are the 7 records under the EOL computer vision pipelines
>
> I think I set these records initially to 'Restricted'. I'm not sure if my recent bulk updates have accidentally set these to 'Public'. Or have you set these to 'Public'? If not I'll just set them back to 'Restricted'. Thanks.

I did set them to public, Eli, thanks for checking. Katie was inquiring about them; some colleagues of hers were interested in having a look.

eliagbayani commented 1 month ago

Noted Jen. No worries, will leave them as 'Public' then. Thanks.

eliagbayani commented 1 month ago

Update: Not quitting just yet. We can now use the API (bulk updates) to update Creators and Contributors with their identifiers. Identifiers include ORCID and GND but not ISNI. Attached is an example: Just a sample

@KatjaSchulz , yes, please send me the proposed bulk updates; hopefully they are doable. Thanks.

KatjaSchulz commented 1 month ago

Hi Eli,

Could you please do a few more tag clean-ups?

Please add the tag "geography" to data sets that currently have one of the following tags:

EOL Content Partners: Arctic Biodiversity
EOL Content Partners: National Checklists
EOL Content Partners: Water Body Checklists

Please add the tag "descriptions" to data sets that currently have the tag "EOL Content Partners: Wikipedia". Also, we think it would make sense if you added yourself as the creator (or contributor, whichever you prefer) with role data manager for the Wikipedia data sets.

Once you have added the new tags, please remove all of the following tags:

EOL Content Partners: Arctic Biodiversity
EOL Content Partners: National Checklists
EOL Content Partners: Water Body Checklists
EOL Content Partners: National Checklists 2019
EOL Content Partners: Water Body Checklists 2019
EOL Content Partners: Wikipedia
EOL Content Partners
EOL Content Partners: Arctic Biodiversity
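
The requested clean-up can be written down as a small function over one record's tag list, sketched below. The tag strings come from the request above; the function name and the choice to append new tags at the end are assumptions. As in the request, the two "2019" tags are removed but do not trigger "geography".

```python
GEOGRAPHY_TRIGGERS = {
    "EOL Content Partners: Arctic Biodiversity",
    "EOL Content Partners: National Checklists",
    "EOL Content Partners: Water Body Checklists",
}
DESCRIPTIONS_TRIGGERS = {"EOL Content Partners: Wikipedia"}
TAGS_TO_REMOVE = GEOGRAPHY_TRIGGERS | DESCRIPTIONS_TRIGGERS | {
    "EOL Content Partners: National Checklists 2019",
    "EOL Content Partners: Water Body Checklists 2019",
    "EOL Content Partners",
}


def clean_tags(tags):
    """Add the new tags based on the old ones, then drop the old ones."""
    present = set(tags)
    result = [t for t in tags if t not in TAGS_TO_REMOVE]
    if present & GEOGRAPHY_TRIGGERS and "geography" not in result:
        result.append("geography")
    if present & DESCRIPTIONS_TRIGGERS and "descriptions" not in result:
        result.append("descriptions")
    return result


example_tags = clean_tags(
    ["EOL Content Partners", "EOL Content Partners: Water Body Checklists", "fish"]
)
```

The trigger check uses the original tag list, so the order of adding and removing inside the function does not matter.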

Thanks!

jhammock commented 1 month ago

Bug report! I think. We've found a few cases of zenodo records that resisted your bulk edits, Eli. This one is an interesting example, as it seems to have resisted both a Subject tag removal and a Contributor role change. Something to do with the history of the file edits, maybe? Or might this indicate transient errors during the running of the batch edit? Anyway, there don't seem to be a ton of these, so it's not critical, but if an easy experiment occurs to you for cleaning these up, it's worth trying.

eliagbayani commented 1 month ago

@jhammock Good catch, Jen. Thanks. Found the culprit: same titles, different records. These records were saved the same way in CKAN. The bulk-update script assumed that titles are unique, and so missed 281 records, e.g.

Arctic Biodiversity: Arctic Freshwater Fishes https://zenodo.org/records/13315783 https://zenodo.org/records/13315751

Africa Tree Database https://zenodo.org/records/13312623 https://zenodo.org/records/13312619

Fairbairn, 2013 https://zenodo.org/records/13316319 https://zenodo.org/records/13316311

Ramirez, et al, 2008: Ramirez et al, 2008 https://zenodo.org/records/13310465 https://zenodo.org/records/13310461

Only the 2nd record in each of these pairs was processed. Anyway, all 281 records missed last time are now processed as well.
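
To illustrate the pitfall: a lookup keyed by title keeps only one record per title, which is one way this symptom could arise (how the script actually stored its records is an assumption). Keying by the Zenodo record ID avoids it, and grouping by title makes the shared titles visible. The function names and dict keys below are hypothetical.

```python
from collections import defaultdict


def index_by_id(records):
    """Key records by their unique Zenodo ID instead of by title."""
    return {record["id"]: record for record in records}


def find_shared_titles(records):
    """Return {title: [ids]} for every title used by more than one record."""
    ids_by_title = defaultdict(list)
    for record in records:
        ids_by_title[record["title"]].append(record["id"])
    return {t: ids for t, ids in ids_by_title.items() if len(ids) > 1}


sample_records = [
    {"id": 13315783, "title": "Arctic Biodiversity: Arctic Freshwater Fishes"},
    {"id": 13315751, "title": "Arctic Biodiversity: Arctic Freshwater Fishes"},
    {"id": 13312623, "title": "Africa Tree Database"},
]

# A title-keyed dict silently collapses the first two records into one.
by_title = {record["title"]: record for record in sample_records}
shared_titles = find_shared_titles(sample_records)
```

Running `find_shared_titles` before a bulk update gives a quick list of records that need to be addressed by ID.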

eliagbayani commented 1 month ago

@KatjaSchulz Should I also add the tag 'geography' if the existing tags are:

EOL Content Partners: National Checklists 2019
EOL Content Partners: Water Body Checklists 2019

Or should I add 'geography' only for the values without " 2019":

EOL Content Partners: National Checklists
EOL Content Partners: Water Body Checklists

Thanks.

jhammock commented 1 month ago

Good question! We mulled that over, but based on the zenodo search tools decided not. We're not confident of being able to filter conveniently to exclude deprecated datasets, so we don't want to give those any other tags.

eliagbayani commented 1 month ago

After a couple of adjustments, all proposed tag clean-ups here are now implemented. Zenodo. Thanks.

jhammock commented 3 weeks ago

I've started to mess around with tags and metadata and wanted to check something before I make a mess. Eventually, we'll need a mapping of old CKAN addresses to their corresponding zenodo addresses in order to update the resource file links in the harvesting layer. I wouldn't say automating this is super important, but if we have such a mapping already or could easily make one it will certainly be useful, and I want to make sure I'm not messing that up. I've started editing the Related Works metadata, adding two things so far:

is derived from [the publication or content partner database or whatever, outside EOL]
is source of [link to EOL resource page in the publishing layer] (example)

Comments welcome on those choices!

But more urgently, @eliagbayani , I've deleted a few "is supplement to" relationships (like this one, not yet removed), thinking we only needed them in case of the file upload difficulties we had earlier. However, if those relationships are present on all our zenodo records, and are the easiest way to trace them back to the ckan records, perhaps I should hold off. Please let me know what you think about that ckan<->zenodo mapping, and in particular whether I should leave the supplement relationships alone for that or any other reason. I do want to remove them eventually to avoid confusing our zenodo visitors, but there's no great rush.

eliagbayani commented 3 weeks ago

@jhammock , I'm exploring and will get back to your message. Thanks.

eliagbayani commented 3 weeks ago

@jhammock

Your introduction of the "is source of" relationship is a welcome addition. It shows a clear link back to eol.org. I can also check whether I can do a bulk update to add the "is source of" relationship.
Regarding the ckan<->zenodo mapping: I think I already have something like it. Please check this PDF. EOL_resource_id_and_Zenodo_id_file.pdf
The "is supplement to" relationship is relevant to those records where the file (DwCA) is something that we generate and have a connector for, e.g. FishBase. I recommend we leave it for these records, as I use it to link to our connectors, that is, to facilitate auto-update of the respective Zenodo record after a connector finishes. But we can remove it for those we don't have a connector for, e.g. Reid et al, 2012.

Thanks.

jhammock commented 3 weeks ago

Thanks for that quick investigation, Eli! Yes, that mapping looks like it will make the updating of our harvest layer links very easy when the time comes. So the important thing is for me not to bother the is-supplement relationships for the live connector resources. Where's the best place for me to refer to for a list of those? In the Jenkins?

If you can handily automate the is-source relationships, that would be grand; if not, no complaints. Let me know; if you can, I'll remove the ones I've entered manually, so you can make a clean job of the whole collection. That'll only need to be done once, and I'll probably end up removing a few afterwards. Not everything with a resource page in the publishing layer is published, approved, and non-redundant :)

eliagbayani commented 3 weeks ago

@jhammock ,

If the URL starts with https://editors.eol.org/eol_php_code/applications/content_server/resources/ then that is a connector resource.
You don't need to remove those manually added is-source relationships. The script should detect whether one already exists and skip that record.
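
Both checks are simple enough to sketch. The URL prefix is the one given above; representing Related Works as a list of dicts with `relation` and `identifier` keys, and the relation value `isSourceOf`, are assumptions about the metadata shape, and the function names are hypothetical.

```python
CONNECTOR_PREFIX = (
    "https://editors.eol.org/eol_php_code/applications/content_server/resources/"
)


def is_connector_resource(url):
    """True if the URL points at a file generated by an EOL connector."""
    return url.startswith(CONNECTOR_PREFIX)


def add_is_source_of(related, eol_resource_url):
    """Return the related-works list with an 'is source of' link added.

    A record that already has such a link, for example one entered
    manually, is returned unchanged.
    """
    if any(item.get("relation") == "isSourceOf" for item in related):
        return list(related)
    return list(related) + [
        {"relation": "isSourceOf", "identifier": eol_resource_url}
    ]


# Hypothetical URLs, for illustration only.
connector_check = is_connector_resource(CONNECTOR_PREFIX + "example.tar.gz")
updated_related = add_is_source_of([], "https://example.org/resource-page")
```

Returning a new list instead of editing in place keeps the original metadata available for comparison if a bulk run needs to be audited.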

eliagbayani commented 2 weeks ago

@jhammock, Confirmed, we can add Related Works -> 'is source of' in bulk-updates. Thanks.

jhammock commented 2 weeks ago

Splendid! I'll leave that to you, then. Thanks :)

eliagbayani commented 1 week ago

> Splendid! I'll leave that to you, then. Thanks :)

Finally finished adding Related Works -> 'is source of' relationships in Zenodo for all published EOL resources. Zenodo. Thanks.

EOL / ContentImport

Migrate OpenData datasets to Zenodo #16