CatalogueOfLife / general

The Catalogue of Life

COL database dumps all broken? #80

Closed: sckott closed this issue 3 years ago

sckott commented 3 years ago

(sorry if this is the wrong place to ask, but it seems rather important)

All COL database dump links from http://www.catalogueoflife.org/DCA_Export/archive.php seem to be broken. That is, every attempt to download them fails, whether in a browser or via curl. Any thoughts? I've also emailed sp2000@sp2000.org

mdoering commented 3 years ago

Hi Scott, I suspect they have never been created. But I'll check with Naturalis. We keep all annual editions in both dwca and mysql dump format here: https://download.catalogue.life/col/

But that currently does not include the new releases since earlier this year, which we will add in a few weeks. Be aware that the monthly releases from this year so far have used temporary ids (mostly UUIDs), which will differ from release to release for changed groups. The identifiers in current use by http://www.catalogueoflife.org/col/ are created by the old PHP software hashing the names and some other information. They do not exist in the release in ChecklistBank.

Only with the next release and the full migration of the new portal and COL ChecklistBank to the regular domain catalogueoflife.org will we start issuing stable ids again. This is targeted for early/mid December.
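For anyone who wants to check whether a given dump carries temporary ids, here is a rough Python sketch that measures how many identifiers have the canonical hyphenated UUID shape. The function names `looks_like_uuid` and `uuid_fraction` are made up for illustration, and it assumes temporary ids are written in the usual 8-4-4-4-12 form. A regex is used rather than `uuid.UUID()` because the latter also accepts plain 32-character hex strings, which a hash-based id could plausibly look like.

```python
import re

# Canonical hyphenated UUID form: 8-4-4-4-12 hex digits.
# Assumption: temporary ids in a release use this form.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def looks_like_uuid(identifier: str) -> bool:
    """Return True if the identifier is a canonical hyphenated UUID."""
    return UUID_RE.fullmatch(identifier.strip()) is not None


def uuid_fraction(identifiers) -> float:
    """Return the share of identifiers that look like UUIDs (0.0 if empty)."""
    ids = list(identifiers)
    if not ids:
        return 0.0
    return sum(looks_like_uuid(i) for i in ids) / len(ids)
```

A fraction close to 1.0 over the taxon id column would suggest the file comes from a release with temporary ids; a fraction close to 0.0 would suggest some other id scheme.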

mdoering commented 3 years ago

Note: I will start work on a DwC-A and ColDP export for any dataset in ChecklistBank next week. I will then put those files for all COL releases in CLB into the download area.

sckott commented 3 years ago

Thanks! The downloads from that DCA_Export/archive.php site did use to work when I set up this project: https://github.com/sckott/col-sql. It converts the COL mysql dumps to sqlite so they are easier to use within R and users don't have to install mysql.

October 6th is the day the downloads from that site stopped working: https://github.com/sckott/col-sql/actions?page=2

However, just checked and the monthly downloads are working again as of yesterday (after I opened this issue) - see https://github.com/sckott/col-sql/actions/runs/356878161

Sounds like I shouldn't be using monthly dumps though, since they have temporary ids. I'll switch to the latest annual edition for now and switch back to monthly once stable ids are available again, assuming monthly dumps will be supplied.

mdoering commented 3 years ago

The monthly dwca and db dumps from both the DCA_Export/archive.php site and https://download.catalogue.life/col/ are based on the final mysql which has "converted" hashed IDs, so they are the ones in use on catalogueoflife.org and are stable. You can use them safely. In fact the mysql dumps should be exactly the same from both sites.

The only problem with IDs is actually in COL ChecklistBank. We move the data from there to the MySQL database, and this conversion step creates the old, hashed IDs.
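The expectation that the mysql dumps are the same from both sites can be checked locally once both files are downloaded. Below is a minimal Python sketch that compares SHA-256 checksums; the helper names `sha256_of` and `same_content` are illustrative only, and the file paths are whatever you saved the downloads as. Note that compressed archives can differ byte for byte even when the SQL inside is identical (for example because of timestamps in the archive header), so a mismatch on the archives may just mean the decompressed dumps should be compared instead.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read fixed-size chunks so large dumps are not loaded into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True if both files have identical bytes (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Usage would be something like `same_content("dump_from_archive_php.sql", "dump_from_download_area.sql")`, where both file names are placeholders.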

sckott commented 3 years ago

okay, thanks!