AtlasOfLivingAustralia / ala-downloads

Data downloads
https://downloads.ala.org.au

Mint DOIs for static downloads #27

Open nickdos opened 5 years ago

nickdos commented 5 years ago

Now that (dynamic) downloads have DOIs, we need to add them to the static downloads served from https://downloads.ala.org.au.

I suggest an admin function to mint a DOI for any legacy entries. For new records downloads, the DOI should already be generated (see related issue), so the app should read the DOI from the DOI service(?). This should apply to any entry, including software artifacts like Open Delta.

Also requires the DOI to be included in the public display of the download entry as well as being included in the README (or equiv) file contained within the download (not sure how we do this with legacy download files - might just be a manual step for those few files/entries).

ansell commented 5 years ago

The contents of the file behind a DOI are generally not updated over time. The downloads offered by downloads.ala.org.au are designed to be updated frequently without any history over time. DOIs are not designed to simply be bookmarks or URL shorteners to a resource.

nickdos commented 5 years ago

Thanks @ansell - I'm not sure I understand your comment or possibly my issue was unclear. I'm suggesting that we generate a new DOI every time we upload a new static download to downloads.ala.org.au.

I see those files as being no different to the regular downloads. They are merely convenience files that can be obtained without having to wait for the regular download to run, as they are very big files.

ansell commented 5 years ago

The workflow for updating most of the files is to generate them using the data management Jenkins server, then copy them to archives.ala.org.au.

We don't upload new versions of those files directly to downloads.ala.org.au. downloads.ala.org.au currently regularly polls the files on archives.ala.org.au to see if they have changed and updates its local timestamp/metadata at that point.
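The polling check described above could be sketched roughly as follows. This is an illustration only, not the actual ala-downloads code: it assumes the poller compares the `ETag` and `Last-Modified` response headers of each file on archives.ala.org.au, and `RemoteState`, `state_from_headers` and `has_changed` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RemoteState:
    """What the poller remembers about a remote file between polls (hypothetical)."""
    etag: Optional[str]
    last_modified: Optional[str]


def state_from_headers(headers: Mapping[str, str]) -> RemoteState:
    """Build a RemoteState from HTTP response headers, ignoring header-name case."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return RemoteState(lowered.get("etag"), lowered.get("last-modified"))


def has_changed(previous: Optional[RemoteState], current: RemoteState) -> bool:
    """Return True when the remote file looks different from the last poll."""
    if previous is None:
        # First time this file has been seen.
        return True
    if previous.etag and current.etag:
        # Prefer the ETag when both polls returned one.
        return previous.etag != current.etag
    # Otherwise fall back to comparing the Last-Modified value.
    return previous.last_modified != current.last_modified
```

Header comparison only shows that something changed. It says nothing about keeping the previous version, which is the part a DOI would depend on.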

By comparison the offline downloads provided by biocache.ala.org.au are HTTP POST'd to doi.ala.org.au where they are archived in the AWS S3 bucket, and they should never change after that point.

When people use DOIs to reference things, there is the expectation that the information contents will not change over time (even if the syntax changes at some point in the future). If we were going to assign DOIs to files on archives.ala.org.au when changes are detected by downloads.ala.org.au, the files should be archived in the S3 bucket by doi.ala.org.au if we want to follow the DOI recommendation and provide the files in the future.
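A minimal sketch of that idea, under stated assumptions: when a change is detected, an immutable copy of the file is stored under its checksum and a DOI is minted only for content that has not been seen before. The local `archive_dir` stands in for the S3 bucket, `mint_doi` is a hypothetical callback standing in for a call to doi.ala.org.au (whose real API is not shown here), and `registry` is a plain dict standing in for persistent storage.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Callable, Dict


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_and_mint(
    source: Path,
    archive_dir: Path,
    registry: Dict[str, str],
    mint_doi: Callable[[Path, str], str],
) -> str:
    """Archive an immutable copy of `source` and return the DOI for its content.

    `registry` maps checksum -> DOI. Identical content reuses the existing
    DOI, so a DOI never points at bytes that later change.
    """
    checksum = sha256_of(source)
    if checksum in registry:
        return registry[checksum]

    archive_dir.mkdir(parents=True, exist_ok=True)
    # Name the snapshot by checksum so later versions never overwrite it.
    target = archive_dir / f"{checksum}{source.suffix}"
    shutil.copy2(source, target)

    doi = mint_doi(target, checksum)  # hypothetical call to the DOI service
    registry[checksum] = doi
    return doi
```

Keying the archive by checksum means re-polling an unchanged file costs nothing extra, while every distinct version is kept. That is also where the storage growth raised later in this thread would come from.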

ansell commented 5 years ago

The reference to https://biocache.ala.org.au/archives/ may also be confusing the issue. The AWS Load Balancer pushes those requests directly to https://archives.ala.org.au/archives/ so they never hit ala-hub or biocache-service

nickdos commented 5 years ago

Assuming we do go ahead and mint DOIs on these downloads, then it seems we'll need to make major changes to the process of generating them, storing them (no longer on archives but on the DOI server) and referencing them on downloads.ala.org.au. We would also have to accept that all historical versions of these files will be stored on the DOI server from now on, which could take up significant disk space over time.

So it seems to come down to the question of whether we think it's a good idea to have DOIs on these downloads and whether the costs I mentioned above are worth the benefits of having the DOIs.

I personally think it's worth it, for 2 reasons: 1. consistent messaging from ALA saying we want DOIs to be used wherever possible (and the converse argument that being inconsistent sends a bad message to users that we're not really serious and can't be bothered in some areas). And 2. Some people will use this data in publications and there is no easy way for someone to reproduce their work as that download has since been replaced by a newer one - DOI'ing them will ensure that exact download file is available for anyone to grab no matter how many years later (in theory).

ansell commented 5 years ago

DOI doesn't store files on disk right now. Everything is in a single S3 bucket, which we could work on to make storage of old files cheaper.

I like DOIs, just working through what would be necessary to change in this case.

nickdos commented 5 years ago

That's all good - better to have more details so we can realistically estimate the work.

M-Nicholls commented 5 years ago

If we're going to look at the static downloads, let's do the whole thing:

- which static downloads should we produce, and at what frequency? (Do we have download stats for the current set?)
- what others do we need (e.g. a full ALA dump)? I think these should have DOIs.
- UI and UX (how do users find these, and when do we refer users to them instead of the other functionality?)
- systems architecture for how they are produced

ansell commented 5 years ago

ala-downloads should be currently pushing statistics about record downloads, on a data resource/collection/institution/data provider basis, through to logger.ala.org.au, but I haven't verified that it is actually working:

https://github.com/AtlasOfLivingAustralia/ala-downloads/blob/master/grails-app/services/au/org/ala/downloads/LoggerService.groovy#L13

That linkage is based on the metadata.json file that is produced by biocache-store and stored next to the .zip files on archives.ala.org.au.

E.g.:

https://archives.ala.org.au/archives/exports/Algae/metadata.json

https://archives.ala.org.au/archives/exports/Algae/

Does doi.ala.org.au push statistics to logger.ala.org.au currently? biocache-service is pushing statistics about offline downloads to logger.ala.org.au before DOIs are assigned, but doi.ala.org.au has all of the necessary information in the metadata to enable it to link to logger.ala.org.au in the future.