geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Pipeline data needs a "backup" for archive and reproducibility #9

Closed. kltm closed this issue 5 years ago.

kltm commented 6 years ago

Assuming that we have our releases in S3/CloudFront successfully, we still need to have a backup, especially of the upstream annotation data that we have captured for each of our monthly releases.

Talking to @cmungall, the options here may be:

kltm commented 6 years ago

I think that the OSF one (or the third) would be easier to handle short-term, but it would be very very nice to get something like the second for future use.

kltm commented 6 years ago

Digging into this a bit. We'd need a way of creating projects dynamically so that we could get a DOI per release. As well, the individual file size limit will be a non-starter for us: http://help.osf.io/m/faqs/l/726460-faqs#what-is-the-individual-file-size-limit -- goa_uniprot_all(-src).gaf.gz have already crossed 5GB.

kltm commented 6 years ago

Okay, I'm now exploring DOIs and backups as separate problems, as osf.io seems to be one of the few places to easily do both and they are out for the foreseeable future for our use case. For DOIs, I'm now looking at creating BagIt archives containing metadata and "final" URLs for the files in the releases, with those archives then being binned somewhere to get the DOI using EZID, Zenodo, Dryad (if our metadata is CC0?), or similar. For backups, I'm looking at affiliated institutions that are connected to archives (e.g. Chronopolis) and LBL internal resources.
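
For the DOI side, a minimal sketch of wrapping release metadata and "final" URLs in a BagIt archive, assuming the Python bagit library; the path and metadata values here are placeholders, not anything from the actual pipeline:

    import bagit

    # Directory already holding the release metadata and the list of
    # "final" release URLs we want to preserve (placeholder path).
    release_dir = "release-metadata-2018-08-09"

    # make_bag() moves the payload into data/ and writes the manifests
    # and bag-info.txt for us.
    bag = bagit.make_bag(release_dir, {
        "Source-Organization": "Gene Ontology Consortium",
        "External-Description": "Metadata and final URLs for a GO monthly release",
    })
    print(bag.info)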

kltm commented 6 years ago

@cmungall I'll want to touch base with you at some point to see if any of the above, or something else entirely, would align with other GO strategy points.

kltm commented 6 years ago

So far, EZID and Zenodo seem to align better. I'm also looking at minid, but I'm not sure of its advantage over the other two, although the BD2K/NIH affiliation might be important. Ideally, we'd like to run no special software for generating and archiving the metadata.

kltm commented 6 years ago

Given the larger file limits with Zenodo, and the possibility (according to their docs) of negotiating them, it may be possible to archive full releases.

cmungall commented 6 years ago

Separation of concerns seems like a good idea given the above.

I had issues with the zenodo API timing out on small-to-mid size files before (hundreds of MB, as I recall)

Another option is git annex + something like archive.org. See discussion here: https://github.com/OBOFoundry/OBOFoundry.github.io/issues/494

Any time spent investigating bagit is well-spent. See also https://github.com/owlcs/owlapi/issues/375#issuecomment-350332230

kltm commented 6 years ago

Ah--good to know! I'm warming up to the idea of bagit as a proxy for archival/reproducible artifacts. I think, for our case, git annex would be another versioning layer, but would not actually provide storage (we already do S3) or DOIs. The prov parts you mention look like the minid spec. minid (apparently already built on ezid) and ezid itself seem like likely candidates then.

Plan: Explore (bagit + ezid) and minid, drop zenodo for now. Wait on word about backup.
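
For the ezid half of that exploration, minting a test identifier against EZID's documented HTTP/ANVL API might look something like the sketch below; the shoulder, credentials, metadata values, and target URL are all placeholders, not our real ones:

    import requests

    EZID = "https://ezid.cdlib.org"
    SHOULDER = "doi:10.5072/FK2"  # EZID's documented test shoulder

    # EZID takes ANVL-style "key: value" lines as the request body.
    anvl = "\n".join([
        "_target: http://release.geneontology.org/2018-08-09/",
        "datacite.creator: Gene Ontology Consortium",
        "datacite.title: Gene Ontology release 2018-08-09",
        "datacite.publisher: Gene Ontology Consortium",
        "datacite.publicationyear: 2018",
        "datacite.resourcetype: Dataset",
    ])

    resp = requests.post(EZID + "/shoulder/" + SHOULDER,
                         data=anvl.encode("utf-8"),
                         headers={"Content-Type": "text/plain; charset=UTF-8"},
                         auth=("apitest", "apitest"))  # placeholder credentials
    print(resp.status_code, resp.text)  # e.g. "success: doi:10.5072/FK2..."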

kltm commented 6 years ago

Dropping osf.io for now--further debugging unlikely to be productive as we are likely over the file limit.

kltm commented 6 years ago

Okay, we now have (https://osf.io/6v3gx/) the ability to have bdbags pointing to arbitrary releases, either as clobbering versions or as a series of dated releases. This is not completely satisfying as:

OSF is almost there, but does seem to have a footing more appropriate for a group "lab notebook".
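
For the record, the shape of those bdbags is roughly the following sketch, using bdbag's remote-file-manifest support so the bag points at the release files rather than containing them; the URL, size, checksum, and bag name are placeholders:

    import json
    import os
    from bdbag import bdbag_api

    # Each entry ends up as a fetch.txt line plus a checksum manifest entry.
    remote_manifest = [{
        "url": "http://release.geneontology.org/2018-08-09/annotations/goa_uniprot_all.gaf.gz",
        "length": 5500000000,
        "filename": "annotations/goa_uniprot_all.gaf.gz",
        "sha256": "<checksum-of-the-release-file>",
    }]

    with open("remote-file-manifest.json", "w") as fh:
        json.dump(remote_manifest, fh)

    bag_dir = "go-release-2018-08-09"
    os.makedirs(bag_dir, exist_ok=True)
    bdbag_api.make_bag(bag_dir,
                       algs=["sha256"],
                       remote_file_manifest="remote-file-manifest.json",
                       metadata={"External-Description": "GO release 2018-08-09"})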

kltm commented 6 years ago

The BDBags are small and Zenodo allows for DOI versioning, so that might be the better choice for us now. Or, given that Zenodo can work with GitHub, we could just have a release there.

kltm commented 6 years ago

Going through the Zenodo docs, it is looking much more promising. As well:

 Heads up! We will be launching a new file API which is significantly more performant than the current API and which supports much larger file sizes. The current API supports 100MB per file, the new supports 50GB per file.

Given this, and the fact that they work with CERN and actively solicit larger use cases (from the FAQ http://help.zenodo.org/), I think it may be worth exploring them as an archive as well, especially given that the survey of other services I've done over the last couple of nights has not really turned up much. @cmungall You may want to revisit this at some point for the OBO release.
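
For context, the newer file API hangs off the bucket URL of a deposition; a minimal sketch against the sandbox (the token is a placeholder) looks something like:

    import requests

    ZENODO = "https://sandbox.zenodo.org/api"
    TOKEN = "REPLACE_ME"  # personal access token with deposit scopes

    # Create an empty deposition; the response carries the bucket URL that
    # the newer, large-file upload API works against.
    resp = requests.post(ZENODO + "/deposit/depositions",
                         params={"access_token": TOKEN},
                         json={})
    resp.raise_for_status()
    deposition = resp.json()

    print(deposition["id"], deposition["links"]["bucket"])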

kltm commented 6 years ago

Looking at 1.7G vs 3MB (~2500 models) for the legacy versus noctua models. Even if there is a lot of growth, it will be a while before we push the sizes too large. We'll go ahead and compress to get it into CF (CloudFront) for now.
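
The compression step itself would be nothing fancy; a sketch with placeholder directory and file names (not the pipeline's actual layout):

    import tarfile

    MODELS_DIR = "models"               # hypothetical directory of models to archive
    OUTPUT = "models.tar.gz"

    # tar+gzip the whole directory before pushing it out to S3/CloudFront.
    with tarfile.open(OUTPUT, "w:gz") as tar:
        tar.add(MODELS_DIR, arcname="models")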

kltm commented 6 years ago

Well, this is wrapping up. One of the few remaining major issues is how much metadata we want to associate with the drops. Currently just Chris and I are on there as "creators", but given the mechanisms available for citation of the data drops (and funding) via ORCIDs, we may want to add all contributors to the DOI ref. As we cannot undo past metadata, only add to it for new versions in the future, we probably want to get something at least roughly correct from the beginning. As this will be a monthly deposit, I will continue using the sandbox for the time being, then transfer later.
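
For reference, the creator metadata on a Zenodo deposition would be set along these lines; the title, names, ORCIDs, deposition id, and token are placeholders:

    import requests

    ZENODO = "https://sandbox.zenodo.org/api"
    TOKEN = "REPLACE_ME"
    DEPOSITION_ID = 123456  # hypothetical

    metadata = {
        "metadata": {
            "title": "Gene Ontology data release 2018-08-09",
            "upload_type": "dataset",
            "description": "Archival BDBag for a GO monthly release.",
            "creators": [
                # Either the consortium as a whole, or individual
                # contributors credited with ORCIDs, e.g.:
                # {"name": "Lastname, Firstname", "orcid": "0000-0000-0000-0000"},
                {"name": "Gene Ontology Consortium"},
            ],
        }
    }

    resp = requests.put("{}/deposit/depositions/{}".format(ZENODO, DEPOSITION_ID),
                        params={"access_token": TOKEN},
                        json=metadata)
    resp.raise_for_status()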

kltm commented 6 years ago

Okay, we'll ignore the metadata question for now and play forward, waiting on the archive for later.

kltm commented 6 years ago

As well, we'll be moving forward with just crediting "archive maintainers", not all contributors.

kltm commented 6 years ago

Okay, we are now getting the reference set into Zenodo, but the full "archive" set is still having problems. Here is an example of a failing run:

    05:34:41 INFO:zenodo-version-update:new bucket URL: https://zenodo.org/api/files/7972dc69-f032-4aab-8b66-daf02ded6a66
    05:35:20 Traceback (most recent call last):
    05:35:20   File "/var/lib/jenkins/workspace/neontology_pipeline_release-L3OLSRDNGI3ZIUODKFYUI4AO45X5C6RUGMOQAC5WV2Q6ZQOIFHMA/go-site/mypyenv/lib/python3.5/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    05:35:20     chunked=chunked)
    05:35:20   File "/var/lib/jenkins/workspace/neontology_pipeline_release-L3OLSRDNGI3ZIUODKFYUI4AO45X5C6RUGMOQAC5WV2Q6ZQOIFHMA/go-site/mypyenv/lib/python3.5/site-packages/urllib3/connectionpool.py", line 354, in _make_request
    05:35:20     conn.request(method, url, **httplib_request_kw)
    05:35:20   File "/usr/lib/python3.5/http/client.py", line 1106, in request
    05:35:20     self._send_request(method, url, body, headers)
    05:35:20   File "/usr/lib/python3.5/http/client.py", line 1151, in _send_request
    05:35:20     self.endheaders(body)
    05:35:20   File "/usr/lib/python3.5/http/client.py", line 1102, in endheaders
    05:35:20     self._send_output(message_body)
    05:35:20   File "/usr/lib/python3.5/http/client.py", line 936, in _send_output
    05:35:20     self.send(message_body)
    05:35:20   File "/usr/lib/python3.5/http/client.py", line 905, in send
    05:35:20     self.sock.sendall(datablock)
    05:35:20   File "/usr/lib/python3.5/ssl.py", line 891, in sendall
    05:35:20     v = self.send(data[count:])
    05:35:20   File "/usr/lib/python3.5/ssl.py", line 861, in send
    05:35:20     return self._sslobj.write(data)
    05:35:20   File "/usr/lib/python3.5/ssl.py", line 586, in write
    05:35:20     return self._sslobj.write(data)
    05:35:20 BrokenPipeError: [Errno 32] Broken pipe

That said, the manual steps for this run fine, if a bit annoying: https://github.com/geneontology/pipeline/blob/master/README.md

We need to either track down and fix the python/requests issue, or just switch to the working curl.
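
The working curl path is essentially a single --upload-file PUT against the bucket URL; on the requests side, one thing to try is streaming the file handle with a generous read timeout, roughly as in this sketch (bucket URL, file name, token, and timeout values are placeholders/guesses, not a confirmed fix):

    import os
    import requests

    TOKEN = "REPLACE_ME"
    BUCKET = "https://zenodo.org/api/files/<bucket-id>"  # links.bucket from the deposition
    PATH = "go-release-archive.tgz"                      # hypothetical archive file

    with open(PATH, "rb") as fh:
        resp = requests.put("{}/{}".format(BUCKET, os.path.basename(PATH)),
                            data=fh,                     # streamed from disk, not read into memory
                            params={"access_token": TOKEN},
                            timeout=(30, 3600))          # (connect, read) in seconds
    resp.raise_for_status()
    print(resp.json())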

kltm commented 5 years ago

Started on this, but sandbox.zenodo.org seems to be basically paralyzed. I've contacted them.

kltm commented 5 years ago

The sandbox is back up and working again.

kltm commented 5 years ago

Okay, actual progress. Using a Chrome variant (this didn't work in Firefox), I captured the commands the web UI was sending back, which used identifiers not found in either the deposition or the files information. However, poking around at random, it turns out that the bucket URLs themselves are also an API, and that API exposes the URLs and identifiers I needed. In retrospect, it seems I was either deleting the bucket "handle" (but not the file) or using the old API, which didn't quite map onto the bucket universe. Now that both the DELETE and the PUT are firmly and completely in bucket-land, this seems to be working as expected. We'll see if it works tomorrow.
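
In other words, everything now goes through the bucket URL itself; a rough sketch of the pattern (token, bucket id, and file name are placeholders):

    import requests

    TOKEN = "REPLACE_ME"
    BUCKET = "https://sandbox.zenodo.org/api/files/<bucket-id>"  # links.bucket
    PARAMS = {"access_token": TOKEN}
    FILENAME = "go-release-archive.tgz"

    # 1. List what is currently in the bucket.
    contents = requests.get(BUCKET, params=PARAMS).json()["contents"]

    # 2. DELETE the old copy using the links the bucket listing returns.
    for entry in contents:
        if entry["key"] == FILENAME:
            requests.delete(entry["links"]["self"], params=PARAMS).raise_for_status()

    # 3. PUT the replacement back to the same key.
    with open(FILENAME, "rb") as fh:
        requests.put(BUCKET + "/" + FILENAME, data=fh, params=PARAMS).raise_for_status()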

kltm commented 5 years ago

The last release seems to have been a success.