choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/
MIT License

Zenodo loader #2

Closed chrisiacovella closed 11 months ago

chrisiacovella commented 11 months ago

Description

This PR provides a set of functions that take a Zenodo DOI (or Zenodo record id) and return the direct links to the associated data files (e.g., the gzipped HDF5 files).
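As a rough illustration of the last step (picking the HDF5 links out of a record), here is a minimal sketch. It is not the code in this PR: the function name `select_hdf5_links` is hypothetical, and the payload shape (a `files` list whose entries carry a `key` file name and a `links.self` URL) is an assumption about what a record lookup returns, so it should be checked against the actual response.

```python
def select_hdf5_links(record, suffixes=(".hdf5.gz", ".hdf5")):
    """Return the links of files whose names end with one of `suffixes`.

    `record` is assumed to be a dict with a "files" list; each entry is
    assumed to hold the file name under "key" and its URL under
    "links" -> "self". Entries missing either field are skipped.
    """
    links = []
    for entry in record.get("files", []):
        name = entry.get("key", "")
        url = entry.get("links", {}).get("self")
        if url and name.endswith(suffixes):
            links.append(url)
    return links


# Hand-written payload for illustration only; not real Zenodo output.
example_record = {
    "files": [
        {"key": "dataset.hdf5.gz",
         "links": {"self": "https://example.org/files/dataset.hdf5.gz"}},
        {"key": "README.md",
         "links": {"self": "https://example.org/files/README.md"}},
    ]
}
hdf5_links = select_hdf5_links(example_record)
```

Filtering on the file name keeps the selection independent of how the record was looked up (DOI or record id).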

Currently 3 functions have been defined:

The hdf5_from_zenodo function accepts either a Zenodo DOI or a Zenodo record id; additionally, the code will sanitize the input if either is passed as a URL. For example, all of the following are equally valid (and in this case equivalent):

files = hdf5_from_zenodo('10.5281/zenodo.3588339')
files = hdf5_from_zenodo('https://dx.doi.org/10.5281/zenodo.3588339')
files = hdf5_from_zenodo('https://zenodo.org/record/3588339')
files = hdf5_from_zenodo('3588339')

Note that if the user passes a DOI (or a URL containing a DOI) to hdf5_from_zenodo, the function sends a request to doi.org to resolve the corresponding Zenodo URL, and the record_id is then extracted from that URL. Strictly speaking, the request is likely unnecessary, since the record_id is the numeric portion of the DOI suffix (i.e., the digits in zenodo.3588339), so it could be parsed from the DOI directly. Querying doi.org is probably the safer choice, though: it does not require us to assume the structure of the DOI, and it adds little overhead.
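For comparison, the direct-extraction alternative could look roughly like the sketch below. The helper name `record_id_from_doi` is hypothetical, and the pattern hard-codes the assumption that Zenodo DOIs have the form 10.5281/zenodo.<digits>, which is exactly the assumption the doi.org lookup avoids.

```python
import re

# Assumed DOI shape for Zenodo records: "10.5281/zenodo.<digits>".
_ZENODO_DOI = re.compile(r"10\.5281/zenodo\.(\d+)$")


def record_id_from_doi(doi):
    """Return the record id embedded in a Zenodo DOI, or None.

    Works on a bare DOI or a URL ending in one, because the pattern is
    searched for at the end of the string rather than matched from the start.
    """
    match = _ZENODO_DOI.search(doi.strip().rstrip("/"))
    return match.group(1) if match else None
```

Returning None for anything that does not fit the pattern would let the caller fall back to resolving the DOI through doi.org.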

chrisiacovella commented 11 months ago

I'm also opening this as a PR right now to see whether the changes I made to test_dev.yaml make the CI work.

chrisiacovella commented 11 months ago

Good question. I don't think urllib adds much over plain string processing for, say, getting the record_id (I'd end up doing the same string parsing on the path field returned by urllib.parse.urlparse), but it would certainly make it easier to add better validation of whether we have a URL (and whether it points to doi.org or zenodo.org). As a quick sketch, these seem like simple validator functions that would make the code a little more robust, I think.

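from urllib.parse import urlparse

def is_url(url, domain):
    # True only if url has an http(s) scheme and its host contains domain
    parsed = urlparse(url)

    if 'http' not in parsed.scheme:
        return False
    if domain not in parsed.netloc:
        return False
    return True

def parse_record_id(url):
    # The record id is the last path component of a zenodo.org URL
    if is_url(url, domain='zenodo.org'):
        parsed = urlparse(url)

        return parsed.path.split('/')[-1]
    else:
        return None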

I'll add these functions in and make tests.
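One possible shape for those tests is sketched below, written pytest-style. To keep the block self-contained it redefines the two validators, and it deliberately uses a slightly stricter variant than the snippet above: the host must equal the domain or be a subdomain of it, and a trailing slash in the path is tolerated. Those two changes are assumptions about desirable behavior, not what this PR implements.

```python
from urllib.parse import urlparse


def is_url(url, domain):
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname or ""
    # Accept the domain itself or any subdomain (e.g. dx.doi.org for doi.org).
    return host == domain or host.endswith("." + domain)


def parse_record_id(url):
    if not is_url(url, domain="zenodo.org"):
        return None
    # Drop a trailing slash so ".../record/3588339/" still yields the id.
    last = urlparse(url).path.rstrip("/").split("/")[-1]
    return last if last.isdigit() else None


def test_is_url():
    assert is_url("https://zenodo.org/record/3588339", "zenodo.org")
    assert is_url("https://dx.doi.org/10.5281/zenodo.3588339", "doi.org")
    # A bare DOI has no scheme, so it is not treated as a URL.
    assert not is_url("10.5281/zenodo.3588339", "doi.org")
    # A look-alike host must not pass a substring-style check.
    assert not is_url("https://notzenodo.org/record/1", "zenodo.org")


def test_parse_record_id():
    assert parse_record_id("https://zenodo.org/record/3588339") == "3588339"
    assert parse_record_id("https://zenodo.org/record/3588339/") == "3588339"
    assert parse_record_id("https://dx.doi.org/10.5281/zenodo.3588339") is None
    assert parse_record_id("3588339") is None
```

The look-alike host and trailing-slash cases are the ones where a plain substring check and `split('/')[-1]` would behave differently, so they are worth covering whichever variant ends up in the code.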