ENH: add action for `get_unite_data`

nbokulich commented 2 years ago

As a plugin user, I would like to use RESCRIPt to download and format fungal ITS sequence and taxonomy data from the UNITE database.

The proposed get_unite_data would:

download sequence data from UNITE
download the corresponding taxonomy information
return FeatureData[Sequence] and FeatureData[Taxonomy]

The PlutoF API can be used to download data directly from UNITE: https://plutof.docs.apiary.io/#

Questions:

should we use this to download only UNITE versioned releases? Or allow more specific queries (e.g., taxon-specific)

colinbrislawn commented 2 years ago

get_silva_data() first makes URLs with _assemble_silva_data_urls then downloads them with _retrieve_data_from_silva. Is this the same approach we should take with UNITE?

I've contacted the PlutoF support team about how best to use their API, but right now the only query I have working is one that takes a single DOI and returns a JSON file pointing to a .tar.gz file containing these files:

 27M May 10  2021 sh_refs_qiime_ver8_97_10.05.2021_dev.fasta
 39M May 10  2021 sh_refs_qiime_ver8_99_10.05.2021_dev.fasta
 38M May 10  2021 sh_refs_qiime_ver8_dynamic_10.05.2021_dev.fasta
5.7M May 10  2021 sh_taxonomy_qiime_ver8_97_10.05.2021_dev.txt
8.2M May 10  2021 sh_taxonomy_qiime_ver8_99_10.05.2021_dev.txt
8.0M May 10  2021 sh_taxonomy_qiime_ver8_dynamic_10.05.2021_dev.txt

Would it make sense to build something like _assemble_unite_data_urls(DOI, download_sequences=True)?

Example:

Download page with DOIs: https://unite.ut.ee/repository.php
Example DOI: https://doi.org/10.15156/BIO/1264708
API query of that DOI returning json link to .tar.gz: https://api.plutof.ut.ee/v1/public/dois/?format=api&identifier=10.15156%2FBIO%2F1264708

Example:

API query returning ID of 'UNITE Community' organization: https://api.plutof.ut.ee/v1/public/organizations/?owner=&abbreviation=&cheat_full_name=UNITE+Community (How do I get projects or files from this org?)

I don't work with REST APIs much and am in way over my head 🤕

nbokulich commented 2 years ago

hey @colinbrislawn thanks for tackling this issue!

Let's see what the PlutoF support team says first before making an attack plan, and maybe @misialq or I could give you some more support later this week.

Would it make sense to build something like _assemble_unite_data_urls(DOI, download_sequences=True)?

Yep! I think we should parallel get_silva_data as much as possible.

takes a single DOI and returns a JSON file pointing to a .tar.gz file containing these files

We don't want all of those files — so ideally the PlutoF API would allow more specific queries. Let's see what they say!

colinbrislawn commented 2 years ago

From the PlutoF support team:

The way you approached PlutoF API is correct. Once you have your UNITE ref. dataset DOI, you can query it using the following url format - https://api.plutof.ut.ee/v1/public/dois/?format=api&identifier=10.15156%2FBIO%2F1264708

From this API response you get a file url in the linked Media section -

"media": [
                    {
                        "content_type": "application/gzip",
                        "doi": "",
                        "name": "sh_qiime_release_10.05.2021.tgz",
                        "url": "https://files.plutof.ut.ee/public/orig/C5/54/C5547B97AAA979E45F79DC4C8C4B12113389343D7588716B5AD330F8BDB300C9.tgz"
                    }
                ],

which is the link you can use for downloading the file, e.g.

wget https://files.plutof.ut.ee/public/orig/C5/54/C5547B97AAA979E45F79DC4C8C4B12113389343D7588716B5AD330F8BDB300C9.tgz

If you are interested in checking whether there are newer versions available for the same dataset, you can look at the "related_identifiers" section (e.g. for https://api.plutof.ut.ee/v1/public/dois/?format=api&identifier=10.15156/BIO/786385) -

"related_identifiers": [
                    {
                        "related_identifier": "10.15156/BIO/1264708",
                        "related_identifier_type": "DOI",
                        "relation_type": "IsPreviousVersionOf",
                        "resource_type_general": "Dataset"
                    }
                ],

If there is a related identifier of type "IsPreviousVersionOf", I suggest either 1) updating links in your script accordingly or 2) downloading the newer dataset instead by using automated checking that is already written into your code.

Welp, that's a clear goal at least: set up json parsing in python
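
The JSON parsing described by the support team could be sketched roughly like this. It is a minimal sketch that works on already-decoded fragments rather than live responses: the helper names `media_urls` and `newer_version_dois` are hypothetical, the two sample records only reproduce the fields quoted above, and where those fields sit inside the full API response is not shown here.

```python
# Fragments copied from the two PlutoF responses quoted above.
record_1264708 = {
    "media": [
        {
            "content_type": "application/gzip",
            "doi": "",
            "name": "sh_qiime_release_10.05.2021.tgz",
            "url": "https://files.plutof.ut.ee/public/orig/C5/54/"
                   "C5547B97AAA979E45F79DC4C8C4B12113389343D7588716B5AD330F8BDB300C9.tgz",
        }
    ],
}
record_786385 = {
    "related_identifiers": [
        {
            "related_identifier": "10.15156/BIO/1264708",
            "related_identifier_type": "DOI",
            "relation_type": "IsPreviousVersionOf",
            "resource_type_general": "Dataset",
        }
    ],
}


def media_urls(attributes):
    """Return the download URL of every file listed under "media"."""
    return [item["url"] for item in attributes.get("media", [])]


def newer_version_dois(attributes):
    """Return the DOIs this record is marked as a previous version of."""
    return [
        rel["related_identifier"]
        for rel in attributes.get("related_identifiers", [])
        if rel.get("relation_type") == "IsPreviousVersionOf"
        and rel.get("related_identifier_type") == "DOI"
    ]


print(media_urls(record_1264708))
print(newer_version_dois(record_786385))
```

A record without a "related_identifiers" section simply yields an empty list, so the caller can treat "no newer version known" as the default case.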

misialq commented 2 years ago

Hey @colinbrislawn, thanks for the update! Let us know in case you need any support!

colinbrislawn commented 2 years ago

Thank you so much. (I'm an R dev, so I'm going to need a lot of support!)

Here is what I have so far:

import requests

def _assemble_unite_data_urls(DOIs, update_doi = False):
    '''Generate UNITE urls, given their DOIs.'''
    # Make DOIs iterable
    DOIs = [DOIs] if isinstance(DOIs, str) else DOIs
    # print('Get URLs for these DOIs:', DOIs)
    base_url = 'https://api.plutof.ut.ee/v1/public/dois/'\
               '?format=vnd.api%2Bjson&identifier='
    # Eventual output
    URLs = set()
    # For each DOI, get download URL of .tar.gz file
    for DOI in DOIs:
        query_data = requests.get(base_url + DOI).json()
        URL = query_data['data'][0]['attributes']['media'][0]['url']
        URLs.add(URL)
    return(URLs)

>>> _assemble_unite_data_urls(['10.15156/BIO/1264708', '10.15156/BIO/786385']) # list
{'https://files.plutof.ut.ee/public/orig/C5/54/C5547B97AAA979E45F79DC4C8C4B12113389343D7588716B5AD330F8BDB300C9.tgz', 'https://files.plutof.ut.ee/public/orig/98/AE/98AE96C6593FC9C52D1C46B96C2D9064291F4DBA625EF189FEC1CCAFCF4A1691.gz'}
>>> _assemble_unite_data_urls(('10.15156/BIO/1264708', '10.15156/BIO/786385')) # tuple
{'https://files.plutof.ut.ee/public/orig/C5/54/C5547B97AAA979E45F79DC4C8C4B12113389343D7588716B5AD330F8BDB300C9.tgz', 'https://files.plutof.ut.ee/public/orig/98/AE/98AE96C6593FC9C52D1C46B96C2D9064291F4DBA625EF189FEC1CCAFCF4A1691.gz'}
>>> _assemble_unite_data_urls({'10.15156/BIO/1264708', '10.15156/BIO/786385'}) # set
{'https://files.plutof.ut.ee/public/orig/C5/54/C5547B97AAA979E45F79DC4C8C4B12113389343D7588716B5AD330F8BDB300C9.tgz', 'https://files.plutof.ut.ee/public/orig/98/AE/98AE96C6593FC9C52D1C46B96C2D9064291F4DBA625EF189FEC1CCAFCF4A1691.gz'}
>>> _assemble_unite_data_urls('10.15156/BIO/1264708') # newest version only
{'https://files.plutof.ut.ee/public/orig/C5/54/C5547B97AAA979E45F79DC4C8C4B12113389343D7588716B5AD330F8BDB300C9.tgz'}
>>> _assemble_unite_data_urls('10.15156/BIO/786387')  # oldest version only
{'https://files.plutof.ut.ee/public/orig/01/38/0138B5D5EA2C77B8C2E5B910202FD3E60A9244FC31084E08DAD63E213A03BBFB.gz'}

How's my Python? Any feedback is appreciated.
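
One possible refinement, offered as a sketch rather than a finished implementation: the prototype appends the raw DOI to the query string, while the working URLs earlier in the thread percent-encode it (`10.15156%2FBIO%2F1264708`). The standard library can do that encoding. The helper names `build_doi_query_url` and `as_doi_list` are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.plutof.ut.ee/v1/public/dois/"


def build_doi_query_url(doi, fmt="vnd.api+json"):
    """Return the PlutoF query URL for one DOI, with the query percent-encoded."""
    # urlencode escapes '/' as %2F and '+' as %2B.
    return BASE_URL + "?" + urlencode({"format": fmt, "identifier": doi})


def as_doi_list(dois):
    """Accept one DOI string or any iterable of DOI strings."""
    return [dois] if isinstance(dois, str) else list(dois)


for doi in as_doi_list("10.15156/BIO/1264708"):
    print(build_doi_query_url(doi))
```

Passing the same dictionary as `params=` to `requests.get` should produce an equivalent request, which would avoid building the query string by hand at all.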

Let's zoom out.

I'm new to python and REST APIs, and I'm a long way away from a quality PR.

While I work through this, what's the best way for me to share code and get feedback? Does it make sense to open up a draft PR just to get feedback on my prototypes as I figure this out?

colinbrislawn commented 2 years ago

It looks like PlutoF does not enforce bidirectional links when it comes to IsNewVersionOf and IsPreviousVersionOf

So DOI 10.15156/BIO/786387 IsNewVersionOf DOI 10.15156/BIO/786349, but that older DOI does not link forward to the newer version

I'm not sure if dynamic lookup of newer versions is feasible. I'll check with their team about this

colinbrislawn commented 2 years ago

should we use this to download only UNITE versioned releases? Or allow more specific queries (e.g., taxon-specific)

Now that I've got a bit more experience with the capabilities and limitations of the PlutoF API, let's revisit this.

For now, my best way to get a specific UNITE release is to use its DOI. Each DOI is specific to a release version (v8.3, v8.2, etc) and one of four taxa scopes ("", "s", "all", "sall"). Each includes all three clustering levels (97, 99, dynamic).

I don't have an API query that builds a list of DOIs. We could scrape them from a web page, or encode them in a config file by hand. After doing this, we could request databases by version or tax scope. Or we could look up and plug in DOIs manually.

Thoughts?

nbokulich commented 2 years ago

Thanks for digging in @colinbrislawn !

I don't have an API query that builds a list of DOIs. We could scrape them from a web page, or encode them in a config file by hand.

That's fine. I think we can just create a dict of DOIs: {version_id: doi}

Users would request a release version and a taxa scope.

It would be quite nice to have an option to select the latest release by default, but it sounds like this might not be possible.

Each includes all three clustering levels (97, 99, dynamic).

Is there no way to only grab a specific file? I really do not think that we want to grab all three.

colinbrislawn commented 2 years ago

Sounds like a plan! I'll see if I can get that implemented. Would we put that dict into the function, or have it as a separate file?

Is there no way to only grab a specific file? I really do not think that we want to grab all three.

Not that I know of, as I've only found .tar.gz files. I'll ask the devs...

colinbrislawn commented 2 years ago

For now, only bundled .tar.gz files are distributed, so we will have to get everything and remove unneeded files.

For loading specific unite versions, should I use version (v8.3) or date (10.05.2021)? Should I put links between versions, DOIs, and files inside a .json file or inside the function itself?
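
Since the bundle has to be downloaded whole, sorting its members into sequence/taxonomy pairs might look something like the sketch below. The regular expression is inferred only from the six file names listed earlier in this thread, so it is an assumption about UNITE naming, and `pair_unite_files` is a hypothetical helper.

```python
import re

# Pattern inferred from the file names listed earlier in this thread.
_UNITE_FILE = re.compile(
    r"sh_(?P<kind>refs|taxonomy)_qiime_ver\d+_(?P<level>97|99|dynamic)_"
)


def pair_unite_files(names):
    """Group bundle members into {level: {'refs': name, 'taxonomy': name}}."""
    pairs = {}
    for name in names:
        match = _UNITE_FILE.search(name)
        if match is None:
            continue  # not a sequence/taxonomy file, so skip it
        pairs.setdefault(match["level"], {})[match["kind"]] = name
    return pairs


names = [
    "sh_refs_qiime_ver8_97_10.05.2021_dev.fasta",
    "sh_refs_qiime_ver8_99_10.05.2021_dev.fasta",
    "sh_refs_qiime_ver8_dynamic_10.05.2021_dev.fasta",
    "sh_taxonomy_qiime_ver8_97_10.05.2021_dev.txt",
    "sh_taxonomy_qiime_ver8_99_10.05.2021_dev.txt",
    "sh_taxonomy_qiime_ver8_dynamic_10.05.2021_dev.txt",
]
pairs = pair_unite_files(names)
print(pairs["99"])
```

In practice the names would come from `tarfile.open(path).getnames()`, and the caller could either keep one clustering level or pass every pair through.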

mikerobeson commented 2 years ago

Sorry I am late to the party. If there is anything I can do to help let me know! Thank you so much for working on this @colinbrislawn!

nbokulich commented 2 years ago

thanks for following up, @colinbrislawn !

Downloading all files and only outputting one would be very inefficient. Is it possible to use the PlutoF API to search by taxonomic group? Then users could query specific taxonomic groups in a more flexible way, similar to what is possible with the get-ncbi-data action.

Otherwise @mikerobeson what's your take on this? I think the .tar.gz bundles contain 6 sequence/taxonomy pairs. In most cases a user would presumably only want 1 pair. But I suppose this still beats the status quo (i.e., manually downloading the bundle, formatting, and importing the UNITE database), so I am in favor if a direct query via PlutoF API is not possible.

so we will have to get everything and remove unneeded files.

if we go this route, let's get everything and output everything... let the user decide which to toss. One less parameter to expose, as well (the cluster threshold).

For loading specific unite versions, should I use version (v8.3) or date (10.05.2021)?

Version number please

Should I put links between versions, DOIs, and files inside a .json file or inside the function itself?

See how it is done for get_silva_data. I believe we have a dict of {version: DOI}, no need for a separate JSON.

mikerobeson commented 2 years ago

Hi @nbokulich & @colinbrislawn,

I am thinking we might as well process it all. Both the standard and dev versions (so 12 files total). Are lower-case nucleotides still an issue with the dev versions? If so, I'll need to get to work on the mixed-case type for q2_types!

I do like the idea of using the PlutoF API to perform specific taxonomic group searching. But I think just pulling the standard and dev files should be the first priority, then we can extend later? Unless the API is generally much easier to deal with...

I was also thinking, we can just start even more simple. Just let the user download the files manually, and just have a parser parse-unite-db that will import the files. Like we did with parse-silva-taxonomy... ?

nbokulich commented 2 years ago

Hey @mikerobeson & @colinbrislawn ,

Are lower-case nucleotides still an issue with the dev versions? If so, I'll need to get to work on the mixed-case type for q2_types!

@colinbrislawn what did you find when working with the latest release?

But I think just pulling the standard and dev files should be the first priority, then we can extend later?

Sounds like a good plan! So @colinbrislawn please proceed as you were planning before.

I was also thinking, we can just start even more simple. Just let the user download the files manually, and just have a parser parse-unite-db that will import the files. Like we did with parse-silva-taxonomy... ?

I would rather not go that route. Unlike SILVA, the UNITE release files already ship in a QIIME-compatible format (with the exception of the lowercase char issue). So a UNITE parser would not have the same added value. On the other hand, a method to download specific UNITE versions automatically and record that info in provenance would have high added value...

Thank you both!

colinbrislawn commented 2 years ago

@nbokulich, Yes, lower-case nucleotides are still a problem with the UNITE files.

Right now, I get URL(s) by passing DOI(s)

_assemble_unite_data_urls('10.15156/BIO/786387')

Each DOI includes all 3 clustering levels, with matching taxonomy, bundled in a .tar.gz file

sh_refs_qiime_ver8_97_10.05.2021_dev.fasta
sh_refs_qiime_ver8_99_10.05.2021_dev.fasta
sh_refs_qiime_ver8_dynamic_10.05.2021_dev.fasta
sh_taxonomy_qiime_ver8_97_10.05.2021_dev.txt
sh_taxonomy_qiime_ver8_99_10.05.2021_dev.txt
sh_taxonomy_qiime_ver8_dynamic_10.05.2021_dev.txt

I don't have a way to get these files independently, so we always get all of them. ¯\(ツ)/¯

There are 4 different DOIs for the 2x2 combination of

fungi or all
with or without singletons

Should I try to mirror the silva function?

_assemble_silva_data_urls(version, target, download_sequences=True):

The unite version might look like this

_assemble_unite_data_urls(version = "8.3", taxascope = "fungi", singletons = True)

I don't hate it, and that does let us parse the expected file names within the .tar.gz file. But we still need to get DOIs somehow. (I can't query DOIs from version and taxscope)

It's still pretty different from SILVA because the databases are structured so differently.

mikerobeson commented 2 years ago

Hi @nbokulich & @colinbrislawn,

I realized I've not provided an update on the status of importing lower / mixed-case nucleic-acid sequences. I have a PR for importing and converting MixedCase*nucleic-acid sequences. Once that is accepted, then we should be good to go.

colinbrislawn commented 2 years ago

Thanks, Mike. Yes, I think that would replace this line before import.

mikerobeson commented 1 year ago

Just wanted to let everyone know that support for importing mixed case DNA & RNA types is now part of qiime2-2022.11. So, perhaps we can start moving ahead with this?

colinbrislawn commented 1 year ago

So, perhaps we can start moving ahead with this?

We can try. The core issue is still getting data from the UNITE server, and all the questions from my first post remain.

Plus we have a new issue: Version 9.0 now ships three .tar.gz files with progressively newer dates. No release notes are provided. I asked about this and they simply told me to use the newest one. 🙃

Here's the path forward I can see:

Get DOIs by scraping this page or reading them from a file or inline dictionary
Download all three clustering levels (because they are bundled)
Use the last (newest) of the listed tar files
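
Picking the newest tar file for a DOI could be sketched as below, assuming every entry in the "media" list carries a DD.MM.YYYY date in its name, as `sh_qiime_release_10.05.2021.tgz` does. The helper names and the three sample entries are hypothetical; only the date pattern mirrors real UNITE file names.

```python
import re
from datetime import date

_DATE_IN_NAME = re.compile(r"(\d{2})\.(\d{2})\.(\d{4})")


def date_from_name(name):
    """Parse the DD.MM.YYYY date embedded in a file name, or return None."""
    match = _DATE_IN_NAME.search(name)
    if match is None:
        return None
    day, month, year = (int(part) for part in match.groups())
    return date(year, month, day)


def newest_media(media):
    """Return the media entry whose name carries the latest date."""
    dated = [(date_from_name(m["name"]), m) for m in media]
    dated = [(d, m) for d, m in dated if d is not None]
    if not dated:
        raise ValueError("no dated file names found")
    return max(dated, key=lambda pair: pair[0])[1]


# Hypothetical entries, deliberately out of order.
media = [
    {"name": "example_16.10.2022.tgz", "url": "https://example.org/a.tgz"},
    {"name": "example_29.11.2022.tgz", "url": "https://example.org/c.tgz"},
    {"name": "example_27.10.2022.tgz", "url": "https://example.org/b.tgz"},
]
print(newest_media(media)["name"])
```

Comparing parsed dates rather than list position means the result does not depend on the order in which the API happens to return the files.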

colinbrislawn commented 1 year ago

UNITE release_25.07.2023 no longer ships multiple tar files per DOI (at least for now).

It should now be a little easier to get a single tar for each DOI via API. Example DOI

colinbrislawn commented 1 year ago

importing mixed case DNA & RNA types

This may be a dumb question: does this mean running toupper is no longer needed before import? Does the importer do this for you, automagically?

mikerobeson commented 1 year ago

Hi @colinbrislawn,

Correct. You should be able to simply import mixed case and lower case sequences, without doing any extra work. They will be converted to their respective uppercase DNA or RNA format types. I have an example in the forum here.

colinbrislawn commented 1 year ago

Should I aim to download from a list of DOIs, or just worry about one DOI at a time?

def get_unite_data(dois):
    '''For each UNITE DOI, make 3 pairs of artifacts'''

def get_unite_data(doi):
    '''Make 3 pairs of artifacts'''

Context for how Unite is organized

each release has 4 DOIs, of different tax scopes (with and without singletons, with and without fungi)
each DOI includes both reads and taxonomy for 3 clustering levels (97, 99, and dynamic)

nbokulich commented 1 year ago

@colinbrislawn thank you for getting back to this!

I suggest one DOI at a time. But what would be the API for users? Do they input a DOI or (similar to how we do for getting the SILVA db) should they request a version # and the taxon scope and RESCRIPt uses this to find the right DOI? The latter would be closer to what we do with SILVA and involve less lookup for the user, but it would not avoid us needing to update each time a new UNITE version is released. On the other hand inputting any DOI could be a security risk and/or prone to bugs and typos! So I think we hardcode the choices. @mikerobeson any thoughts?

colinbrislawn commented 1 year ago

I suggest one DOI at a time.

Understood! 👍

But what would be the API for users?

That's my question too.

If we try to avoid DOIs, both 'taxa scope', and date are needed. And we have to update each release. (version is a nested subset of date)

def get_unite_data(date, taxa_scope)

# UNITE Taxa scope and singleton inclusion
taxa_scope:
  - ""
  - "s_"
  - "all_"
  - "s_all_"

What about those 3 clustering levels?

def get_unite_data(dois, use_clustering_id = '99'):
    '''Each UNITE DOI makes a pair of artifacts'''

mikerobeson commented 1 year ago

I agree with @nbokulich that we follow the approach we've used for get-silva-data and get-gtdb-data, and let the tool handle all the fetching with hard coded links.

That being said, one thing we did for SILVA, is allow users to download and import all the files themselves, run parse-silva-taxonomy. This allows users to potentially fetch older versions of SILVA. We did not provide this functionality with GTDB. But perhaps it'd make sense to allow users the ability to parse the files they download?

Anyway, I say we provide hardcoded options for the two most recent versions. If we start to see many requests to fetch other versions, then we can consider providing other separate actions to parse files like we did for SILVA.

nbokulich commented 1 year ago

If we try to avoid DOIs, both 'taxa scope', and date are needed. And we have to update each release.

Yeah that's strange that there are multiple dates for version 9.0. That's fine, we could use date instead of version. Or call the parameter version but use date as the version name. E.g., --p-version 2023-07-18

And the other parameter would be taxon-group. Options: fungi or eukaryotes.

The third parameter would handle which singletons group to use.

What about those 3 clustering levels?

As we need to download all 3 we might as well just output all 3 files and let the user decide which they want.

I say we provide hardcoded options for the two most recent versions.

I agree! No need to go overboard. Users can open an issue to request more.

providing other separate actions to parse files like we did for SILVA

SILVA uses some special formats... UNITE already releases the QIIME-compatible datasets so I don't think that we need a special parser.

mikerobeson commented 1 year ago

SILVA uses some special formats... UNITE already releases the QIIME-compatible datasets so I don't think that we need a special parser.

Oh right! I had forgotten that UNITE already provides them as QIIME compatible. Disregard. 🤦

colinbrislawn commented 1 year ago

I think we should avoid the version keyword for UNITE as it does not match common conventions. The newest version is still 9.0, now covering 4 (!!) dates and 8 total DOIs.

strange that there are multiple dates for version 9.0

This is common for UNITE; right now, the newest version has a release date of 2023-07-18, but the internal file was updated on 2023-07-25 and has that in the file name.

The only way for users to get these dates is to first enter the DOI on the website. Which is why I was thinking of using DOI as the primary input, instead of date.

I think we are the main users of this API, as we plan to distribute databases on the data-resources pages.

Take the current UNITE release:

Here are two options to get those 4 versions:

get_unite_data(date = '25.07.2023', add_all_euks = False, add_singletons = False)
get_unite_data(date = '25.07.2023', add_all_euks = False, add_singletons = True)
get_unite_data(date = '25.07.2023', add_all_euks = True, add_singletons = False)
get_unite_data(date = '25.07.2023', add_all_euks = True, add_singletons = True)

get_unite_data(doi = '10.15156/BIO/2938079')
get_unite_data(doi = '10.15156/BIO/2938080')
get_unite_data(doi = '10.15156/BIO/2938081')
get_unite_data(doi = '10.15156/BIO/2938082')

What feels right?

nbokulich commented 1 year ago

Hi @colinbrislawn ,

I like the first option. But the parameter names may need to change to be more consistent with the nomenclature used by UNITE.

The DOIs are shorter of course, but not human readable. So I would not be in favor of passing a DOI. I prefer the other example that you gave.

The only way for users to get these dates is to first enter the DOI on the website. Which is why I was thinking of using DOI as the primary input, instead of date.

Or by checking the help docs. The date format should be YYYY-MM-DD as on UNITE to be consistent and to sort by year first (for easier reading in the help docs). We would also always default to the latest release, so the user would not need to search around for a date unless they are looking for a specific release.

I think we are the main users of this API, as we plan to distribute databases on the data-resources pages.

But that does not make us the main users. E.g., a significant number of users use RESCRIPt to download and build the SILVA database even though we also release this on data-resources.

colinbrislawn commented 1 year ago

This is very helpful to me, Nick. Thank you!

But the parameter names may need to change to be more consistent with the nomenclature used by UNITE.

That makes sense. What do you recommend?

We would also always default to the latest release, so the user would not need to search around for a date unless they are looking for a specific release.

okay!

get_unite_data() # downloads what's current

Or by checking the help docs. The date format should be YYYY-MM-DD as on UNITE to be consistent and to sort by year first (for easier reading in the help docs).

okay!

get_unite_data(date = '2023-07-18', ...)

Running that would download 2023-07-25. Are you okay with that?

The plan is to hardcode a table connecting dates to URLs?
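
Because UNITE file names use DD.MM.YYYY (e.g. 25.07.2023) while the proposed parameter uses YYYY-MM-DD, a small conversion step may be useful when building such a table. This is a minimal sketch using only the standard library, and `unite_date_to_iso` is a hypothetical helper name.

```python
from datetime import datetime


def unite_date_to_iso(unite_date):
    """Convert a UNITE file-name date (DD.MM.YYYY) to YYYY-MM-DD."""
    return datetime.strptime(unite_date, "%d.%m.%Y").strftime("%Y-%m-%d")


print(unite_date_to_iso("25.07.2023"))  # 2023-07-25
```

ISO-style dates also sort chronologically as plain strings, which fits the goal of listing releases by year first in the help docs.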

mikerobeson commented 1 year ago

Running that would download 2023-07-25. Are you okay with that?

If you want to make it clear to the user what they are downloading, you can add print statements that are printed to screen with the --verbose option:

We are downloading the latest files for "Ver 9.0 2023-07-18", specifically "DOIs: 10.15156/BIO/2938082, ..., ..." as updated on "2023-07-25"... or something like that... 🤔

nbokulich commented 1 year ago

Parameter names: "version" or "date" (see below) "taxon-group" (fungi / eukaryotes) "singletons"

Running that would download 2023-07-25. Are you okay with that?

Hm... if the dates are also meaningless then maybe let's use version instead and point to the most recent release of that version? So e.g., version 9 would download the 2023-07-25 release. If any user wants another date they can manually download and import. @colinbrislawn @mikerobeson what do you think?

The plan is to hardcode a table connecting dates to URLs?

a dictionary

colinbrislawn commented 1 year ago

if the dates are also meaningless then maybe let's use version instead and point to the most recent release of that version? So e.g., version 9 would download the 2023-07-25 release.

What happens when they add another tar file to the new 9.0, like they did for old 9.0?

I want a 'primary key' that unambiguously points to a single tar file.

Version? Nope, there's two version 9.0s

DOI? Nope, the older version 9.0 has three files per DOI

Date. OK, which one?

![image](https://github.com/bokulich-lab/RESCRIPt/assets/10355152/9ba60f8a-f0d3-46a7-9537-f68a5b5b69f6)

Nope, just like DOIs, this links to multiple tars in old 9.0.

```python
get_unite(date = '2022-10-16', ...)
# the old 16.10.2022.tgz or the newest 29.11.2022.tgz? or the middle one?
```

The only primary key that is unique is the date in the uploaded tar file and the tar file hash.

![image](https://github.com/bokulich-lab/RESCRIPt/assets/10355152/8711cb85-35d3-46fe-984e-ace3c984318d)

But there's no good way to find that!

They have bungled this really thoroughly.

The core problem is how to access the three files within old version 9.0, and the danger they do that again. Both DOIs and dates would work fine if we do not provide a way to download older files within a DOI.

When I emailed plutof.ut.ee support, they told me to always use the newest one in a DOI.

Are we cool with that?

EDIT: I've put together a Google Sheet of UNITE links

nbokulich commented 1 year ago

What happens when they add another tar file to the new 9.0, like they did for old 9.0?

We would make that key:value pair, it's not something that we cook into a URL... so we can refer to a version and then point to a specific unambiguous (e.g., most recent) release of that version.

So, e.g., the urls would be in a dict like so:

unite_dois = {
    9.0: {'fungi': {'singletons': unambiguous-url-of-the-latest-9.0-release} ...}
    8.0: {'fungi': {'singletons': unambiguous-url-of-the-latest-8.0-release} ...}
}

so we always explicitly point to the latest release of each version (and can update if they do this again)

But date is also fine as the key, as long as this is unique.

mikerobeson commented 1 year ago

When I emailed plutof.ut.ee support, they told me to always use the newest one in a DOI. Are we cool with that?

I think that is what we were going for initially. As @nbokulich reiterates:

unite_dois = {
    9.0: {'fungi': {'singletons': unambiguous-url-of-the-latest-9.0-release} ...}
    8.0: {'fungi': {'singletons': unambiguous-url-of-the-latest-8.0-release} ...}
}

Too bad they simply do not provide a "latest version" link. That would make things so much easier. 😩

colinbrislawn commented 1 year ago

Thank you both for your extensive help.

I understand the plan, and better yet, I think I can build it!

_assemble_unite_data_urls(version = '9.0', taxon_group = 'fungi', singletons = 'no singletons')

Let's circle back to this tradeoff:

so we always explicitly point to the latest release of each version (and can update if they do this again)

vs

Too bad they simply do not provide a "latest version" link.

unite_urls = {
'9.0': {'fungi': {'no singletons': 'https://files.plutof.ut.ee/public/orig/example.tgz'} ...} }

Super stable! But could be stale until someone updates it.

unite_dois = {
'9.0': {'fungi': {'no singletons': '10.15156/BIO/2938079'} ...} }

Then we query the API for the URL of the most recent file added to that DOI. Maybe unstable, but maybe more up to date? (This is what plutof.ut.ee support told me to do.)

nbokulich commented 1 year ago

Yes sounds good. Minor adjustment: {'9.0': {'fungi': {False: '10.15156/BIO/2938079'} ...} ... }

(I like making singletons a boolean the way you had it before... this makes more sense than a str as I had it in my example)

Thanks @colinbrislawn !
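
The agreed lookup could be sketched as a nested dictionary with a boolean singletons key. Only the one DOI given in this thread for version 9.0, fungi, no singletons is filled in; every other entry is deliberately left out rather than guessed, and `lookup_unite_doi` is a hypothetical helper name.

```python
# Only the single DOI stated in this thread is filled in; the rest is elided.
UNITE_DOIS = {
    "9.0": {
        "fungi": {
            False: "10.15156/BIO/2938079",  # no singletons
        },
    },
}


def lookup_unite_doi(version="9.0", taxon_group="fungi", singletons=False):
    """Return the hard-coded DOI for one release, or raise a clear error."""
    try:
        return UNITE_DOIS[version][taxon_group][singletons]
    except KeyError:
        raise ValueError(
            f"No UNITE DOI recorded for version={version!r}, "
            f"taxon_group={taxon_group!r}, singletons={singletons!r}"
        ) from None


print(lookup_unite_doi())
```

Raising a `ValueError` that names all three choices should make an unsupported combination easier to diagnose than a bare `KeyError`.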

bokulich-lab / RESCRIPt

ENH: add action for `get_unite_data` #123