bokulich-lab / RESCRIPt

REference Sequence annotation and CuRatIon Pipeline
BSD 3-Clause "New" or "Revised" License
84 stars 26 forks source link

ENH: enable genome fetching using NCBI Datasets #153

Closed misialq closed 1 year ago

misialq commented 1 year ago

This PR adds a new get-ncbi-genomes action which can be used to fetch any genomes from NCBI by taxon using their new Datasets API. The action would return:

misialq commented 1 year ago

This is now ready for review. One (potential) issue here is that the q2-types-genomics version required by this new action is only available in the 2023.5 QIIME 2 conda channel: using this additional channel while creating a new environment (as it is described in the README - I updated the install command) results in an environment which is entirely 2023.5-dev. I have not found a way to only fetch q2-types-genomics from that additional channel (I think there is a bug/feature in conda which unfortunately prevents me from explicitly indicating this channel only for that package - installation fails in that case). Some solutions to this could be:

  1. wait with merging until 2023.5 is officially out (🙄 not my favourite as I have been very eager to finally release this action)
  2. update the README to clearly state that it would be the dev version until the official release (tbh, I don't like this either)
  3. any other ideas/suggestions?
nbokulich commented 1 year ago

Thanks @misialq ! I am excited to review this 😁

I propose the following steps somewhere between 1 + 2:

  1. we cut a stable 2023.2 release now
  2. once we have that release, we can review/merge this PR
  3. we update the README instructions for installing a stable release version vs. the dev version

Step 3 should be a separate PR to make it easier to revert (if desired later on). But for sure if this PR introduces a dependency on 2023.5 then we might want to have easy install instructions for those using older envs after 2023.5 anyway...

misialq commented 1 year ago

Thanks @nbokulich, that sounds perfect!

On your last point: yes, it also occurred to me today - it's not ideal and I'm thinking of making q2-types-genomics not always depend on the latest QIIME 2 version for this exact reason but we can discuss that elsewhere.

mikerobeson commented 1 year ago

Hi @misialq, just a quick drive-by comment. I've been able to download genomes, etc. Works great! I think printing a more informative error statement would be useful here. Something like: "No genomes were found that fit the query terms 'taxon', 'assembly-source', 'assembly level', 'reference', 'exact-match'."

Or if you are able to specifically determine which of these parameters did or did not return anything and print those? 🤷‍♂️

It took me a few tries to get the genomes I was interested in playing with, e.g. this failed:

qiime rescript get-ncbi-genomes \
    --p-taxon '147802' \
    --output-dir ~/Downloads/lacto-iners-ref-only  \
    --verbose

But this worked:

qiime rescript get-ncbi-genomes \
    --p-taxon '147802' \
    --p-no-only-reference \
    --output-dir ~/Downloads/lacto-iners-any \
    --verbose

This is so awesome!

misialq commented 1 year ago

Hey @mikerobeson, thanks for giving it a try! I'll look into it and see whether I can provide a more informative message 👌

misialq commented 1 year ago

Hey @mikerobeson, I have updated the error message a little, according to your suggestion. It is, unfortunately, not possible to find out which of the parameters should be updated so the message had to be a bit more generic... Let me know what you think 🙏