get_unite_data functions

colinbrislawn commented 2 years ago

Early code to close https://github.com/bokulich-lab/RESCRIPt/issues/123

colinbrislawn commented 9 months ago

Good evening,

Now that we have a plan for the API, I'm ready to dive into implementation.

I'm very new to this sort of Python dev, so all feedback is greatly appreciated!

Is there a recommended linter / code-style I should use?

nbokulich commented 9 months ago

Thanks @colinbrislawn ! Exciting to see this coming together!

I have answered some of your questions inline and left a few additional comments. Please just ask questions as they come up and I can answer.

One general comment: could you move this code over to a new module, get_unite.py? Just to keep the code more organized for the different get-* actions. Thanks!

@mikerobeson any chance you could do a code review when the time is ripe?

mikerobeson commented 9 months ago

One general comment: could you move this code over to a new module, get_unite.py? Just to keep the code more organized for the different get-* actions. Thanks!

I agree. Separate this out into get_unite.py or get_unite_data.py, or something similar. For example, we have separate modules for:

get_data.py : SILVA
get_gtdb_data.py: GTDB
ncbi.py : GenBank
...

@mikerobeson any chance you could do a code review when the time is ripe?

Yarp! I'll look into this today and throughout the week.

Again thank you so much for working on this @colinbrislawn! 🌟

colinbrislawn commented 8 months ago

OK! I've moved this into a single new file as suggested.

I've also renamed and organized the helper functions so they chain together linearly! They now go:

doi = _unite_get_doi(version, taxon_group, singletons)
url = _unite_get_url(doi)
tgz = _unite_get_tgz(url, download_path)
...

I'm still struggling to import .fasta files, which makes me think I'm missing something obvious / fundamental about Q2 Artifacts.

fasta_file = open("/tmp/tmpduape95o/sh_refs_qiime_ver9_97_25.07.2023_dev.fasta", 'r')

qiime2.Artifact.import_data('FeatureData[Sequence]', fasta_file)

Error and traceback

``` qiime2.Artifact.import_data('FeatureData[Sequence]', fasta_file) Traceback (most recent call last): File "", line 1, in File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/result.py", line 327, in import_data return cls._from_view(type_, view, view_type, provenance_capture, File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/result.py", line 342, in _from_view pm = qiime2.sdk.PluginManager() File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 67, in __new__ self._init(add_plugins=add_plugins) File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 105, in _init plugin = entry_point.load() File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2518, in load return self.resolve() File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2524, in resolve module = __import__(self.module_name, fromlist=['__name__'], level=0) File "/home/biouser/bin/github-repos/RESCRIPt/rescript/plugin_setup.py", line 27, in from .cross_validate import (evaluate_cross_validate, File "/home/biouser/bin/github-repos/RESCRIPt/rescript/cross_validate.py", line 14, in from q2_feature_classifier._consensus_assignment import ( ImportError: cannot import name '_consensus_assignments' from 'q2_feature_classifier._consensus_assignment' (/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_feature_classifier/_consensus_assignment.py) ```

nbokulich commented 8 months ago

Hi @colinbrislawn regarding the error you get when trying to import sequences, it looks your branch is very far out of date, this is why this confusing error occurs. Could you please update your branch and try again?

colinbrislawn commented 8 months ago

Oh, good call!

I've synced my fork. Now every way I try to import shows this error:

TypeError: Parameter in callable without QIIME type: 'new_rank_handle'

Traceback

```python qiime2.Artifact.import_data('FeatureData[RNASequence]', fasta_file) Traceback (most recent call last): File "", line 1, in File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/result.py", line 327, in import_data return cls._from_view(type_, view, view_type, provenance_capture, File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/result.py", line 342, in _from_view pm = qiime2.sdk.PluginManager() File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 67, in __new__ self._init(add_plugins=add_plugins) File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 105, in _init plugin = entry_point.load() File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2518, in load return self.resolve() File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2524, in resolve module = __import__(self.module_name, fromlist=['__name__'], level=0) File "/home/biouser/bin/github-repos/RESCRIPt/rescript/plugin_setup.py", line 228, in plugin.methods.register_function( File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/plugin/plugin.py", line 343, in register_function method = qiime2.sdk.Method._init(function, inputs, parameters, outputs, File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/action.py", line 592, in _init signature = qtype.MethodSignature(callable, inputs, parameters, File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/core/type/signature.py", line 124, in __init__ self._parse_signature(callable, inputs, parameters, outputs, File "/home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/core/type/signature.py", line 198, in _parse_signature raise TypeError("Parameter in callable without QIIME type:" TypeError: Parameter in callable without QIIME type: 'new_rank_handle' ```

mikerobeson commented 8 months ago

The parameter new_rank_handle, is only used within the merge.py script, here. Since you were working from a much older version of the code, check to make sure that you have the most recent version of plugin_setup.py, and associated files, etc..

colinbrislawn commented 8 months ago

I have pushed a working demo!

There are a bunch of print statements for debugging. I can remove as many as needed.

This includes two versions of get_unite_data* with different internal logic. I'm not really happy with either, so I would appreciate your perspective on how best to organize the untaring and importing.

colinbrislawn commented 8 months ago

I continue to organize this pipeline and break it down into ✨clear steps✨

Current problem: Unlike other dbs, Unite gives us three clustering ids

It might be best to pull the information from the file name as you're walking through and use that to later "hard-code" a generic name for each of these output files, perhaps set up as a tuple or dict? Like this:

example_output = {
    'sequences' : 
        ('ver9_97_25.07.2023_dev',  
         <artifact: FeatureData[Sequence] uuid: 3f602392-6416-4742-9710-fd2778a95227>),
        ('ver9_99_25.07.2023_dev',
         <artifact: FeatureData[Sequence] uuid: 773ac252-3c04-42b4-a5bc-74d13fe9a629>),
         ...,
    'taxonomy':
        ('ver9_97_25.07.2023_dev', 
         <artifact: FeatureData[Taxonomy] uuid: f39464a8-b3a5-47f0-bb8c-9889c9da6431>),
        ('ver9_99_25.07.2023_dev',
          <artifact: FeatureData[Taxonomy] uuid: e8bd1e45-c311-4730-ac4f-126bd39ca50e>),
         ...}

Want me to return this data structure?

Unless... 🤔

All the other get_data_* functions return a single clustering level.

Just because UNITE includes three does not mean we want all three. (Are we going to distribute 97% dbs?)

@nbokulich, we already discussed this, but why not return a single clustering level?

get_unite_data(version, taxon_group, cluster_id = '99', singletons=False)

nbokulich commented 8 months ago

why not return a single clustering level?

I agree that most users will just want one clustering level at a time. But some might want to try different clustering levels. As this action would need to download all three in a bundle we might as well output all 3 and let the user decide which they want so that the user does not need to do this redundantly. On the other hand, having them select a specific clustering level would make this much more explicit in provenance... so I would be convinced if you want to go that route!

mikerobeson commented 8 months ago

RE multiple db downloads: I am fine either way, I just figured since the code is already downloading all of the files, we might as well save them. We can always encourage the use of the 99% clustering in the help text...

RE returning the data structure: it doesn't matter much to me. As long as it's easy to access the information in the way you need. My main point was to recommend explicitly keeping track of the file names & clustering level, so that they can be intuitively accessed. For example, a user might like to use the API directly, and it'll be easier for them to extract what they want. Basically, this will allow us the option to only return the 99% clustering via the CLI, but allow the user to pick what they want via the API. Though I am still in favor of returning all of the databases...

colinbrislawn commented 8 months ago

Thanks for talking through this with me.

The redundancy of redownloading made me want to keep everything. But the current messiness in the pipeline is making me reconsider.

Remember, each .tgz file includes 6 pairs of files, 3 pairs in the root and 3 more in developer/. For completeness, we could return 12 artifacts per DOI! 😬

a specific clustering level would make this much more explicit in provenance
so that they can be intuitively accessed

Passing cluster_id = 'dynamic' seems both explicit and intuitive to me.

Heck, the biggest UNITE file is only 226mb. Downloading three times is less than one CD

mikerobeson commented 8 months ago

Thanks for talking through this with me.

No worries. 😄

Remember, each .tgz file includes 6 pairs of files, 3 pairs in the root and 3 more in developer/. For completeness, we could return 12 artifacts per DOI! 😬

True, but we return as many artifacts from pipelines like core-metrics. So, I'm not too concerned, about the number of artifacts. But now that you mention it, I wonder if so many files will cause confusion for the user. They'll inevitably ask, "Which files do I use?". I guess for the sake of simplicity, we can just grab the specific files as your are currently doing? I figure if enough users want it all, they can let us know, and we implement later?

🤔 Sorry, I feel I am waffling back and forth here.

colinbrislawn commented 8 months ago

🤔 Sorry, I feel I am waffling back and forth here.

Let's give ourselves credit! I think we are both moving from "we might as well" towards an argument for simplicity.

I wonder if so many files will cause confusion for the user. They'll inevitably ask, "Which files do I use?".

And instead, we can suggest a reasonable default!

I figure if enough users want it all, they can let us know, and we implement later?

Exactly! If it's ever requested it would make a "good-first-issue"

colinbrislawn commented 8 months ago

I still need to add testing, but the main structure matches our conversation about cluster_id.

_unite_get_qza(cluster_id, download_path):
  # Returns: Tuple containing tax_results and seq_results
get_unite_data(version, taxon_group, cluster_id = '99', singletons=False):
  # Returns: Tuple containing tax_results and seq_results

I like how this works! Did I miss anything?

colinbrislawn commented 8 months ago

Quick flake8 questions:

from q2_types.feature_data import ??? #what's needed for these two lines?
...  qiime2.Artifact.import_data('FeatureData[Taxonomy]', file_path))
...  qiime2.Artifact.import_data('FeatureData[Sequence]', file_path))

During those imports, should I list format too?

mikerobeson commented 8 months ago

from q2_types.feature_data import ??? #what's needed for these two lines?
...  qiime2.Artifact.import_data('FeatureData[Taxonomy]', file_path))
...  qiime2.Artifact.import_data('FeatureData[Sequence]', file_path))

It depends...

If you look at the get_gtdb.py and test_get_gtdb.py, code. I have no such import statements. Because I have import qiime2 in the code and simply run qiime2.Artifact.import_data.

But to answer your question you can do something like:

from q2_types.feature_data import (FeatureData, Taxonomy, Sequence)

Though you may not need FeatureData...

I usually do try and provide a format too.

colinbrislawn commented 8 months ago

I don't want to cross link, but there's a bug that means downloads can fail silently then crash when extracting
https://github.com/psf/requests/issues/ 4956

I was tempted repeat the download until it fails, but that's asking for something

mikerobeson commented 8 months ago

If possible, you can simply set a specific number of attempts, like 10, before erroring out. We do this here for the get-ncbi-data action.

colinbrislawn commented 8 months ago

I've reorganized internals again and added some testing. I'll add patch() and test data next

colinbrislawn commented 8 months ago

Here are the questions I'm considering

How much flexibility should I allow in these tests? For the API endpoint, we want to query it to get new data and allow the maintainers to update new files.
Atomic testing presumes independence, but many of my functions are deeply related and meant to be called on each others outputs. Do I aim for atomic, or lean into the relatedness?

Thank you again for all your help. I'm treating this as a opportunity to learn modern python, so I appreciate the extra time you both are taking to help me understand why we build this way.

colinbrislawn commented 8 months ago

Where do the tests run and does it have internet access? Should I build these tests to work offline?

colinbrislawn commented 8 months ago

I'm adding citations.

The newest Unite paper I found on GScholar is 2019 PMC6324048.

Per the author's request:

When using this resource, please cite it as follows: Abarenkov, Kessy; Zirk, Allan; Piirmann, Timo; Pöhönen, Raivo; Ivanov, Filipp; Nilsson, R. Henrik; Kõljalg, Urmas (2023): UNITE QIIME release for Fungi. Version 18.07.2023. UNITE Community. https://doi.org/10.15156/BIO/2938079

When using this resource, please cite it as follows: Abarenkov, Kessy; Zirk, Allan; Piirmann, Timo; Pöhönen, Raivo; Ivanov, Filipp; Nilsson, R. Henrik; Kõljalg, Urmas (2023): UNITE QIIME release for Fungi 2. Version 18.07.2023. UNITE Community. https://doi.org/10.15156/BIO/2938080

So they want the database DOI included in the citation. What's a good way to do that?

mikerobeson commented 8 months ago

Regarding citations. I think you should add both, or as many relevant citations as needed. For example, we do this for SILVA, and GTDB. Just append the citations to this file.

colinbrislawn commented 8 months ago

TypeError: Missing builtin argument 'ctx', got 'version'

what on earth?

Full error:

```python (qiime2-2023.7) biouser@home-PC:~/bin/github-repos/RESCRIPt/rescript$ pytest --verbose tests/test_get_unite.py -k 'test_get_unite_data2' ============================================================= test session starts ==============================================================platform linux -- Python 3.8.15, pytest-7.4.0, pluggy-1.3.0 -- /home/biouser/miniconda3/envs/qiime2-2023.7/bin/python3.8 cachedir: .pytest_cache rootdir: /home/biouser/bin/github-repos/RESCRIPt plugins: anyio-3.7.1, typeguard-2.13.3 collected 0 items / 1 error ==================================================================== ERRORS ====================================================================______________________________________________ ERROR collecting rescript/tests/test_get_unite.py _______________________________________________../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/_pytest/runner.py:341: in from_call result: Optional[TResult] = func() ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/_pytest/runner.py:372: in call = CallInfo.from_call(lambda: list(collector.collect()), "collect") ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/_pytest/python.py:531: in collect self._inject_setup_module_fixture() ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/_pytest/python.py:545: in _inject_setup_module_fixture self.obj, ("setUpModule", "setup_module") ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/_pytest/python.py:310: in obj self._obj = obj = self._getobj() ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/_pytest/python.py:528: in _getobj return self._importtestmodule() ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/_pytest/python.py:617: in _importtestmodule mod = import_path(self.path, mode=importmode, root=self.config.rootpath) ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/_pytest/pathlib.py:565: in import_path importlib.import_module(module_name) ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/importlib/__init__.py:127: in import_module return _bootstrap._gcd_import(name[level:], package, level) :1014: in _gcd_import ??? :991: in _find_and_load ??? :975: in _find_and_load_unlocked ??? :671: in _load_unlocked ??? ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/_pytest/assertion/rewrite.py:178: in exec_module exec(co, module.__dict__) tests/test_get_unite.py:12: in from qiime2.plugins import rescript :991: in _find_and_load ??? :971: in _find_and_load_unlocked ??? :914: in _find_spec ??? ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/plugins.py:438: in find_spec plugin = self._plugin_lookup(plugin_name) ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/plugins.py:415: in _plugin_lookup pm = qiime2.sdk.PluginManager() ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py:67: in __new__ self._init(add_plugins=add_plugins) ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py:105: in _init plugin = entry_point.load() ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py:2518: in load return self.resolve() ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py:2524: in resolve module = __import__(self.module_name, fromlist=['__name__'], level=0) plugin_setup.py:984: in plugin.pipelines.register_function( ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/plugin/plugin.py:389: in register_function pipeline = qiime2.sdk.Pipeline._init(function, inputs, parameters, ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/action.py:751: in _init signature = qtype.PipelineSignature(callable, inputs, parameters, ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/core/type/signature.py:124: in __init__ self._parse_signature(callable, inputs, parameters, outputs, ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/core/type/signature.py:169: in _parse_signature raise TypeError("Missing builtin argument %r, got %r" % E TypeError: Missing builtin argument 'ctx', got 'version' =============================================================== warnings summary ===============================================================../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/core/cite.py:10 /home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/core/cite.py:10: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html import pkg_resources ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py:2871 /home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`. Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages declare_namespace(pkg) ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py:2871 /home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`. Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages declare_namespace(pkg) ../../../../miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/sklearn/utils/multiclass.py:14 /home/biouser/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/sklearn/utils/multiclass.py:14: DeprecationWarning: Please use `spmatrix` from the `scipy.sparse` namespace, the `scipy.sparse.base` namespace is deprecated. from scipy.sparse.base import spmatrix -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html =========================================================== short test summary info ============================================================ERROR tests/test_get_unite.py - TypeError: Missing builtin argument 'ctx', got 'version' !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!========================================================= 4 warnings, 1 error in 2.93s ========================================================= ```

mikerobeson commented 8 months ago

RE:

Where do the tests run and does it have internet access? Should I build these tests to work offline?

Can run them offline by constructing your own very small reference test set. As we do here several times within test_get_gtdb

mikerobeson commented 8 months ago

RE:

TypeError: Missing builtin argument 'ctx', got 'version' what on earth?

As you are registering this as a pipeline, you need to have ctx as a first argument here.

For example, we setup get_gtdb_data as a pipeline, and then add ctx here. The pipeline allows us to use ctx to easily access other actions as we do here.

FYI, for non-pipelines, you'd simply call it a method. See cull_seqs. Then, you do not need ctx in the function.

colinbrislawn commented 8 months ago

Thank you for explaining that to me. I didn't really know how these functions were wired up under the hood.

Do you recommend setting this up as a pipeline or function? I was modeling this after get_silva.py, but in retrospect I like how ncbi.py just works with the data directly without ever importing. I also like the mypy typing in ncbi.py.

mikerobeson commented 8 months ago

Do you recommend setting this up as a pipeline or function? I was modeling this after get_silva.py, but in retrospect I like how ncbi.py just works with the data directly without ever importing. I also like the mypy typing in ncbi.py.

Well, unless you are accessing other plugin actions, as we do in get_data.py & get_gtdb_data.py, there is no need for it to be a pipeline. In fact, that is why it is called a pipeline, as you can run many other actions in QIIME 2, for example think of the align_to_tree_mafft_iqtree pipeline.

colinbrislawn commented 8 months ago

Okay! My intention was for this to be a minimal and stand-alone function, so I'll take out the pipeline stuff! (Good to know about pipelines for later, though!)

colinbrislawn commented 8 months ago

What version of Qiime2 should I target when testing? I'm using 2023.7 and I think it's causing errors:

=================================================== short test summary info ===================================================ERROR rescript/tests/test_cross_validate.py - AttributeError: module 'qiime2.core.type' has no attribute 'Artifact'
ERROR rescript/tests/test_derep.py - AttributeError: module 'qiime2.core.type' has no attribute 'Artifact'
ERROR rescript/tests/test_evaluate.py - AttributeError: module 'qiime2.core.type' has no attribute 'Artifact'
ERROR rescript/tests/test_filter_length.py - AttributeError: module 'qiime2.core.type' has no attribute 'Artifact'
ERROR rescript/tests/test_get_data.py - AttributeError: module 'qiime2.core.type' has no attribute 'Artifact'
ERROR rescript/tests/test_get_gtdb.py - AttributeError: module 'qiime2.core.type' has no attribute 'Artifact'
ERROR rescript/tests/test_get_unite.py - AttributeError: module 'qiime2.core.type' has no attribute 'Artifact'
ERROR rescript/tests/test_merge.py - AttributeError: module 'qiime2.core.type' has no attribute 'Artifact'
ERROR rescript/tests/test_ncbi.py - AttributeError: module 'qiime2.core.type' has no attribute 'Artifact'
ERROR rescript/tests/test_orient.py - AttributeError: module 'qiime2.core.type' has no attribute 'Artifact'
ERROR rescript/types/tests/test_methods.py - AttributeError: module 'qiime2.core.type' has no attribute 'Artifact'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 11 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Also, coverage is not working for me

py.test --cov=rescript
ERROR: usage: py.test [options] [file_or_dir] [file_or_dir] [...]
py.test: error: unrecognized arguments: --cov=rescript
  inifile: None
  rootdir: /home/biouser/bin/github-repos/RESCRIPt

mikerobeson commented 8 months ago

It'd probably be best to install into the latest QIIME environment like this:

conda activate qiime2-amplicon-2023.9

conda install -c conda-forge -c bioconda -c qiime2 \
-c https://packages.qiime2.org/qiime2/2023.9/shotgun/released/  \
-c defaults   xmltodict 'q2-types-genomics>2023.5' ncbi-datasets-pylib

pip install <path to your specific branch>

qiime dev refresh-cache

qiime rescript --help

Although you can test all of RESCRIPt, I often just test the files, or specific functions like this:

python -m pytest rescript/tests/test_edit_taxonomy.py

or

python -m pytest q2_phylogeny/tests/test_iqtree.py::IqtreeTests::test_iqtree_ultrafast_bootstrap_singlebranch_methods

colinbrislawn commented 8 months ago

Thanks to your continuous support over the last year, I now have a fully working draft of this PR. 🙇‍♂️

In review we can

consider how I choose to divide functionality between function
improve testing in test_unite_get_artifacts()
improve test coverage of exceptions
improve test coverage

Name                         Stmts   Miss  Cover
get_unite.py                    57      6    89%
tests/test_get_unite.py         35      2    94%

All feedback is welcome.

colinbrislawn commented 8 months ago

Current version is passing all tests with 100% coverage. Thank you for all your help, Mike!

Nick, I'm ready for your feedback!

---------- coverage: platform linux, python 3.8.15-final-0 -----------
Name                                             Stmts   Miss  Cover
--------------------------------------------------------------------
__init__.py                                          5      0   100%
_utilities.py                                       59      1    98%
_version.py                                        279    154    45%
cross_validate.py                                  158      0   100%
degap.py                                             9      0   100%
dereplicate.py                                      74      0   100%
edit_taxonomy.py                                    31      1    97%
evaluate.py                                        161     14    91%
extract_seq_segments.py                             11      1    91%
filter_length.py                                    90      0   100%
get_data.py                                         81      9    89%
get_gtdb.py                                         73     33    55%
get_unite.py                                        57      0   100%
merge.py                                            41      0   100%
ncbi.py                                            304     29    90%
ncbi_datasets.py                                    79      2    97%
orient.py                                           21      0   100%
parse_silva_taxonomy.py                             80      0   100%
plugin_setup.py                                     90      0   100%
screenseq.py                                        28      0   100%
subsample.py                                         9      0   100%
tests/__init__.py                                    0      0   100%
tests/test_cross_validate.py                       102      0   100%
tests/test_degap.py                                 21      0   100%
tests/test_derep.py                                133      0   100%
tests/test_edit_taxonomy.py                         71      0   100%
tests/test_evaluate.py                              84      0   100%
tests/test_extract_seq_segments.py                  21      0   100%
tests/test_filter_length.py                        170      0   100%
tests/test_get_data.py                              41      6    85%
tests/test_get_gtdb.py                              51      2    96%
tests/test_get_unite.py                             38      0   100%
tests/test_merge.py                                103      0   100%
tests/test_ncbi.py                                 172     21    88%
tests/test_ncbi_datasets.py                        103      0   100%
tests/test_orient.py                                39      0   100%
tests/test_parse_silva_taxonomy.py                 138      0   100%
tests/test_screenseq.py                             38      0   100%
tests/test_subsample.py                             23      0   100%
tests/test_trim_alignment.py                       134      0   100%
trim_alignment.py                                   68      3    96%
types/__init__.py                                    3      0   100%
types/_format.py                                    55      0   100%
types/_transformer.py                               47      0   100%
types/_type.py                                       4      0   100%
types/methods.py                                     5      0   100%
types/tests/__init__.py                              0      0   100%
types/tests/test_methods.py                         22      0   100%
types/tests/test_types_formats_transformers.py     159      0   100%
--------------------------------------------------------------------
TOTAL                                             3585    276    92%

colinbrislawn commented 8 months ago

I remembered to test but forgot to lint. 😞 Flake8 and Black both pass now

nbokulich commented 8 months ago

thanks @colinbrislawn ! I have a few small changes to request, otherwise looks good!

colinbrislawn commented 8 months ago

@nbokulich I'm ready for more feedback!

nbokulich commented 8 months ago

Hi @colinbrislawn ,

the edits look good, thanks! I have done some user testing now and it is working like a charm 😉 . However, it looks like there is a problem with version 8.2, which fails to download with all confugurations, due to lowercase characters in the sequences. An easy fix would be to just drop version 8.2 as an option... but this issue could theoretically come up again in the future so this could be an opportunity to already fix this issue. I think that it should be as simple as using MixedCaseDNAFASTAFormat instead of DNAFASTAFormat, but I have not worked with that format before so cannot guarantee. Fortunately, we have the global expert on that format with us, @mikerobeson, in case you run into issues making that switch 😉

I am a little suprised that I did not get this same error with versions 8.3 or 9.0, as I thought that these mixed-case DNA sequences were a common feature of all UNITE releases that contain data pulled from INSDC... but maybe this is something that they have started mofifying on their own more recently? Do you happen to know, @colinbrislawn ?

colinbrislawn commented 8 months ago

I'm glad it's working for you, Nick!

I am a little surprised that I did not get this same error with versions 8.3 or 9.0, as I thought that these mixed-case DNA sequences were a common feature of all UNITE releases that contain data pulled from INSDC

The newest releases have had only normal ([ATCG]) DNA, so I suppose that fixes it.

as simple as using MixedCaseDNAFASTAFormat

I had this implemented at one point. This PR is so old I've introduced regressions 😿

I'll switch over to this importing model and retest.

colinbrislawn commented 8 months ago

Done. Using MixedCaseDNAFASTAFormat is a little slower when not needed, but it works with 8.2.

I want to limit scope creep, so here's some cool things we shouldn't do (right now)

For specific database versions, switch to DNAFASTAFormat because it's faster
Support older databases by automatically handling .zip files
Refactor UNITE_DOIS from a table to data.frame and add more db metadata to it

nbokulich commented 8 months ago

thanks @colinbrislawn ! if this causes a significant slowdown and uppercase is the norm from 8.3 onward, then maybe we just drop support for 8.2 and go back to DNAFASTAFormat? Or if the switch was at 8.3 then it would also be easy to just specify the format with a conditional.

But otherwise I am happy if you want to just open an issue to save that idea for a rainy day 😁

by the way, looks like the linter is failing now.

colinbrislawn commented 8 months ago

I have gotta get some pre-commit hooks added

edit: done. I'll consider opening a new issue for more linting and pre-commit hooks

mikerobeson commented 8 months ago

Hi @colinbrislawn & @nbokulich. I think I might have suggested to go with DNAFASTAFormat, as I thought all the recent DBs where upper case. I am not too concerned about runtime, as most users will just have to run this once. But, I suppose we can add a try statement to import as DNAFASTAFormat, and if that fails, use MixedCaseDNAFASTAFormat? This way there is no need to explicitly define a format for each database version? This might also help future proof, in case UNITE goes back to mixed case. Or some other conditional? 🤷 But as Nick said, it might be good enough to not include ver 8.2 for now, and record this as an issue to be fixed for a rainy day and stick with DNAFASTAFormat. But I am fine if we just want to leave it as MixedCaseDNAFASTAFormat and call it good. :-)

colinbrislawn commented 8 months ago

record this as an issue to be fixed for a rainy day. :-)

I like this scope. We can build it when we need it! 🛠️

colinbrislawn commented 7 months ago

Before we merge, we should pick which test of _unite_get_tgz() to use.

mikerobeson commented 7 months ago

I just re-ran and tested the latest version of your PR. it all looks good! Actually, I think the download and parsing of the data is quite "fast" despite using MixedCaseDNAFASTAFormat. It took ~3 minutes using --p-version 9.0 --p-cluster-id dynamic --p-singletons. So, in light of that, I think we should just keep everything as you have it, using MixedCaseDNAFASTAFormat.

This looks ready to go, unless there is something I'm missing.

nbokulich commented 7 months ago

looks good to me too, thanks @colinbrislawn and @mikerobeson ! @mikerobeson please merge once you are ready (and the CI tests check out).

mikerobeson commented 7 months ago

Hey @colinbrislawn Perhaps you can write up a short Community Contribution similar to these:

GTDB
RDP

nbokulich commented 7 months ago

Yeah a short community tutorial would be great! @colinbrislawn is this something that you would be interested in putting together? This could be planned to coincide with the next QIIME 2 release (next month I think). I would also be happy to put one together if you and @mikerobeson are not interested, but I don't want to steal your thunder 😉

I put together a few commands when testing out your new action in case you want to use these. I found that in version 9.0 the accession IDs at included in the taxonomy string (something like kFungi;...;sspecies;sh__R12345). This is fine for the alignment-based classifiers, but really slows down the process of training a Naive Bayes classifier (and hitting a single accession is highly unlikely so this is a waste of time) so the accession IDs should be removed from the taxonomy string. Additionally, I would recommend removing sequences that are not classified to at least class level.

qiime rescript get-unite-data --output-dir unite-v9.0-fungi-dynamic-singletonsFalse --p-version 9.0 --p-cluster-id dynamic --p-singletons False --p-taxon-group Fungi

qiime taxa filter-seqs --p-exclude Fungi_sp,mycota_sp,mycetes_sp --i-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy.qza --i-sequences unite-v9.0-fungi-dynamic-singletonsFalse/sequences.qza --o-filtered-sequences unite-v9.0-fungi-dynamic-singletonsFalse/sequences-filtered.qza

qiime rescript edit-taxonomy --i-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy.qza --o-edited-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy-no-SH.qza --p-search-strings ';sh__.*' --p-replacement-strings '' --p-use-regex

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads unite-v9.0-fungi-dynamic-singletonsFalse/sequences-filtered.qza --i-reference-taxonomy unite-v9.0-fungi-dynamic-singletonsFalse/taxonomy-no-SH.qza --o-classifier unite-v9.0-fungi-dynamic-singletonsFalse/classifier.qza

mikerobeson commented 7 months ago

That is awesome @nbokulich! That is a great point about the accession labels! Makes me even happier for edit-taxonomy. :-)

But yeah, just having a simple explanation at each step would be great! For example:

1) bring up the importance of using eukaryotes to ensure we have outgroups for better classification of fungi / not-fungi. 2) simply mention amplicon region extraction sub-examples (as I've done for GTDB, and RDP tutorial), just point to the other RESCRIPt tutorials on how to do this. 3) Be sure to list the appropriate UNITE and RESCRIPt bibliography.

I've not worked with ITS data in a while, so I'm happy to leave it to you both. I'll Obviously, send any comments and revision ideas your way. :-)

bokulich-lab / RESCRIPt

get_unite_data functions #134