bokulich-lab / q2-types-genomics

QIIME 2 types for genomics plugins.
BSD 3-Clause "New" or "Revised" License

ENH: Update `NCBITaxonomyDirFmt` to accommodate data-version file #73

Closed Sann5 closed 10 months ago

Sann5 commented 10 months ago

About this repo

What's new

Set up an environment

# For linux: 
# export MY_OS="linux"
# For mac:
export MY_OS="osx" 
wget "https://data.qiime2.org/distro/shotgun/qiime2-shotgun-2023.9-py38-${MY_OS}-conda.yml"
conda env create -n q2-shotgun --file "qiime2-shotgun-2023.9-py38-${MY_OS}-conda.yml"
rm "qiime2-shotgun-2023.9-py38-${MY_OS}-conda.yml"
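
If you would rather not set `MY_OS` by hand, the label can be derived from `uname -s`. This is only a sketch: `detect_my_os` is a hypothetical helper name, not something shipped with QIIME 2, and it assumes `uname -s` prints `Linux` or `Darwin` on the two supported platforms.

```bash
# Hypothetical helper: map a kernel name (default: the output of `uname -s`)
# to the OS label used in the environment file name.
detect_my_os() {
    case "${1:-$(uname -s)}" in
        Linux)  echo "linux" ;;
        Darwin) echo "osx" ;;
        *)      echo "unsupported OS: ${1:-$(uname -s)}" >&2; return 1 ;;
    esac
}

MY_OS="$(detect_my_os)"
export MY_OS
# Show the environment file name the commands above would use.
echo "qiime2-shotgun-2023.9-py38-${MY_OS}-conda.yml"
```

The optional argument exists only so the mapping can be checked without switching machines, e.g. `detect_my_os Darwin`.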

Run it locally

  1. First, clone the repo and check out the PR branch:

    # Remove the packaged q2-types-genomics and q2-types so you can install the local and GitHub versions.
    conda activate q2-shotgun
    conda remove q2-types-genomics q2-types
    pip install git+https://github.com/qiime2/q2-types.git
    git clone git@github.com:bokulich-lab/q2-types-genomics.git
    cd q2-types-genomics
    gh pr checkout 73
    pip install -e .
  2. Let's get you some data to play with:

    cd wherever_you_want_to_download_the_data_to

The next commands will download ~15 GB of data.


# Download *.dmp
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip -j taxdmp.zip names.dmp nodes.dmp -d ncbi_tax_data
rm taxdmp.zip

# Download prot.accession2taxid.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz -P ncbi_tax_data

# Make the version.tsv file (note: `ls -D` with a date format is BSD/macOS-specific)
echo -e "file_name\tdate\ttime" > ncbi_tax_data/version.tsv
ls -l -D "%d/%m/%Y %H:%M:%S" ncbi_tax_data | awk '{print $8, $6, $7}' | grep -E '(nodes.dmp|names.dmp|prot.accession2taxid.gz)' | tr ' ' '\t' >> ncbi_tax_data/version.tsv
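
The `ls -l -D <format>` call used for version.tsv only works with BSD/macOS `ls`; in GNU `ls`, `-D` means `--dired`. On Linux, something along these lines should produce the same three columns. It is a sketch that assumes GNU `date`, whose `-r FILE` option prints a file's last modification time; `write_version_tsv` is a hypothetical helper name.

```bash
# Hypothetical helper: write <dir>/version.tsv with one row per data file.
# Assumes GNU date (`date -r FILE` reports FILE's modification time).
write_version_tsv() {
    local dir="${1:-ncbi_tax_data}"
    local out="$dir/version.tsv"
    local name
    printf 'file_name\tdate\ttime\n' > "$out"
    for name in names.dmp nodes.dmp prot.accession2taxid.gz; do
        # Skip files that have not been downloaded yet.
        [ -f "$dir/$name" ] || continue
        # %t is a tab, so date and time land in separate columns.
        printf '%s\t%s\n' "$name" "$(date -r "$dir/$name" '+%d/%m/%Y%t%H:%M:%S')" >> "$out"
    done
}

# Only run when the data directory from the previous steps exists.
if [ -d ncbi_tax_data ]; then
    write_version_tsv ncbi_tax_data
fi
```

Times are reported in the local time zone, as they are with `ls -l`.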


  3. Test it out!

```bash
qiime tools import --input-path ncbi_tax_data --output-path ncbi_tax_data.qza --type "ReferenceDB[NCBITaxonomy]"
```

Running the tests

pytest -W ignore -vv --pyargs q2_types_genomics

codecov[bot] commented 10 months ago

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (3b993ed) 96.77% compared to head (ac2141c) 96.91%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #73      +/-   ##
==========================================
+ Coverage   96.77%   96.91%   +0.14%
==========================================
  Files          42       42
  Lines        1548     1620      +72
==========================================
+ Hits         1498     1570      +72
  Misses         50       50
```


Sann5 commented 10 months ago

Hey @VinzentRisch, can you give this one a review? Cheers!

VinzentRisch commented 10 months ago

Hey @Sann5, everything looks good to me. The tests run and the data can be imported without any issues. 🎉 I just had some problems with getting the right env. You added some new formats to q2-types and those formats are not in the 2023.9 distribution of QIIME 2, so I had to install q2-types directly from GitHub. And there is a typo in your import command: it should be ReferenceDB[NCBITaxonomy] and not ReferenceDB[TaxonomyNCBI]. But once I figured those two things out everything went smoothly. 😄

Sann5 commented 10 months ago

@VinzentRisch

> I just had some problems with getting the right env. You added some new formats to q2-types and those formats are not in the 2023.9 distribution of QIIME 2, so I had to install q2-types directly from GitHub. And there is a typo in your import command: it should be ReferenceDB[NCBITaxonomy] and not ReferenceDB[TaxonomyNCBI].

Crap! Thank you for checking and thanks for the review :). I'll update the PR message accordingly.

Sann5 commented 10 months ago

@misialq do you want to take a quick look before I squash-merge it?

misialq commented 10 months ago

Yup, thanks, I'll check it out and ping you 🙌

Sann5 commented 10 months ago

We have decided not to go forward with this extension of the semantic type, for two reasons: the files (names.dmp and nodes.dmp) are updated very frequently, and the last-modified date is already contained in the artifact, so there is no need to create a separate file for it.

However, this branch will be pushed upstream just in case we wish to recycle some of the code further on.