Sidduppal closed this pull request 1 year ago.
nf-core lint overall result: Passed :white_check_mark: :warning:
Posted for pipeline commit 7f86c81
| ✅ 62 tests passed |
| ❔ 34 tests were ignored |
| ❗ 9 tests had warnings |
One thing I noticed while testing the scripts: the NCBI class is instantiated with the variable `dbpath`, while the GTDB class is instantiated with `dbdir`. Nothing is breaking, but it is something to keep in mind as we go ahead; we may want to make the two consistent.
Base: 27.40% // Head: 27.35% // Decreases project coverage by -0.05% :warning:
Coverage data is based on head (54c168c) compared to base (86b9550). Patch coverage: 30.30% of modified lines in pull request are covered.
:umbrella: View full report at Codecov.
Looking at submitting a bioconda recipe for gtdb_to_taxdump. Its README suggests using TaxonKit for retrieving stable taxids across GTDB releases (rather than arbitrarily assigning them, as noted in the gtdb_to_taxdump summary section).
TaxonKit is available via bioconda, so it would be easy to include in autometa-env.yml, and the documentation for generating these taxdump files from the GTDB looks straightforward, as outlined in the GTDB-taxdump repo.
TaxonKit conda page: https://anaconda.org/bioconda/taxonkit
GTDB-taxdump page (using TaxonKit): https://github.com/shenwei356/gtdb-taxdump
NOTE: TaxonKit v0.12 or greater should be used. (ref)
The taxdump files may also be downloaded directly from the releases page (https://github.com/shenwei356/gtdb-taxdump/releases), which reduces the compute requirements.
That being said, if we would like to generate the GTDB taxdump files ourselves with TaxonKit, the steps are outlined here: https://github.com/shenwei356/gtdb-taxdump#steps
Looking into the future, this is probably the more appropriate route, as taxids can be relied upon across future GTDB releases.
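For reference, a rough sketch of both routes. This is a hedged sketch only: the release tag and asset name placeholders, the GTDB taxonomy file names, and the `taxonkit create-taxdump` flags are my assumptions and should be checked against the pages linked above.
# Option 1: download a pre-built taxdump from the gtdb-taxdump releases page.
# <tag> and <asset> are hypothetical placeholders; use an actual release listed there.
wget https://github.com/shenwei356/gtdb-taxdump/releases/download/<tag>/<asset>.tar.gz
tar -xzf <asset>.tar.gz
# Option 2: generate the taxdump locally with TaxonKit (>= v0.12, which introduced
# the create-taxdump subcommand), following the gtdb-taxdump "steps" section.
# Input file names assume the GTDB R207 taxonomy tables; flag names are assumptions.
taxonkit create-taxdump \
--gtdb \
--out-dir gtdb-taxdump/R207 \
ar53_taxonomy_r207.tsv.gz bac120_taxonomy_r207.tsv.gz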
P.S. there is also a related project linked (https://github.com/shenwei356/ictv-taxdump) that generates an NCBI-style taxdump for viruses and may be useful in the future (@kaw97, is this of any interest to you?)
Not sure if this affects what y'all are doing: https://github.com/shenwei356/gtdb-taxdump/issues/2
Only `merged.dmp` and `delnodes.dmp` are affected. If you just need to use the taxonomy data of the current version, say R207, don't worry.
Thanks @shenwei356
@WiscEvan The scripts have been updated to use taxonkit's already generated files. I decided to use `merged.dmp` and `delnodes.dmp` as they were provided with the taxdump, although if you think the issue mentioned above by @chasemc would significantly skew the results, we can remove their usage. Here are the commands for testing:
autometa-config \
--section databases \
--option gtdb \
--value /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test
autometa-update-databases --update-gtdb
autometa-taxonomy-lca \
--blast 78mbp_metagenome.blastp.gtdb.tsv \
--dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test \
--dbtype gtdb \
--sseqid2taxid-output 78Mbp_sseqid2taxid_test.tsv \
--lca-error-taxids 78Mbp_lcaErrorTaxids_test.tsv \
--verbose \
--lca-output 78Mbp_LCAout_test.tsv
autometa-taxonomy-majority-vote \
--lca 78Mbp_LCAout_test.tsv \
--output 78Mbp_gtdb_majority_vote.tsv \
--dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test \
--verbose \
--dbtype gtdb
autometa-taxonomy \
--votes 78Mbp_gtdb_majority_vote.tsv \
--assembly /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna \
--output testTaxonomy \
--split-rank-and-write superkingdom \
--dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test \
--dbtype gtdb
autometa-binning-summary \
--binning-main 78_binningMain_gtdb.tsv \
--markers /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.markers.tsv \
--dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test \
--dbtype gtdb \
--output-stats binningSummartStats.tsv \
--output-taxonomy binningTaxa.tsv \
--output-metabins metaBins \
--metagenome /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna
By using these files we don't need to add any additional dependency either. Let me know what you think.
The tests and CI/CD still need to be resolved, but I think we're almost there.
Addressed all the comments.
🐛 🛠️ I'm not sure how you encountered the error below. Sam and I implemented a `set -x` / `{ set +x; } 2>/dev/null` routine before and after running each module (https://github.com/KwanLab/Autometa/blob/gtdb_to_autometa/workflows/autometa_flagged.sh) that should allow easier inspection of a user's parameter configurations without them sending the entire submit file.
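For reference, a minimal sketch of that routine (the wrapped module and the shell variable names below are placeholders; the actual wrapper is in the autometa_flagged.sh script linked above):
# `set -x` echoes each command with its expanded parameters before it runs;
# `{ set +x; } 2>/dev/null` turns tracing back off without printing the `set +x` call itself.
set -x
autometa-taxonomy-lca \
--blast "${blast}" \
--dbdir "${dbdir}" \
--dbtype gtdb \
--lca-output "${lca_output}"
{ set +x; } 2>/dev/null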
Still getting the following error when running autometa-large-data-mode-gtdb.sh. I'm getting the error during the large-data-mode binning (lines).
The `set -x` / `{ set +x; } 2>/dev/null` routine was only added to autometa_flagged.sh and no other script; I'm running the large-data-mode workflow, which does not have the above-stated routine. The bug seems to be with the large-data-mode implementation, as all the other workflows are running fine.
[10/16/2022 04:41:27 PM DEBUG] autometa.common.kmers: umap: 10 data points and 10 dimensions
[10/16/2022 04:41:27 PM DEBUG] autometa.common.kmers: Performing embedding with umap (seed 42)
/home/sidd/miniconda3/envs/autometa_aims/lib/python3.9/site-packages/umap/umap_.py:2344: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
warn(
Traceback (most recent call last):
File "/home/sidd/miniconda3/envs/autometa_aims/bin/autometa-binning-ldm", line 33, in <module>
sys.exit(load_entry_point('Autometa==2.1.0', 'console_scripts', 'autometa-binning-ldm')())
File "/home/sidd/miniconda3/envs/autometa_aims/lib/python3.9/site-packages/Autometa-2.1.0-py3.9.egg/autometa/binning/large_data_mode.py", line 831, in main
main_out = cluster_by_taxon_partitioning(
File "/home/sidd/miniconda3/envs/autometa_aims/lib/python3.9/site-packages/Autometa-2.1.0-py3.9.egg/autometa/binning/large_data_mode.py", line 441, in cluster_by_taxon_partitioning
rank_embedding = get_kmer_embedding(
File "/home/sidd/miniconda3/envs/autometa_aims/lib/python3.9/site-packages/Autometa-2.1.0-py3.9.egg/autometa/binning/large_data_mode.py", line 112, in get_kmer_embedding
embedding.to_csv(cache_fpath, sep="\t", index=True, header=True)
File "/home/sidd/miniconda3/envs/autometa_aims/lib/python3.9/site-packages/pandas/core/generic.py", line 3551, in to_csv
return DataFrameRenderer(formatter).to_csv(
File "/home/sidd/miniconda3/envs/autometa_aims/lib/python3.9/site-packages/pandas/io/formats/format.py", line 1180, in to_csv
csv_formatter.save()
File "/home/sidd/miniconda3/envs/autometa_aims/lib/python3.9/site-packages/pandas/io/formats/csvs.py", line 241, in save
with get_handle(
File "/home/sidd/miniconda3/envs/autometa_aims/lib/python3.9/site-packages/pandas/io/common.py", line 694, in get_handle
check_parent_directory(str(handle))
File "/home/sidd/miniconda3/envs/autometa_aims/lib/python3.9/site-packages/pandas/io/common.py", line 568, in check_parent_directory
raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
OSError: Cannot save file into a non-existent directory: '/media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonkit/78mbp_metagenome_Autometa_Output3/78mbp_metagenome_bacteria_cache/species'
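Judging from the last frames of the traceback, get_kmer_embedding hands pandas' to_csv a cache path whose parent directory was never created, so pandas raises the OSError. A quick way to check whether that is the only problem (a workaround sketch, not a fix; the path is copied verbatim from the error above, and the real fix would be to create the cache directory inside large_data_mode.py before writing):
# Pre-create the missing per-rank cache directory, then re-run the
# large-data-mode workflow and see whether it proceeds past this step.
mkdir -p /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonkit/78mbp_metagenome_Autometa_Output3/78mbp_metagenome_bacteria_cache/species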
- `autometa-setup-gtdb` entrypoint for creating compatible GTDB databases
- `gtdb_to_taxdump` is now installed
- `TaxonomyDatabase` class
Commands to replicate:
Test GTDB
- Running `autometa-setup-gtdb` entrypoint
- Running `autometa-taxonomy-lca` entrypoint
- Running `autometa-taxonomy-majority-vote` entrypoint
- Running `autometa-taxonomy` entrypoint
- Running `autometa-summary` entrypoint
Test NCBI
- Running `autometa-taxonomy-lca` entrypoint
- Running `autometa-taxonomy-majority-vote` entrypoint
- Running `autometa-taxonomy` entrypoint
- Running `autometa-summary` entrypoint
TODO:
- `gtdb.py`
PR checklist