bhattlab / phanta

Workflow to rapidly quantify taxa from all domains of life, directly from short-read human gut metagenomes
MIT License

Testing installation using demo #17

Closed hgingras closed 1 year ago

hgingras commented 1 year ago

I have installed the requirements to run the demo files for Phanta analysis. I run on clusters without conda.


Here is my slurm output:

... -------- ((modules load and virtualenv prep) all good...) -------
Followed by:
Building DAG of jobs...
MissingRuleException: No rule to produce / (if you use input functions make sure that they don't raise unexpected exceptions).
sample19 /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/sample19_R1.fastq.gz /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/sample19_R2.fastq.gz
sample18 /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/sample18_R1.fastq.gz /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/sample18_R2.fastq.gz


Here is my submission script:

#!/usr/bin/env bash
#SBATCH --account=def-helene
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --time=0-02:00
#SBATCH --mem=32GB

module load StdEnv/2020 python/3.10 gcc/9.3.0 r/4.2.2 kraken2/2.1.2 bracken/2.6.0

virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index --upgrade pip
pip install --no-index -r requirements.txt

snakemake -s /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/Snakefile \
    --configfile /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/testing/config_test.yaml \
    --jobs 99 --cores ${SLURM_CPUS_PER_TASK-1} \
    /


Here is the requirements.txt obtained when installing these wheels in my virtualenv:

pip install --no-index --upgrade pip
pip install --no-index pandas==2.0.0
pip install --no-index numpy==1.24.2
pip install --no-index snakemake==7.3.8


appdirs==1.4.4+computecanada
attrs==23.1.0+computecanada
certifi==2022.12.7+computecanada
charset_normalizer==3.1.0+computecanada
ConfigArgParse==1.5.3+computecanada
connection_pool==0.0.3+computecanada
datrie==0.8.2+computecanada
decorator==5.1.1+computecanada
docutils==0.19+computecanada
dpath==2.0.5+computecanada
fastjsonschema==2.16.3+computecanada
gitdb==4.0.10+computecanada
GitPython==3.1.31+computecanada
idna==3.4+computecanada
Jinja2==3.1.2+computecanada
jsonschema==4.17.3+computecanada
jupyter_core==5.2.0+computecanada
MarkupSafe==2.1.2+computecanada
nbformat==5.7.3+computecanada
numpy==1.24.2+computecanada
pandas==2.0.0+computecanada
plac==1.3.5+computecanada
platformdirs==3.2.0+computecanada
psutil==5.9.4+computecanada
PuLP==2.6.0+computecanada
py==1.11.0+computecanada
pyrsistent==0.19.3+computecanada
python-dateutil==2.8.2+computecanada
pytz==2023.3+computecanada
PyYAML==6.0+computecanada
ratelimiter==1.2.0.post0+computecanada
requests==2.28.2+computecanada
retry==0.9.2+computecanada
six==1.16.0+computecanada
smart_open==6.3.0+computecanada
smmap==5.0.0+computecanada
snakemake==7.3.8+computecanada
stopit==1.1.2+computecanada
tabulate==0.9.0+computecanada
toposort==1.7+computecanada
traitlets==5.9.0+computecanada
tzdata==2023.3+computecanada
urllib3==1.26.15+computecanada
wrapt==1.15.0+computecanada
yte==1.5.1+computecanada


Here is the config_test.yaml with only the lines with changes:

pipeline_directory: /home/helene/projects/def-helene/helene/Tickets/0194347/phanta

sample_file: /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/testing/samp_file.txt

outdir: /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset

database: /home/helene/projects/def-helene/helene/Tickets/0194347/phanta_dbs/unmasked_db_v1


I did not modify the Snakefile.


I would appreciate it if you could tell me what I am doing wrong.

Best regards,

Hélène

meenachakra commented 1 year ago

Hi! Thanks for your interest in Phanta! I think you have an extra slash at the end of your snakemake command?
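For anyone wondering why a stray trailing slash matters, here is a minimal sketch that splits the submitted command into tokens roughly the way a shell would, using Python's standard `shlex` module. The last token comes out as `/`, which would reach Snakemake as a positional target to build, consistent with the "No rule to produce /" message. The command text is copied from the submission script in this thread; `shell_tokens` is a hypothetical helper name, and the shell variable is left unexpanded.

```python
import shlex


def shell_tokens(command: str) -> list:
    """Split a shell command into tokens.

    Backslash-newline continuations are joined into spaces first,
    because shlex does not treat them as line continuations.
    """
    joined = command.replace("\\\n", " ")
    return shlex.split(joined)


# Command copied from the submission script (variable left unexpanded).
COMMAND = """snakemake -s /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/Snakefile \\
    --configfile /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/testing/config_test.yaml \\
    --jobs 99 --cores ${SLURM_CPUS_PER_TASK-1} \\
    /"""

tokens = shell_tokens(COMMAND)
# The final token is the lone slash, not part of any option.
print(tokens[-1])
```

Dropping the final `\` and `/` from the command leaves no positional target, so Snakemake falls back to its default target rule.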

hgingras commented 1 year ago

Hi, I have adjusted my command line so there is no confusion, thanks. It is now running the kraken rule for both sample 18 and sample 19 (here is an example of the output for sample 19):


rule kraken:
[Mon Jun 19 10:07:07 2023]
Finished job 10.
2 of 21 steps (10%) done
Select jobs to execute...


The next rule in line is filter_kraken.

For this one I get what looks like an error about missing output files:

[Mon Jun 19 10:07:07 2023]
rule filter_kraken:
    input: /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/classification/sample18.krak.report
    output: /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/classification/sample18.krak.report.species, /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/classification/sample18.krak.report.filtered, /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/classification/sample18.krak.report.filtering_decisions.txt
    jobid: 9
    reason: Missing output files: /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/classification/sample18.krak.report.filtered, /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/classification/sample18.krak.report.filtering_decisions.txt; Input files updated by another job: /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/classification/sample18.krak.report
    wildcards: samp=sample18
    resources: tmpdir=/tmp

The thing is that the file sample18.krak.report.filtering_decisions.txt is actually found in this location:

/home/helene/projects/def-helene/helene/Tickets/0194347/phanta/testing/classification/intermediate

sample18.krak.report
sample18.krak.report.filtered
sample18.krak.report.filtered.bracken
sample18.krak.report.filtering_decisions.txt
sample19.krak.report
sample19.krak.report.filtered
sample19.krak.report.filtered.bracken
sample19.krak.report.filtering_decisions.txt

Files in: /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/testing/classification

sample18.krak.report
sample18.krak.report.filtered.bracken.scaled
sample18.krak.report_bracken_species.filtered
sample19.krak.report
sample19.krak.report.filtered.bracken.scaled
sample19.krak.report_bracken_species.filtered
samples_that_failed_bracken.txt

Files in: /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/classification

sample18.krak
sample18.krak.report
sample19.krak
sample19.krak.report

Is there something I did wrong in the config_test.yaml? See previous note for the set up.

Best regards,

Hélène

hgingras commented 1 year ago

Also, just so you know, when I run the pipeline with snakemake 7.3.8, as recommended in the troubleshooting environment creation section, I get this error:

File "/lustre03/project/6078354/helene/Tickets/0194347/Phanta_ENV_2/lib/python3.10/site-packages/snakemake/rules.py", line 1215, in __eq__
    return self.name == other.name and self.output == other.output
AttributeError: 'str' object has no attribute 'name'

When you look at the phanta_env.yaml it is written :

Now I am using snakemake 7.20.0 and I do not have this issue.

You may need to update the troubleshooting environment creation section...

Best regards,

Hélène

meenachakra commented 1 year ago

Hi!
- Thank you for the note about the snakemake version; we will update the section.
- I don't think that's an error for filter_kraken? The "reason" line just states why Snakemake will run the job. Is that right? I.e., did you get it to work?

hgingras commented 1 year ago

Hi Meenachakra,

This is the error I get in rule filter_kraken:

/lustre03/project/6078354/helene/Tickets/0194347/Phanta_ENV_3/lib/python3.10/site-packages/pandas/core/reshape/merge.py:1204: RuntimeWarning: invalid value encountered in cast
  if not (lk == lk.astype(rk.dtype))[~np.isnan(lk)].all():
/lustre03/project/6078354/helene/Tickets/0194347/Phanta_ENV_3/lib/python3.10/site-packages/pandas/core/reshape/merge.py:1204: RuntimeWarning: invalid value encountered in cast
  if not (lk == lk.astype(rk.dtype))[~np.isnan(lk)].all():
Traceback (most recent call last):
  File "/lustre03/project/6078354/helene/Tickets/0194347/Phanta_ENV_3/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'species_level_taxa'

Thanks for your support,

Hélène

hgingras commented 1 year ago

Another question, what is the database required to run the demo? Default database, Prophage masked database or Default database that uses the GTDB taxonomy for bacteria and Archaea?

Thanks,

Hélène

meenachakra commented 1 year ago

Hello Hélène,

The default database is required to run the demo.

What does your kraken output look like - could you paste the first few lines here for each sample? That will help us figure out what's going on. Thank you!

AlexandreBoulay commented 1 year ago

Hello,

For context, Hélène has been helping me install Phanta.

I've tried with a few different environments already and always get the same or a similar error message with the rule filter_kraken as well.

Here are the first few lines of the kraken output files for the test samples (for the same environment described by Hélène):

sample18.krak.report:

0.05 3191 3191 U 0 unclassified
99.95 6588513 224501 R 1 root
93.36 6153712 3985 R1 131567 cellular organisms
93.29 6149722 192068 D 2 Bacteria
54.78 3611150 85105 D1 1783272 Terrabacteria group
45.81 3019883 219982 P 1239 Firmicutes
19.85 1308617 806 C 186801 Clostridia
19.84 1307809 218263 O 186802 Clostridiales
6.85 451499 75952 F 541000 Ruminococcaceae
3.27 215599 38835 G 216851 Faecalibacterium
2.53 166791 90600 S 853 Faecalibacterium prausnitzii
1.10 72289 47112 S1 411483 Faecalibacterium prausnitzii A2-165
0.32 21374 21374 S2 3000290 HumGut_290
0.04 2630 2630 S2 3001562 HumGut_1562
0.00 57 57 S2 3000758 HumGut_758

sample18.krak:

C kraken:taxid|3022049|HumGut_22049_1-2 638301 150|150 186826:20 0:17 186826:1 0:11 186826:3 2:4 186826:21 0:13 186826:2 0:24 |:| 2:19 638301:12 0:9 638301:5 0:23 186826:1 638301:9 0:38 C kraken:taxid|3022049|HumGut_22049_2-2 3022049 150|150 3022049:12 186826:29 3022049:16 186826:8 3022049:12 186826:5 3022049:1 186826:7 1239:1 117563:7 186826:8 3022049:10 |:| 3022049:19 638301:5 3022049:4 638301:1 3022049:5 0:62 638301:10 0:10 C kraken:taxid|3022049|HumGut_22049_3-3 638301 150|150 186826:19 638301:4 0:38 638301:2 0:33 638301:5 0:3 638301:9 0:3 |:| 638301:26 0:10 638301:5 0:35 638301:2 0:16 638301:9 0:2 638301:2 0:9 C kraken:taxid|3022049|HumGut_22049_3-1 638301 150|150 638301:10 186826:1 0:7 186826:5 0:20 638301:39 3022049:11 0:23 |:| 638301:31 0:1 638301:5 0:15 638301:17 186826:7 638301:10 0:30 C kraken:taxid|3022049|HumGut_22049_4-3 638301 150|150 638301:72 186826:4 638301:5 3026396:5 0:22 638301:8 |:| 638301:7 186826:5 0:22 638301:2 0:5 638301:4 0:15 638301:4 0:20 638301:5 0:9 638301:11 0:7 C kraken:taxid|3022049|HumGut_22049_4-1 638301 150|150 638301:17 186826:8 638301:35 0:56 |:| 638301:50 186826:33 638301:3 0:30 C kraken:taxid|3022049|HumGut_22049_5-11 186826 150|150 2:9 1783272:4 186826:35 2:4 186826:54 0:9 638301:1 |:| 186826:1 0:15 2:1 0:2 186826:10 2:1 186826:23 638301:8 0:5 186826:3 638301:2 0:45 C kraken:taxid|3022049|HumGut_22049_5-9 186826 150|150 2:14 1:7 2:19 0:9 186826:5 0:3 186826:5 0:18 1:2 0:8 186826:17 0:9 |:| 186826:25 0:1 186826:5 0:23 638301:10 186826:3 638301:5 0:13 186826:2 0:5 186826:4 0:7 2:1 186826:10 2:2 C kraken:taxid|3022049|HumGut_22049_5-7 186826 150|150 2:8 1:5 186826:29 2:16 1:8 2:5 1:8 0:8 1:5 0:7 1:2 2:5 0:3 2:7 |:| 0:11 1:9 2:26 186826:7 0:63 C kraken:taxid|3022049|HumGut_22049_5-5 186826 150|150 1:2 2:26 186826:25 0:9 186826:2 0:42 1:4 0:6 |:| 0:11 186826:5 0:11 1:7 2:9 1:8 2:5 1:8 0:1 1:7 0:44 C kraken:taxid|3022049|HumGut_22049_5-3 2 150|150 1:51 0:18 1:1 0:3 1:5 0:5 1:1 0:1 1:7 2:9 1:3 2:1 1:11 |:| 0:3 2:5 0:5 2:3 1:5 2:11 
1:13 0:16 2:2 0:31 3011055:1 1:11 2:10 C kraken:taxid|3022049|HumGut_22049_5-1 638301 150|150 2:2 186826:16 638301:6 186826:2 638301:10 186826:5 638301:5 0:5 186826:5 0:18 638301:5 0:3 638301:2 0:19 638301:2 0:11 |:| 0:7 1:3 0:6 2:7 1:1 2:7 0:25 2:1 0:7 186826:6 2:5 186826:6 2:4 0:19 2:5 0:4 2:3 C kraken:taxid|3022049|HumGut_22049_6-280 3022049 150|150 3022049:5 0:27 638301:1 0:5 186826:34 0:5 186826:1 0:38 |:| 186826:2 0:31 638301:5 3022049:1 638301:1 3022049:7 638301:6 3022049:22 638301:5 0:36 C kraken:taxid|3022049|HumGut_22049_6-278 3022049 150|150 3022049:63 1:2 3022049:18 0:5 3022049:3 0:1 3022049:9 0:15 |:| 0:6 3022049:13 0:30 3022049:5 0:1 3022049:2 0:9 3022049:7 0:43 C kraken:taxid|3022049|HumGut_22049_6-276 3022049 150|150 3022049:60 0:5 3022049:3 0:1 3022049:9 0:2 3022049:3 0:33 |:| 3022049:34 0:82

sample19.krak.report:

0.58 38187 38187 U 0 unclassified
99.42 6567731 112023 R 1 root
93.14 6152760 5247 R1 131567 cellular organisms
93.05 6146638 138289 D 2 Bacteria
64.23 4243272 45617 D1 1783272 Terrabacteria group
44.50 2939663 199085 P 1239 Firmicutes
25.51 1685140 2593 C 186801 Clostridia
25.47 1682542 333631 O 186802 Clostridiales
11.77 777759 167129 F 541000 Ruminococcaceae
6.09 402040 152694 G 216851 Faecalibacterium
2.92 193044 130902 S 853 Faecalibacterium prausnitzii
0.72 47245 43158 S1 411483 Faecalibacterium prausnitzii A2-165
0.02 1288 1288 S2 3001513 HumGut_1513
0.01 975 975 S2 3001557 HumGut_1557
0.01 492 492 S2 3000193 HumGut_193

sample19.krak:

C kraken:taxid|3012999|HumGut_12999_3-29 1239 150|150 1239:40 2:1 3028226:5 0:52 1239:5 0:13 |:| 29466:2 1239:2 29466:4 0:5 1239:3 0:5 1239:4 0:71 1239:4 0:6 1239:5 0:5 C kraken:taxid|3012999|HumGut_12999_3-27 29466 150|150 1239:8 29465:5 1239:3 29466:4 29465:10 29466:6 1239:5 29465:1 29466:20 1239:1 29466:5 1239:11 29466:2 1239:8 29466:13 0:14 |:| 29466:5 0:29 29466:1 0:13 29466:13 1239:10 0:3 1239:4 0:38 C kraken:taxid|3012999|HumGut_12999_3-25 29466 150|150 29466:68 0:5 29466:1 0:22 29466:11 0:5 29466:4 |:| 29465:51 0:4 29465:5 0:56 C kraken:taxid|3012999|HumGut_12999_3-23 29466 150|150 29466:23 29465:54 0:38 29465:1 |:| 29466:8 29465:5 29466:3 29465:10 0:15 2:1 0:8 2:2 0:3 2:21 29465:5 2:1 29465:7 0:27 C kraken:taxid|3012999|HumGut_12999_3-21 29465 150|150 2:34 29465:5 2:2 29465:60 0:1 29465:7 0:3 29465:4 |:| 29466:5 0:45 3012999:5 0:2 3012999:7 29465:2 29466:5 29465:8 0:11 29465:5 0:21 C kraken:taxid|3012999|HumGut_12999_3-19 29465 150|150 29465:71 0:5 29465:4 0:10 29465:1 0:5 29465:5 0:15 |:| 29466:9 29465:7 29466:1 29465:5 29466:1 29465:2 0:10 29465:3 0:9 29465:5 0:3 29465:7 0:19 29465:5 0:9 29465:4 0:17 C kraken:taxid|3012999|HumGut_12999_3-17 29465 150|150 29465:8 1239:3 29465:6 1239:9 29465:4 1239:5 29466:3 3013042:2 29465:3 3013042:5 29465:4 0:19 29465:4 0:9 29465:2 0:4 1384081:2 0:5 29465:7 0:12 |:| 29465:38 0:5 1239:5 0:4 29465:2 0:62 C kraken:taxid|3012999|HumGut_12999_3-15 29466 150|150 29466:12 29465:6 29466:41 29465:5 29466:2 0:36 29466:5 0:1 29466:1 0:7 |:| 29466:11 29465:1 29466:8 29465:17 29466:1 29465:2 0:27 29465:1 0:2 29465:4 0:10 29465:4 0:2 29465:1 0:25 C kraken:taxid|3012999|HumGut_12999_3-13 29465 150|150 29465:69 0:36 29466:5 29465:5 0:1 |:| 31977:1 0:20 29465:5 0:5 29465:5 29466:3 0:77 C kraken:taxid|3012999|HumGut_12999_3-11 29466 150|150 29465:21 2:5 29465:31 29466:8 909932:3 29466:5 29465:2 2:5 29465:1 1783272:2 29465:3 29466:5 29465:10 29466:7 29465:6 0:2 |:| 0:23 29465:17 0:76 C kraken:taxid|3012999|HumGut_12999_3-9 29465 150|150 
29465:91 0:25 |:| 0:13 29465:3 0:4 29465:44 0:5 29465:5 0:1 29465:5 0:1 29465:1 0:34 C kraken:taxid|3012999|HumGut_12999_3-7 29466 150|150 29466:39 0:51 29466:5 0:10 29466:11 |:| 2:28 29466:2 0:53 2:3 0:30 C kraken:taxid|3012999|HumGut_12999_3-5 29466 150|150 29466:7 1:1 29466:8 0:9 29466:5 0:3 29466:5 0:5 29466:2 0:5 29466:17 29465:4 0:1 29465:7 0:9 29465:7 0:21 |:| 29466:25 29465:33 2:1 29465:10 29466:9 0:1 29466:5 0:7 29466:1 0:24 C kraken:taxid|3012999|HumGut_12999_3-3 29465 150|150 0:12 29465:48 0:37 29466:3 0:16 |:| 1239:34 0:1 3011747:2 1239:5 0:57 29466:4 0:5 29465:8 C kraken:taxid|3012999|HumGut_12999_3-1 29465 150|150 29465:18 0:37 29465:1 0:3 29465:29 0:11 29465:3 0:7 29465:1 0:6 |:| 0:2 29466:1 0:10 29465:4 0:1 29465:2 0:8 29465:12 29466:13 0:3 29466:5 0:7 29466:3 0:14 29465:3 0:3 29465:1 0:5 29465:1 0:18

Thanks for the help, Alex

meenachakra commented 1 year ago

Thank you for the context and these helpful details! Will think and respond soon.

meenachakra commented 1 year ago

Your krak report results look correct. Looks like line 94 in filter_kraken_reports.py (https://github.com/bhattlab/phanta/blob/main/pipeline_scripts/filter_kraken_reports.py) is not working. It's supposed to be creating a new column called species_level_taxa but instead it's somehow maybe trying to access it.

Looks like you have a higher version of pandas than we used. What happens if you try to use 1.4.3 instead?

hgingras commented 1 year ago

Dear Meenachakra, here is what I have installed in Alex's environment:

pip install --no-index numpy==1.24.2
pip install --no-index pandas~=1.4.0
pip install snakemake==7.20.0

Here are other modules:

python/3.10.2 r/4.2.2 kraken2/2.1.2 bracken/2.6.0

And here is the pip list:

appdirs 1.4.4+computecanada
attrs 21.4.0+computecanada
certifi 2022.12.7+computecanada
charset_normalizer 3.1.0+computecanada
colorama 0.4.6
ConfigArgParse 1.5.3+computecanada
connection_pool 0.0.3+computecanada
datrie 0.8.2+computecanada
docutils 0.19+computecanada
dpath 2.0.5+computecanada
fastjsonschema 2.16.3+computecanada
gitdb 4.0.10+computecanada
GitPython 3.1.31+computecanada
humanfriendly 10.0+computecanada
idna 3.4+computecanada
Jinja2 3.1.2+computecanada
jsonschema 4.17.3+computecanada
jupyter_core 5.2.0+computecanada
MarkupSafe 2.1.2+computecanada
more-itertools 8.14.0
nbformat 5.7.3+computecanada
numpy 1.24.2+computecanada
pandas 1.4.1+computecanada
pip 23.1.2
plac 1.3.5+computecanada
platformdirs 3.2.0+computecanada
psutil 5.9.4+computecanada
PuLP 2.6.0+computecanada
Pygments 2.15.1+computecanada
pyrsistent 0.19.3+computecanada
python-dateutil 2.8.2+computecanada
pytz 2023.3+computecanada
PyYAML 6.0+computecanada
requests 2.28.2+computecanada
reretry 0.11.8
setuptools 67.7.2
six 1.16.0+computecanada
smart_open 6.3.0+computecanada
smmap 5.0.0+computecanada
snakeboost 0.3.0
snakemake 7.20.0
stopit 1.1.2+computecanada
tabulate 0.9.0+computecanada
throttler 1.2.2
toposort 1.7+computecanada
traitlets 5.9.0+computecanada
tzdata 2023.3+computecanada
urllib3 1.26.15+computecanada
wheel 0.40.0
wrapt 1.15.0+computecanada
yte 1.5.1+computecanada

hgingras commented 1 year ago

I just tried again with :

numpy 1.23.2+computecanada pandas 1.4.3+computecanada

Full pip list:

appdirs 1.4.4+computecanada
attrs 21.4.0+computecanada
certifi 2022.12.7+computecanada
charset_normalizer 3.1.0+computecanada
colorama 0.4.6
ConfigArgParse 1.5.3+computecanada
connection_pool 0.0.3+computecanada
datrie 0.8.2+computecanada
docutils 0.19+computecanada
dpath 2.0.5+computecanada
fastjsonschema 2.16.3+computecanada
gitdb 4.0.10+computecanada
GitPython 3.1.31+computecanada
humanfriendly 10.0+computecanada
idna 3.4+computecanada
Jinja2 3.1.2+computecanada
jsonschema 4.17.3+computecanada
jupyter_core 5.2.0+computecanada
MarkupSafe 2.1.2+computecanada
more-itertools 8.14.0
nbformat 5.7.3+computecanada
numpy 1.23.2+computecanada
pandas 1.4.3+computecanada
pip 23.1.2
plac 1.3.5+computecanada
platformdirs 3.2.0+computecanada
psutil 5.9.4+computecanada
PuLP 2.6.0+computecanada
Pygments 2.15.1+computecanada
pyrsistent 0.19.3+computecanada
python-dateutil 2.8.2+computecanada
pytz 2023.3+computecanada
PyYAML 6.0+computecanada
requests 2.28.2+computecanada
reretry 0.11.8
setuptools 67.7.2
six 1.16.0+computecanada
smart_open 6.3.0+computecanada
smmap 5.0.0+computecanada
snakeboost 0.3.0
snakemake 7.20.0
stopit 1.1.2+computecanada
tabulate 0.9.0+computecanada
throttler 1.2.2
toposort 1.7+computecanada
traitlets 5.9.0+computecanada
tzdata 2023.3+computecanada
urllib3 1.26.15+computecanada
wheel 0.40.0
wrapt 1.15.0+computecanada
yte 1.5.1+computecanada

We still have KeyError: 'species_level_taxa'

Traceback (most recent call last):
  File "/lustre03/project/6078354/helene/Tickets/0194347/Phanta_ENV_3/lib/python3.10/site-packages/pandas/core/frame.py", line 3799, in _set_item_mgr
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'species_level_taxa'

meenachakra commented 1 year ago

Hmm, sorry to hear that! Maybe you can try to run the filter_kraken_reports.py script manually on each report, to see if you still get the error? This is the command to run (from the Snakefile):

python {params.repo_dir}/pipeline_scripts/filter_kraken_reports.py {input.krak_report} {params.db} \
    {params.cov_thresh_bacterial} {params.cov_thresh_viral} {params.minimizer_thresh_bacterial} \
    {params.minimizer_thresh_viral} \
    {params.cov_thresh_arc} {params.cov_thresh_euk} {params.minimizer_thresh_arc} \
    {params.minimizer_thresh_euk}

You'll have to replace all the values in brackets - the coverage/minimizer thresholds are the ones specified in your config file.
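As a rough illustration of that substitution, here is a small Python sketch that assembles the argument list in the order given in the Snakefile command above. The dictionary keys simply reuse the `params` names from that command and may not match the key names in your config file; `build_filter_command` is a hypothetical helper, and the paths and threshold values in the example are placeholders to replace with your own.

```python
from pathlib import Path

# Order of the threshold arguments, as listed in the Snakefile command above.
THRESHOLD_ORDER = [
    "cov_thresh_bacterial",
    "cov_thresh_viral",
    "minimizer_thresh_bacterial",
    "minimizer_thresh_viral",
    "cov_thresh_arc",
    "cov_thresh_euk",
    "minimizer_thresh_arc",
    "minimizer_thresh_euk",
]


def build_filter_command(repo_dir, krak_report, db, thresholds):
    """Return the filter_kraken_reports.py command as a list of strings."""
    missing = [key for key in THRESHOLD_ORDER if key not in thresholds]
    if missing:
        raise KeyError(f"missing thresholds: {missing}")
    script = Path(repo_dir) / "pipeline_scripts" / "filter_kraken_reports.py"
    args = [str(thresholds[key]) for key in THRESHOLD_ORDER]
    return ["python", str(script), str(krak_report), str(db)] + args


# Placeholder values only; take the real ones from your config file.
example_thresholds = {key: 0 for key in THRESHOLD_ORDER}
command = build_filter_command(
    "/path/to/phanta",
    "/path/to/sample.krak.report",
    "/path/to/database",
    example_thresholds,
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from inside the same environment that the pipeline uses.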

You should still try in the environment that you're loading.

hgingras commented 1 year ago

Hello Meenachakra,

The script:

python /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/pipeline_scripts/filter_kraken_reports.py /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/test_dataset/classification/sample19.krak.report /home/helene/projects/def-helene/helene/Tickets/0194347/phanta_dbs/unmasked_db_v1 0.01 0.1 0 0 0.01 0 0 0

Error message:

Traceback (most recent call last):
  File "/home/helene/projects/def-helene/helene/Tickets/0194347/phanta/pipeline_scripts/filter_kraken_reports.py", line 94, in <module>
    species_kraken['species_level_taxa'] = species_kraken.apply(lambda x: taxid_to_desired_rank(str(x['ncbi_taxonomy']), 'species', child_parent, taxid_rank), axis=1)
  File "/home/helene/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3968, in __setitem__
    self._set_item_frame_value(key, value)
  File "/home/helene/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 4123, in _set_item_frame_value
    raise ValueError(
ValueError: Cannot set a DataFrame with multiple columns to the single column species_level_taxa

Thanks for your help,

Hélène

meenachakra commented 1 year ago

Interesting! Could you make a copy of the script, replace species_kraken['species_level_taxa'] with test, and then add a line that prints test? Then please run the copy. It seems the apply function is not working for some reason; I'm not sure why.

hgingras commented 1 year ago

Hello Meenachakra,

I found something. If I use phanta/test_dataset/classification/sample19.krak.report, I get the DataFrame error. When I use phanta/testing/classification/sample19.krak.report from the testing folder, I do not get an error. These two files do not have the same number of columns. Does this give you any insight?

From testing/classification/sample19.krak.report we have 8 columns:

0.58 38187 38187 0 0 U 0 unclassified
99.42 6567731 112023 262847837 52674695 R 1 root
93.14 6152760 5247 243733167 51177499 R1 131567 cellular organisms
93.05 6146638 138289 243267882 50791588 D 2 Bacteria
64.23 4243272 45617 168588127 35106668 D1 1783272 Terrabacteria group
44.50 2939663 199085 113516718 26961606 P 1239 Firmicutes
25.51 1685140 2593 59307662 17113555 C 186801 Clostridia
25.47 1682542 333631 58859090 16872209 O 186802 Clostridiales
11.77 777759 167129 23625367 6563956 F 541000 Ruminococcaceae
6.09 402040 152694 10041499 2756044 G 216851 Faecalibacterium
2.92 193044 130902 4201973 1278127 S 853 Faecalibacterium prausnitzii
0.72 47245 43158 981810 347471 S1 411483 Faecalibacterium prausnitzii A2-165

From test_dataset/classification/sample19.krak.report we have 6 columns:

0.58 38187 38187 U 0 unclassified
99.42 6567731 112023 R 1 root
93.14 6152760 5247 R1 131567 cellular organisms
93.05 6146638 138289 D 2 Bacteria
64.23 4243272 45617 D1 1783272 Terrabacteria group
44.50 2939663 199085 P 1239 Firmicutes
25.51 1685140 2593 C 186801 Clostridia
25.47 1682542 333631 O 186802 Clostridiales
11.77 777759 167129 F 541000 Ruminococcaceae
6.09 402040 152694 G 216851 Faecalibacterium
2.92 193044 130902 S 853 Faecalibacterium prausnitzii
0.72 47245 43158 S1 411483 Faecalibacterium prausnitzii A2-165

Best regards,

Helene

hgingras commented 1 year ago

I have been trying to download the other databases but have not been able to. Is anything wrong with the servers? I was using the database suggested in the Quick start section: http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/database_V1.tar.gz

We have 6 columns here. First few lines of the file phanta_dbs/unmasked_db_v1/inspect.out:

0.00 0 0 U 0 unclassified
100.00 5657704547 83283590 R 1 root
90.98 5147592210 2795098 R1 131567 cellular organisms
69.58 3936411473 63905134 D 2 Bacteria
44.92 2541363642 13519023 D1 1783272 Terrabacteria group
38.85 2198170416 72921856 P 1239 Firmicutes
26.88 1520611030 2145516 C 186801 Clostridia
26.59 1504398150 86242030 O 186802 Clostridiales
6.71 379420572 17289207 F 541000 Ruminococcaceae
1.51 85357482 0 F1 541002 environmental samples
1.51 85357482 24512736 S 541003 uncultured Ruminococcaceae bacterium

I guess it has to do with the database... the script may not be universal?

meenachakra commented 1 year ago

Oh, it's very informative that the column counts are different! I didn't notice that at first glance, as I only looked at the first few and last columns. It indicates that your kraken isn't reporting unique minimizer data, which is required for the filtering step (please see our paper).
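A quick way to check a report for this is to count the fields per line. The sketch below assumes the report file is tab-separated (the reports pasted in this thread were flattened to spaces) and uses the counts seen in this thread: 8 fields when the two minimizer columns are present, 6 when they are not. The function names are made up for illustration.

```python
def report_column_count(line: str) -> int:
    """Number of tab-separated fields in one Kraken2 report line."""
    return len(line.rstrip("\n").split("\t"))


def has_minimizer_columns(report_lines) -> bool:
    """True if every non-empty line has 8 fields (6 standard + 2 minimizer)."""
    counts = {report_column_count(line) for line in report_lines if line.strip()}
    return counts == {8}


# First line of each sample19 report from this thread, rebuilt with tabs.
with_minimizers = ["0.58\t38187\t38187\t0\t0\tU\t0\tunclassified\n"]
without_minimizers = ["0.58\t38187\t38187\tU\t0\tunclassified\n"]

print(has_minimizer_columns(with_minimizers))
print(has_minimizer_columns(without_minimizers))
```

For a real file, something like `has_minimizer_columns(open(path))` would run the same check over every line.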

Could you try running just kraken in your environment?

This is the command to run (from the Snakefile)

kraken2 --db {params.db} --threads {threads} --output {output.krak} \
--report {output.krak_report} --report-minimizer-data {params.paired_string} \
{params.gzipped_string} {input.reads} --confidence {params.confidence_threshold}

You need to replace everything in brackets.

meenachakra commented 1 year ago

And - thank you for letting us know about the servers! Indeed, some of the servers at our university are currently down till Monday for maintenance - we should indicate that on the README! Thanks!

hgingras commented 1 year ago

Dear Meenachakra:

I ran this:

kraken2 --db /home/helene/projects/def-helene/helene/Tickets/0194347/phanta_dbs/unmasked_db_v1 --threads 1 --output output.krak --report report.output.krak --report-minimizer-data --paired --gzip-compressed /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/testing/sample18_R1.fastq.gz /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/testing/sample18_R2.fastq.gz --confidence 0.1

Got this message:
Unknown option: report-minimizer-data
Loading database information...classify: Error reading in hash table

I ran it again (without --report-minimizer-data):

kraken2 --db /home/helene/projects/def-helene/helene/Tickets/0194347/phanta_dbs/unmasked_db_v1 --threads 1 --output output.krak --report report.output.krak --paired --gzip-compressed /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/testing/sample18_R1.fastq.gz /home/helene/projects/def-helene/helene/Tickets/0194347/phanta/testing/sample18_R2.fastq.gz --confidence 0.1

Got this message:
Loading database information...classify: Error reading in hash table

meenachakra commented 1 year ago

Thanks! The error reading in hash table is because you don't have enough available memory.

You need at least 32GB memory available to run the command (which you must have available on your system, since you ran kraken before successfully with the Snakefile). Could you try to specify this memory requirement when you run the command?
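One rough way to sanity-check this before running is to compare the size of the database's hash table file with the memory you have requested. The sketch below assumes the database directory contains a `hash.k2d` file (the usual Kraken2 hash table file name) and that the hash table must fit entirely in RAM; both helper names are hypothetical, and the check ignores any other memory the run needs.

```python
import os


def fits_in_memory(hash_table_bytes: int, available_mb: int) -> bool:
    """True if a hash table of the given size fits in available_mb megabytes."""
    return hash_table_bytes <= available_mb * 1024 * 1024


def database_fits(db_dir: str, available_mb: int) -> bool:
    """Compare the size of <db_dir>/hash.k2d with the available memory."""
    hash_path = os.path.join(db_dir, "hash.k2d")
    return fits_in_memory(os.path.getsize(hash_path), available_mb)


# Illustrative numbers only: a 30 GiB table against a 32 GiB allocation.
print(fits_in_memory(30 * 1024**3, 32 * 1024))
```

Inside a SLURM job, the allocation could be read from an environment variable such as `SLURM_MEM_PER_NODE` (assumed here to be set, in megabytes, when `--mem` is given) and passed as `available_mb`.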

For the unknown option: that's strange! And is probably the source of the entire issue you're having.

It's a relatively new feature in Kraken2... but 2.1.2 is indeed the version we used.

Could you try to not load Kraken2 in your environment and just load bracken (and use v2.7 rather than 2.6.0 for that)? As far as I recall, installing Bracken installs Kraken2 by default...
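To check which kraken2 a given environment actually picks up, and whether it knows the option, something like the following sketch may help. It assumes that a kraken2 build supporting the option mentions `--report-minimizer-data` in its `--help` text, and it reads both stdout and stderr because the help text may go to either; the function names are hypothetical.

```python
import shutil
import subprocess

FLAG = "--report-minimizer-data"


def help_mentions_flag(help_text: str, flag: str = FLAG) -> bool:
    """True if the help text mentions the given command-line flag."""
    return flag in help_text


def kraken2_supports_flag(executable: str = "kraken2", flag: str = FLAG):
    """Return True/False for the executable found on PATH, or None if absent."""
    path = shutil.which(executable)
    if path is None:
        return None
    proc = subprocess.run([path, "--help"], capture_output=True, text=True)
    return help_mentions_flag(proc.stdout + proc.stderr, flag)


# Prints None if no kraken2 is on PATH in the current environment.
print(kraken2_supports_flag())
```

Running it once with the kraken2 module loaded and once with only bracken loaded would show whether the two setups resolve to different executables.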

hgingras commented 1 year ago

Dear Meenachakra,

I ran again with bracken v.2.7 and it worked perfectly.

Thanks again for all your help.

Hélène

meenachakra commented 1 year ago

Perfect!! That's great to hear. Good luck running Phanta on your samples.