bayraktar1 / SRA-Data-Collector

Snakemake pipeline for finding samples on NCBI through taxons and accessions
MIT License

Questions about adding inputs #4

Closed lane66 closed 3 months ago

lane66 commented 4 months ago

Hello, I have a question about how to input data into this tool. For example, I have a set of run accessions that I want to query. How should I go about doing this? Should I add them to the 'accessions.txt' file? What would be the correct format for this file?

Thank you very much for your help.

bayraktar1 commented 4 months ago

Hi,

Yes, you should add all the accessions to accessions.txt. They should be on a single line, separated only by spaces.
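The format described above (one line, space-separated) can be produced with a few lines of Python. This is only a minimal sketch and not part of the pipeline: the helper names `format_accessions` and `write_accessions` are hypothetical, the accessions in the demo are placeholders, and the thread does not say whether a trailing newline is tolerated, so none is written here.

```python
from pathlib import Path


def format_accessions(accessions):
    """Join accessions into a single space-separated line."""
    # Strip stray whitespace and drop empty entries.
    cleaned = [a.strip() for a in accessions if a.strip()]
    for acc in cleaned:
        # An accession with inner whitespace would be read as two entries.
        if any(ch.isspace() for ch in acc):
            raise ValueError(f"accession contains whitespace: {acc!r}")
    return " ".join(cleaned)


def write_accessions(accessions, path="accessions.txt"):
    """Write the single-line accession list to `path` and return the line."""
    line = format_accessions(accessions)
    Path(path).write_text(line)
    return line


if __name__ == "__main__":
    # Placeholder run accessions, for illustration only.
    print(format_accessions(["SRR0000001", "SRR0000002", "SRR0000003"]))
```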

lane66 commented 4 months ago

Thank you for your reply! I followed your instructions and entered my data into accessions.txt and taxons.txt, but encountered an error during the sample query step. The following is the content of query_ncbi.log:

Loading required package: RSQLite
Loading required package: graph
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: RCurl
Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ stringr::boundary() masks graph::boundary()
✖ dplyr::combine() masks BiocGenerics::combine()
✖ tidyr::complete() masks RCurl::complete()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
Database file specified.
Error in map():
ℹ In index: 1.
Caused by error in down[[as.character(taxon_id)]][["childtaxa_id"]]:
! subscript out of bounds
Backtrace:
▆
  1. ├─... %>% paste(collapse = ", ")
  2. ├─BiocGenerics::paste(., collapse = ", ")
  3. │ └─BiocGenerics (local) standardGeneric("paste")
  4. │ ├─BiocGenerics::eval(quote(list(...)), env)
  5. │ └─base::eval(quote(list(...)), env)
  6. │ └─base::eval(quote(list(...)), env)
  7. ├─base::unlist(.)
  8. ├─purrr::map(., check_rank)
  9. │ └─purrr:::map_("list", .x, .f, ..., .progress = .progress)
 10. │ ├─purrr:::with_indexed_errors(...)
 11. │ │ └─base::withCallingHandlers(...)
 12. │ ├─purrr:::call_with_cleanup(...)
 13. │ └─global .f(.x[[i]], ...)
 14. │ └─base::unlist(down[[as.character(taxon_id)]][["childtaxa_id"]])
 15. └─purrr (local) <fn>(<sbscOOBE>)
 16. └─cli::cli_abort(...)
 17. └─rlang::abort(...)
Execution halted

bayraktar1 commented 4 months ago

Could you provide the contents of your taxons.txt and accessions.txt?

lane66 commented 4 months ago

Hi, these are the contents of accessions.txt:

This is the content of taxons.txt: 83334

bayraktar1 commented 3 months ago

Hi @lane66,

This error is caused by the taxon ID 83334, which has the NCBI rank “serotype”. Currently, the pipeline only works with ranks of species and above. I will improve this soon so that the tool can also retrieve ranks like “strain” and “serotype”. If you still want to download the accessions, you can leave taxons.txt empty for now.
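Until that improvement lands, taxon IDs can be screened before they go into taxons.txt. The following is a rough sketch of such a check, not the pipeline's own logic: the name `unsupported_taxa` and the rank set are assumptions made for illustration, the set is deliberately limited to the main ranks, and the rank of each taxon ID has to be looked up separately (for example on the NCBI Taxonomy website).

```python
# Main NCBI Taxonomy ranks at species level and above.
# ASSUMPTION: illustrative list only; the pipeline's own check may accept
# other ranks (e.g. intermediate ranks such as "subfamily").
SUPPORTED_RANKS = {
    "superkingdom", "kingdom", "phylum", "class",
    "order", "family", "genus", "species",
}


def unsupported_taxa(ranks_by_taxon):
    """Return the taxon IDs whose rank is below species level.

    `ranks_by_taxon` maps a taxon ID (string) to its rank name,
    as looked up by the caller.
    """
    return sorted(
        taxon_id
        for taxon_id, rank in ranks_by_taxon.items()
        if rank.lower() not in SUPPORTED_RANKS
    )


if __name__ == "__main__":
    # 83334 is the serotype-rank ID from this thread; 562 is Escherichia coli (species).
    print(unsupported_taxa({"83334": "serotype", "562": "species"}))
```

Any ID this reports would need to be removed from taxons.txt, or replaced by its parent species, before running the pipeline.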

lane66 commented 3 months ago

Thank you very much for your help! I have cleared the contents of taxons.txt and retained the contents of accessions.txt, but it still didn't work.

Error message:

Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Creating conda environment workflow/envs/stats_notebook.yml...
Downloading and installing remote packages.
Environment for /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/workflow/envs/stats_notebook.yml created (location: .snakemake/conda/efe391711166ea8c94168e241b824d97_)
Creating conda environment workflow/envs/Renv.yml...
Downloading and installing remote packages.
Environment for /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/workflow/envs/Renv.yml created (location: .snakemake/conda/3632384e6b33db068b67efa1292d052b_)
Creating conda environment workflow/envs/metadata_notebook.yaml...
Downloading and installing remote packages.
Environment for /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/workflow/envs/metadata_notebook.yaml created (location: .snakemake/conda/172d91faeddf74aca764e0e713a528e2_)
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job stats:
job                 count
----------------  -------
all                     1
download_SRAdb          1
platform_stats          1
query_ncbi              1
wrangle_metadata        1
total                   5

Select jobs to execute...
Execute 1 jobs...

[Mon May 20 12:37:56 2024]
localrule download_SRAdb:
    output: Data/SRAmetadb.sqlite
    log: logs/download_db/download_SRAdb.log
    jobid: 3
    reason: Missing output files: Data/SRAmetadb.sqlite
    resources: tmpdir=/scratch/18864336, runtime=60, partition=cpu, ntasks=1, cpus_per_task=1

( wget https://gbnci.cancer.gov/backup/SRAmetadb.sqlite.gz -P Data/ && gzip -d Data/SRAmetadb.sqlite.gz ) >logs/download_db/download_SRAdb.log 2>&1
[Mon May 20 12:54:24 2024]
Finished job 3.
1 of 5 steps (20%) done
Select jobs to execute...
Execute 1 jobs...

[Mon May 20 12:54:24 2024]
localrule query_ncbi:
    input: Data/SRAmetadb.sqlite
    output: results/SRA.feather
    log: logs/query_ncbi/query_ncbi.log
    jobid: 2
    reason: Missing output files: results/SRA.feather; Input files updated by another job: Data/SRAmetadb.sqlite
    resources: tmpdir=/scratch/18864336, runtime=60, partition=cpu, ntasks=1, cpus_per_task=1

    (workflow/scripts/retrieve_NCBI_metadata.R \
        --database Data/SRAmetadb.sqlite \
        --taxon_id_file Data/taxons.txt \
        --accession_file Data/accessions.txt \
        --output results/SRA.feather) >logs/query_ncbi/query_ncbi.log 2>&1

Activating conda environment: .snakemake/conda/3632384e6b33db068b67efa1292d052b_
[Mon May 20 12:56:41 2024]
Finished job 2.
2 of 5 steps (40%) done
Select jobs to execute...
Execute 1 jobs...

[Mon May 20 12:56:41 2024]
localrule wrangle_metadata:
    input: results/SRA.feather
    output: results/metadata.csv, results/clean_tsv.tsv
    log: logs/wrangle_metadata/processed_notebook.ipynb
    jobid: 1
    reason: Missing output files: results/metadata.csv; Input files updated by another job: results/SRA.feather
    resources: tmpdir=/scratch/18864336, runtime=60, partition=cpu, ntasks=1, cpus_per_task=1, mem_mb=8000, mem_mib=7630, max_mb=16000

Activating conda environment: .snakemake/conda/172d91faeddf74aca764e0e713a528e2_
0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
Traceback (most recent call last):
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/bin/jupyter-nbconvert", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/jupyter_core/application.py", line 283, in launch_instance
    super().launch_instance(argv=argv, **kwargs)
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbconvert/nbconvertapp.py", line 420, in start
    self.convert_notebooks()
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbconvert/nbconvertapp.py", line 597, in convert_notebooks
    self.convert_single_notebook(notebook_filename)
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbconvert/nbconvertapp.py", line 563, in convert_single_notebook
    output, resources = self.export_single_notebook(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbconvert/nbconvertapp.py", line 487, in export_single_notebook
    output, resources = self.exporter.from_filename(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbconvert/exporters/exporter.py", line 201, in from_filename
    return self.from_file(f, resources=resources, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbconvert/exporters/exporter.py", line 220, in from_file
    return self.from_notebook_node(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbconvert/exporters/notebook.py", line 36, in from_notebook_node
    nb_copy, resources = super().from_notebook_node(nb, resources, **kw)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbconvert/exporters/exporter.py", line 154, in from_notebook_node
    nb_copy, resources = self._preprocess(nb_copy, resources)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbconvert/exporters/exporter.py", line 353, in _preprocess
    nbc, resc = preprocessor(nbc, resc)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbconvert/preprocessors/base.py", line 48, in __call__
    return self.preprocess(nb, resources)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbconvert/preprocessors/execute.py", line 103, in preprocess
    self.preprocess_cell(cell, resources, index)
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbconvert/preprocessors/execute.py", line 124, in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/jupyter_core/utils/__init__.py", line 165, in wrapped
    return loop.run_until_complete(inner)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbclient/client.py", line 1062, in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/nbclient/client.py", line 918, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:

# file_path = '../../results/SRA.feather'
# data = feather.read_feather(file_path)
data = feather.read_feather(snakemake.input[0])

metadata_df = pd.DataFrame(data)
metadata_df = metadata_df.convert_dtypes()
metadata_df.set_index('run_accession', inplace=True)

print(f'---Number of rows: {metadata_df.shape[0]}, Number of columns: {metadata_df.shape[1]}---')
metadata_df.head()


KeyError                                  Traceback (most recent call last)
/scratch/18864336/ipykernel_2813461/1824482230.py in ?()
      3 data = feather.read_feather(snakemake.input[0])
      4
      5 metadata_df = pd.DataFrame(data)
      6 metadata_df = metadata_df.convert_dtypes()
----> 7 metadata_df.set_index('run_accession', inplace=True)
      8
      9 print(f'---Number of rows: {metadata_df.shape[0]}, Number of columns: {metadata_df.shape[1]}---')
     10 metadata_df.head()

/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_/lib/python3.12/site-packages/pandas/core/frame.py in ?(self, keys, drop, append, inplace, verify_integrity)
   6118                 if not found:
   6119                     missing.append(col)
   6120
   6121         if missing:
-> 6122             raise KeyError(f"None of {missing} are in the columns")
   6123
   6124         if inplace:
   6125             frame = self

KeyError: "None of ['run_accession'] are in the columns"

RuleException:
CalledProcessError in file /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/workflow/metadata.smk, line 76:
Command 'source /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/envs/panaroo/bin/activate '/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_'; set -euo pipefail; jupyter-nbconvert --log-level ERROR --execute --output /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/logs/wrangle_metadata/processed_notebook.ipynb --to notebook --ExecutePreprocessor.timeout=-1 /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/scripts/tmphyfhr_xy.wrangle_NCBI_metadata.py.ipynb' returned non-zero exit status 1.
[Mon May 20 12:56:48 2024]
Error in rule wrangle_metadata:
    jobid: 1
    input: results/SRA.feather
    output: results/metadata.csv, results/clean_tsv.tsv
    log: logs/wrangle_metadata/processed_notebook.ipynb (check log file(s) for error details)
    conda-env: /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-20T122915.670595.snakemake.log
WorkflowError:
At least one job did not complete successfully.

bayraktar1 commented 3 months ago

Please make an effort to properly format the logs and include only the relevant parts.

lane66 commented 3 months ago

Hi, this is the relevant part of the log:

[Mon May 20 12:54:24 2024]
Finished job 3.
1 of 5 steps (20%) done
Select jobs to execute...
Execute 1 jobs...

[Mon May 20 12:54:24 2024]
localrule query_ncbi:
    input: Data/SRAmetadb.sqlite
    output: results/SRA.feather
    log: logs/query_ncbi/query_ncbi.log
    jobid: 2
    reason: Missing output files: results/SRA.feather; Input files updated by another job: Data/SRAmetadb.sqlite
    resources: tmpdir=/scratch/18864336, runtime=60, partition=cpu, ntasks=1, cpus_per_task=1

    (workflow/scripts/retrieve_NCBI_metadata.R \
        --database Data/SRAmetadb.sqlite \
        --taxon_id_file Data/taxons.txt \
        --accession_file Data/accessions.txt \
        --output results/SRA.feather) >logs/query_ncbi/query_ncbi.log 2>&1

Activating conda environment: .snakemake/conda/3632384e6b33db068b67efa1292d052b_
[Mon May 20 12:56:41 2024]
Finished job 2.
2 of 5 steps (40%) done
Select jobs to execute...
Execute 1 jobs...

[Mon May 20 12:56:41 2024]
localrule wrangle_metadata:
    input: results/SRA.feather
    output: results/metadata.csv, results/clean_tsv.tsv
    log: logs/wrangle_metadata/processed_notebook.ipynb
    jobid: 1
    reason: Missing output files: results/metadata.csv; Input files updated by another job: results/SRA.feather
    resources: tmpdir=/scratch/18864336, runtime=60, partition=cpu, ntasks=1, cpus_per_task=1, mem_mb=8000, mem_mib=7630, max_mb=16000

Activating conda environment: .snakemake/conda/172d91faeddf74aca764e0e713a528e2_
RuleException:
CalledProcessError in file /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/workflow/metadata.smk, line 76:
Command 'source /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/envs/panaroo/bin/activate '/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_'; set -euo pipefail; jupyter-nbconvert --log-level ERROR --execute --output /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/logs/wrangle_metadata/processed_notebook.ipynb --to notebook --ExecutePreprocessor.timeout=-1 /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/scripts/tmphyfhr_xy.wrangle_NCBI_metadata.py.ipynb' returned non-zero exit status 1.
[Mon May 20 12:56:48 2024]
Error in rule wrangle_metadata:
    jobid: 1
    input: results/SRA.feather
    output: results/metadata.csv, results/clean_tsv.tsv
    log: logs/wrangle_metadata/processed_notebook.ipynb (check log file(s) for error details)
    conda-env: /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2_

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-20T122915.670595.snakemake.log
WorkflowError:
At least one job did not complete successfully.

bayraktar1 commented 3 months ago

This error occurs because the SRA.feather file is empty. This means that the database did not contain any of the accessions you submitted.

I checked the database manually for a couple of samples you provided, and they were not present. They do, however, seem to be findable on the NCBI website. The studies related to the runs seem to be in the database as well. For example, SRR7850007 is part of the SRP071789 study, and that study has 500 runs in the database.

So this seems to be an issue with NCBI and the SRA database dump, which I cannot do anything about. Maybe the accessions for these studies were updated recently, and the database dump has not yet been updated by NCBI. If all the accessions are from the same studies, you could try using the study accessions instead.
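The manual check described above can be approximated with Python's built-in `sqlite3` module. This is a hedged sketch rather than the pipeline's own query: it assumes the dump exposes a table named `sra` with `run_accession` and `study_accession` columns, which should be verified against the actual Data/SRAmetadb.sqlite file. The helper names and the accessions in the demo are hypothetical, and the demo uses a toy in-memory table instead of the real database.

```python
import sqlite3


def find_missing_runs(conn, run_accessions):
    """Return the run accessions that have no row in the `sra` table."""
    missing = []
    for acc in run_accessions:
        row = conn.execute(
            "SELECT 1 FROM sra WHERE run_accession = ? LIMIT 1", (acc,)
        ).fetchone()
        if row is None:
            missing.append(acc)
    return missing


def count_runs_in_study(conn, study_accession):
    """Count the distinct runs recorded for one study accession."""
    (n,) = conn.execute(
        "SELECT COUNT(DISTINCT run_accession) FROM sra WHERE study_accession = ?",
        (study_accession,),
    ).fetchone()
    return n


if __name__ == "__main__":
    # Toy stand-in for the real dump; for the real file one would use
    # sqlite3.connect("Data/SRAmetadb.sqlite") instead.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sra (run_accession TEXT, study_accession TEXT)")
    conn.executemany(
        "INSERT INTO sra VALUES (?, ?)",
        [("SRR0000001", "SRP000001"), ("SRR0000002", "SRP000001")],
    )
    print(find_missing_runs(conn, ["SRR0000001", "SRR9999999"]))
    print(count_runs_in_study(conn, "SRP000001"))
```

If every submitted run accession comes back as missing while its study still has runs in the dump, that matches the situation described above, and querying by study accession is the more promising route.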