bayraktar1 / SRA-Data-Collector

Snakemake pipeline for finding samples on NCBI through taxons and accessions
MIT License

Questions about adding inputs #4

Closed lane66 closed 3 months ago

lane66 commented 4 months ago

Hello, I have a question about how to input data into this tool. For example, I have a set of run accessions that I want to query. How should I go about doing this? Should I add them to the 'accessions.txt' file? What would be the correct format for this file?

Thank you very much for your help.

bayraktar1 commented 4 months ago

Hi,

Yes, you should add all the accessions to accessions.txt. They should be on a single line, separated only by spaces.
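As a minimal sketch of that format, the snippet below writes a list of run accessions as one space-separated line. The helper name `write_accessions` is hypothetical and not part of the pipeline, and the choice to omit a trailing newline is an assumption, since the thread does not say whether the pipeline tolerates one.

```python
from pathlib import Path


def write_accessions(accessions, path):
    """Write accessions to `path` on a single line, separated by single spaces.

    Hypothetical helper for illustration; not part of SRA-Data-Collector.
    """
    # Strip stray whitespace, skip empty entries, and drop duplicates
    # while preserving the original order.
    cleaned = list(dict.fromkeys(a.strip() for a in accessions if a.strip()))
    # Assumption: no trailing newline, so the file holds exactly one line.
    Path(path).write_text(" ".join(cleaned))
    return cleaned
```

Calling `write_accessions(["SRR6324145", "SRR7474056"], "accessions.txt")` would produce a file containing `SRR6324145 SRR7474056`.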

lane66 commented 4 months ago

Thank you for your reply! I followed your instructions to input data into accessions.txt and taxons.txt, but encountered an error during the sample query step. The following is the content of query_ncbi.log:

Loading required package: RSQLite
Loading required package: graph
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: RCurl
Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ stringr::boundary() masks graph::boundary()
✖ dplyr::combine() masks BiocGenerics::combine()
✖ tidyr::complete() masks RCurl::complete()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
Database file specified.
Error in map():
ℹ In index: 1.
Caused by error in down[[as.character(taxon_id)]][["childtaxa_id"]]:
! subscript out of bounds
Backtrace:
▆

  1. ├─... %>% paste(collapse = ", ")
  2. ├─BiocGenerics::paste(., collapse = ", ")
  3. │ └─BiocGenerics (local) standardGeneric("paste")
  4. │ ├─BiocGenerics::eval(quote(list(...)), env)
  5. │ └─base::eval(quote(list(...)), env)
  6. │ └─base::eval(quote(list(...)), env)
  7. ├─base::unlist(.)
  8. ├─purrr::map(., check_rank)
  9. │ └─purrr:::map_("list", .x, .f, ..., .progress = .progress)
 10. │ ├─purrr:::with_indexed_errors(...)
 11. │ │ └─base::withCallingHandlers(...)
 12. │ ├─purrr:::call_with_cleanup(...)
 13. │ └─global .f(.x[[i]], ...)
 14. │ └─base::unlist(down[[as.character(taxon_id)]][["childtaxa_id"]])
 15. └─purrr (local) <fn>(<sbscOOBE>)
 16. └─cli::cli_abort(...)
 17. └─rlang::abort(...)
Execution halted

bayraktar1 commented 4 months ago

Could you provide the content of your taxons.txt and accessions.txt?

lane66 commented 4 months ago

Hi These are the content of accessions.txt: SRR6324145 SRR7474056 SRR7588713 SRR7204340 SRR7278083 SRR6324183 SRR6324274 SRR6324150 SRR6324133 SRR6324134 SRR6324138 SRR6324144 SRR6324147 SRR6324156 SRR6324267 SRR6324285 SRR6324132 SRR7186958 SRR7283830 SRR6324154 SRR7285584 SRR6053024 SRR7283721 SRR7286506 SRR6052695 SRR7866244 SRR6001290 SRR6052749 SRR7215898 SRR7244223 SRR7215914 SRR6001331 SRR6079435 SRR6079436 SRR6052827 SRR7290983 SRR6053036 SRR6324149 SRR7249500 SRR7286585 SRR7407939 SRR6324143 SRR6324151 SRR7836292 SRR7850093 SRR7850110 SRR6052082 SRR7256996 SRR7942548 SRR7167519 SRR7827073 SRR7285437 SRR7186892 SRR7456545 SRR7975963 SRR7975967 SRR7833081 SRR7286247 SRR7286555 SRR7850091 SRR7223133 SRR8237817 SRR6053033 SRR7215572 SRR7839370 SRR7849999 SRR7905909 SRR7850122 SRR7842170 SRR7277753 SRR7250961 SRR7839306 SRR6053035 SRR6053022 SRR7223036 SRR7285519 SRR7832325 SRR6052865 SRR7187908 SRR6052786 SRR7291031 SRR5461667 SRR6001299 SRR6052730 SRR6052757 SRR6052774 SRR6052867 SRR7163929 SRR7209034 SRR7265967 SRR7837071 SRR7215567 SRR7495426 SRR7277726 SRR6053005 SRR6052933 SRR8116851 SRR7833839 SRR7850149 SRR7842058 SRR7401630 SRR7367553 SRR7244359 SRR6052872 SRR7456550 SRR7456566 SRR7285591 SRR7215505 SRR7257105 SRR6052937 SRR7456559 SRR8149162 SRR7826917 SRR7826919 SRR7839324 SRR7215277 SRR7866214 SRR6053004 SRR7842118 SRR6052688 SRR7249774 SRR7230164 SRR6052805 SRR8291876 SRR7516052 SRR7850081 SRR7889987 SRR7367537 SRR7367234 SRR7285439 SRR8116849 SRR7286655 SRR7358278 SRR7827063 SRR7866110 SRR7367528 SRR6053015 SRR6052765 SRR7184268 SRR7368890 SRR7244295 SRR6052936 SRR8270037 SRR7236574 SRR7221212 SRR7230350 SRR6052799 SRR7291549 SRR7265843 SRR7866267 SRR7833076 SRR7285560 SRR7187781 SRR6052925 SRR7285498 SRR6052784 SRR7286168 SRR6079381 SRR8201566 SRR7905880 SRR7892543 SRR8086682 SRR7842174 SRR7905906 SRR7221209 SRR7256954 SRR7407963 SRR7230246 SRR7407961 SRR7367407 SRR7358287 SRR7495425 SRR7495430 SRR7516048 SRR7244215 SRR6052918 SRR6052903 
SRR6001311 SRR7282617 SRR6052783 SRR6079449 SRR6052060 SRR7291635 SRR7265962 SRR7223049 SRR6052997 SRR6079420 SRR7257043 SRR7367385 SRR7251455 SRR7276922 SRR7265860 SRR7367577 SRR6053001 SRR7163934 SRR7291024 SRR7186895 SRR7163914 SRR7842060 SRR7230229 SRR7249678 SRR7266820 SRR6001310 SRR6052929 SRR6053006 SRR7223105 SRR8166011 SRR7836311 SRR7842149 SRR7866238 SRR7842069 SRR7358043 SRR7236592 SRR7179946 SRR7265951 SRR7291579 SRR7474041 SRR6052808 SRR7274846 SRR7208779 SRR7256958 SRR6052744 SRR7215923 SRR7256949 SRR7839321 SRR7277870 SRR7866067 SRR7215939 SRR7850002 SRR7826922 SRR6052994 SRR7836302 SRR7842130 SRR7187823 SRR7285471 SRR6052813 SRR8165796 SRR7274814 SRR7358288 SRR7236607 SRR7358049 SRR7474050 SRR8032846 SRR7826971 SRR7827059 SRR6052875 SRR6079438 SRR6052851 SRR6053023 SRR8235466 SRR6052822 SRR7274854 SRR6052795 SRR7163889 SRR7292644 SRR7840168 SRR7522849 SRR6052891 SRR7221107 SRR7221257 SRR6079430 SRR6053037 SRR6079448 SRR7191643 SRR7221123 SRR7249522 SRR7588692 SRR7256994 SRR7256983 SRR7282603 SRR7265941 SRR7367468 SRR6052793 SRR8235463 SRR7277790 SRR7835329 SRR6052996 SRR8185662 SRR7522845 SRR7588722 SRR6052766 SRR7401651 SRR7204320 SRR7892573 SRR7186913 SRR7277708 SRR7285583 SRR7172279 SRR7367391 SRR7892530 SRR7892546 SRR7251416 SRR7426134 SRR7230294 SRR7401624 SRR7367258 SRR7208467 SRR7850120 SRR7839297 SRR7842117 SRR7842145 SRR7407964 SRR6052889 SRR7251026 SRR6001321 SRR7456546 SRR6001344 SRR7286586 SRR7283759 SRR6052788 SRR7516049 SRR7401635 SRR7835540 SRR7244288 SRR6052801 SRR6052834 SRR7221271 SRR6052816 SRR7850162 SRR7850216 SRR7588720 SRR7204333 SRR7285556 SRR7286575 SRR7826936 SRR7842159 SRR7251418 SRR6872955 SRR7274803 SRR7975956 SRR7208993 SRR6052930 SRR7186952 SRR7850156 SRR6052993 SRR6079443 SRR7163893 SRR7221283 SRR7172290 SRR7407915 SRR6872960 SRR6872964 SRR7249752 SRR7274845 SRR7588701 SRR6872963 SRR7285585 SRR6872952 SRR6872965 SRR7266802 SRR7942549 SRR7944811 SRR7975978 SRR7975982 SRR7905903 SRR7187916 SRR7291613 SRR7184361 
SRR7943902 SRR7290921 SRR7230210 SRR6001332 SRR7179880 SRR6052868 SRR6052921 SRR6053013 SRR7236681 SRR7291041 SRR6053020 SRR6052919 SRR7184307 SRR7367473 SRR6052894 SRR7439064 SRR7184214 SRR7495419 SRR7265925 SRR7495424 SRR7474048 SRR6052854 SRR7187058 SRR7835546 SRR7944803 SRR7588687 SRR8097468 SRR7833084 SRR7850169 SRR6053029 SRR6052904 SRR7835545 SRR7367360 SRR7191537 SRR7274842 SRR7282651 SRR7204345 SRR7401649 SRR7833097 SRR7249507 SRR7285406 SRR7223139 SRR6052811 SRR7186927 SRR7850078 SRR7191686 SRR7221039 SRR7407937 SRR7456547 SRR7975964 SRR6052927 SRR7251000 SRR7826969 SRR7842092 SRR6001327 SRR7850152 SRR7401617 SRR6052916 SRR7204218 SRR8380047 SRR6052843 SRR8097463 SRR7850112 SRR7285469 SRR8052863 SRR7839312 SRR7286546 SRR7358022 SRR8288158 SRR7850076 SRR7516046 SRR7407911 SRR7842053 SRR7892123 SRR6053040 SRR7266828 SRR8032409 SRR8032390 SRR8032430 SRR8184620 SRR8185663 SRR7186915 SRR7215512 SRR6052820 SRR8304471 SRR7291030 SRR6001333 SRR7842057 SRR7850004 SRR7456548 SRR7223072 SRR6052935 SRR6079385 SRR7890013 SRR7842054 SRR7842110 SRR7842151 SRR7850016 SRR7842150 SRR7866086 SRR7283811 SRR7230206 SRR7456551 SRR7250965 SRR7184306 SRR7291039 SRR6053025 SRR6052998 SRR7850111 SRR7256989 SRR7850175 SRR7842146 SRR6052817 SRR7439047 SRR7367527 SRR7285612 SRR7286370 SRR7286541 SRR7290891 SRR8114780 SRR7842064 SRR7266799 SRR7291578 SRR7221278 SRR7358277 SRR7842107 SRR7850074 SRR7250951 SRR7291921 SRR7944818 SRR7842175 SRR7522835 SRR7474055 SRR7850079 SRR8146041 SRR8086718 SRR8097451 SRR8086699 SRR7850005 SRR6053008 SRR6001314 SRR7204322 SRR7223039 SRR7873471 SRR8293756 SRR7358268 SRR7184236 SRR7265869 SRR6052780 SRR7291626 SRR7283732 SRR6079429 SRR5905747 SRR7187841 SRR6052122 SRR7832333 SRR7230310 SRR6052871 SRR7285551 SRR6816942 SRR7186948 SRR7283770 SRR6052719 SRR6816956 SRR6816951 SRR8032828 SRR7892542 SRR7866066 SRR7850137 SRR7975974 SRR7836295 SRR8086692 SRR8194860 SRR8086715 SRR7456569 SRR7187768 SRR7283765 SRR7358002 SRR6816930 SRR6816935 SRR7944804 
SRR6816938 SRR6052924 SRR7826959 SRR7826964 SRR7827064 SRR7850025 SRR7850096 SRR7850176 SRR7826975 SRR7866247 SRR7866296 SRR6825176 SRR8235474 SRR8240972 SRR8257974 SRR7842154 SRR7849998 SRR7407986 SRR7407987 SRR7474059 SRR7286495 SRR7367544 SRR6052890 SRR6821014 SRR7944806 SRR7290922 SRR7367380 SRR6816954 SRR7285564 SRR7286636 SRR6052863 SRR6052743 SRR7230321 SRR7839330 SRR7839341 SRR7866083 SRR6816934 SRR7850007 SRR7826986 SRR7839326 SRR7516050 SRR7208632 SRR7290926 SRR6053018 SRR7866080 SRR6052917 SRR6821015 SRR6053028 SRR8097453 SRR7942553 SRR7892570 SRR7439070 SRR7291623 SRR7282606 SRR7474036 SRR6816939 SRR6079442 SRR6052705 SRR7236433 SRR6816929 SRR7215906 SRR7286548 SRR7842051 SRR7850127 SRR6052907 SRR7849993 SRR7837069 SRR7836294 SRR7832330 SRR7833106 SRR8185530 SRR8185644 SRR8201660 SRR7892524 SRR8361098 SRR7889983 SRR7850098 SRR7836283 SRR7866255 SRR7290966 SRR6816955 SRR7236579 SRR6816936 SRR7277736 SRR7850088 SRR6052928 SRR6052926 SRR6052862 SRR7975958 SRR7850049 SRR6816949 SRR6825174 SRR6052910 SRR6816937 SRR6820993 SRR7204324 SRR6052739 SRR6052773 SRR6052126 SRR8149115 SRR7827049 SRR7184209 SRR7291614 SRR7850154 SRR7184367 SRR7256953 SRR7266794 SRR6052864 SRR6825177 SRR6053044 SRR6079424 SRR8086680 SRR7283828 SRR7187002 SRR7265736 SRR8116854 SRR8116848 SRR8086700 SRR6816945 SRR7826928 SRR7283675 SRR8097436 SRR8117841 SRR8116852 SRR8114791 SRR6052789 SRR7516055 SRR8086683 SRR7842173 SRR7522833 SRR6816928 SRR7839376 SRR7223062 SRR8185660 SRR6820839 SRR8176533 SRR8172592 SRR7850033 SRR7826932 SRR7588707 SRR7588714 SRR7588725 SRR7495427 SRR6052857 SRR6825175 SRR6052879 SRR7285424 SRR7286248 SRR6821009 SRR6816933 SRR6816931 SRR6816941 SRR7905907 SRR7879232 SRR6816950 SRR8159441 SRR6816940 SRR7187790 SRR8258010 SRR8240969 SRR7839322 SRR7890022 SRR7839315 SRR7850118 SRR8288159 SRR6053019 SRR7367230 SRR6079386 SRR6079364 SRR7474043 SRR7358030 SRR7839320 SRR6052821 SRR6052847 SRR7495428 SRR7827043 SRR8185672 SRR7827069 SRR8052158 SRR8032831 SRR8032836 
SRR7892555 SRR7283671 SRR7866259 SRR6052830 SRR7286521 SRR7283749 SRR8304512 SRR8148965 SRR7889984 SRR7879235 SRR7833069 SRR7842085 SRR7215509 SRR6052832 SRR7223137 SRR7515979 SRR7456552 SRR6079365 SRR6052121 SRR7367589 SRR7187801 SRR7850037 SRR7401698 SRR8172598 SRR7358274 SRR6053011 SRR7879233 SRR7866107 SRR7367547 SRR7221193 SRR7251437 SRR7251048 SRR7942550 SRR7866065 SRR7842055 SRR7839334 SRR7842112 SRR7588582 SRR7407912 SRR7521263 SRR8032416 SRR7191711 SRR7367218 SRR7905905 SRR8146030 SRR7521182 SRR7277698 SRR6052735 SRR6052741 SRR7367401 SRR7290976 SRR7286578 SRR7367549 SRR7250999 SRR7265740 SRR7221192 SRR7285524 SRR6052814 SRR6052826 SRR6052860 SRR6079419 SRR7367202 SRR7367381 SRR7236633 SRR7291584 SRR6052911 SRR6052124 SRR6052736 SRR7167526 SRR7163915 SRR7221343 SRR7266826 SRR7850158 SRR7833068 SRR7850236 SRR7839340 SRR7367531 SRR7187004 SRR7186954 SRR7358276 SRR6052849 SRR7286253 SRR7892525 SRR7850011 SRR7842089 SRR7291636 SRR6052781 SRR6052807 SRR7850047 SRR6053007 SRR7850018 SRR7291673 SRR7221250 SRR8185653 SRR8176612 SRR8116858 SRR7944812 SRR7892532 SRR7975962 SRR8032847 SRR8270038 SRR6052840 SRR7274961 SRR7659088 SRR7659093 SRR7659094 SRR7659109 SRR7833094 SRR7850195 SRR7836287 SRR7659090 SRR7826973 SRR7839359 SRR7659092 SRR7659110 SRR7659112 SRR7826948 SRR7826970 SRR7659113 SRR7659091 SRR7659107 SRR7842111 SRR7849991 SRR7659087 SRR7215278 SRR7184207 SRR7184382 SRR7186899 SRR7187889 SRR7401634 SRR7850189 SRR7842195 SRR7367584 SRR7244239 SRR8097438 SRR7250960 SRR6052913 SRR7250977 SRR7223127 SRR6079433 SRR7215519 SRR6079415 SRR7244294 SRR7367454 SRR7407935 SRR7588683 SRR6079441 SRR6052878 SRR7285546 SRR7826924 SRR7249679 SRR7942547 SRR7879234 SRR7850020 SRR7838538 SRR7833101 SRR7849996 SRR7474135 SRR7221035 SRR7286588 SRR7286630 SRR7204298 SRR6052823 SRR7179947 SRR6052853 SRR6052895 SRR7187765 SRR6079446 SRR6053031 SRR6053010 SRR6053017 SRR6052869 SRR7367232 SRR7974364 SRR8237813 SRR6079447 SRR7286173 SRR7230313 SRR6079428 SRR7283816 SRR7866088 
SRR7975960 SRR8185539 SRR7184327 SRR7944810 SRR8032833 SRR8052166 SRR6053032 SRR6001336 SRR6052779 SRR7842048 SRR7842050 SRR7367365 SRR8032843 SRR7849987 SRR8149128 SRR7285703 SRR8291890 SRR8240407 SRR7866078 SRR7285409 SRR7257041 SRR7251463 SRR7163885 SRR7849990 SRR7286628 SRR7286379 SRR7167525 SRR6001305 SRR6052775 SRR7401693 SRR6052794 SRR7265979 SRR7244293 SRR7276946 SRR7367379 SRR7368879 SRR8172593 SRR7850157 SRR7251474 SRR7251488 SRR7849997 SRR7172298 SRR6052844 SRR7975965 SRR7230329 SRR7265868 SRR7827065 SRR7401642 SRR7208791 SRR7184359 SRR7286371 SRR7187875 SRR7892540 SRR7850218 SRR7474061 SRR7407946 SRR7367469 SRR6052710 SRR7416058 SRR8116846 SRR7850072 SRR7407982 SRR7285478 SRR7285509 SRR7285451 SRR7367541 SRR6053000 SRR7367576 SRR7516051 SRR7283821 SRR7866071 SRR7221091 SRR7187839 SRR6052909 SRR7191463 SRR6052764 SRR7265839 SRR6052906 SRR7184208 SRR8182748 SRR8097432 SRR8263041 SRR7850172 SRR7251452 SRR7290990 SRR7842138 SRR7659108 SRR7836298 SRR7826918 SRR7834592 SRR7659111 SRR7659089 SRR7659106 SRR7236626 SRR7892535 SRR7282609 SRR7286400 SRR7401632 SRR7285444 SRR7223088 SRR7285552 SRR7866087 SRR7474057 SRR7416051 SRR7204337 SRR7942552 SRR7223161 SRR7401659 SRR6052914 SRR6001300 SRR8235462 SRR6052887 SRR6052754 SRR7474049 SRR7223135 SRR6052870 SRR6052770 SRR6052776 SRR7163886 SRR7184350 SRR7230302 SRR8086698 SRR7842101 SRR7274859 SRR6052778 SRR6052718 SRR6052102 SRR7290992 SRR6052756

This is the content of taxons.txt: 83334

bayraktar1 commented 3 months ago

Hi @lane66,

This error is caused by the 83334 taxon ID, which belongs to the NCBI rank “serotype”. Currently, the pipeline only works with the species rank and above. I will improve this soon so that the tool can also retrieve ranks like “strain” and “serotype”. If you still want to download the accessions, you can leave the taxons.txt empty for now.
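To illustrate the limitation, here is a minimal sketch of a pre-flight check that flags taxon IDs whose rank lies below species before the pipeline is run. Everything in it is hypothetical: `SUPPORTED_RANKS` is an illustrative subset rather than the pipeline's actual list, `EXAMPLE_RANKS` is a hand-written dictionary standing in for a real taxonomy lookup, and `unsupported_taxa` is not a function of the tool.

```python
# Illustrative subset of ranks at species level and above (assumption,
# not taken from the pipeline's source).
SUPPORTED_RANKS = {"species", "genus", "family", "order", "class", "phylum"}

# Stand-in for a real taxonomy lookup. 83334 is the serotype-rank ID from
# this thread; 562 (Escherichia coli, rank "species") is added for contrast.
EXAMPLE_RANKS = {83334: "serotype", 562: "species"}


def unsupported_taxa(taxon_ids, rank_of):
    """Return (taxon_id, rank) pairs whose rank is not in SUPPORTED_RANKS."""
    flagged = []
    for taxon_id in taxon_ids:
        rank = rank_of.get(taxon_id, "unknown")
        if rank not in SUPPORTED_RANKS:
            flagged.append((taxon_id, rank))
    return flagged
```

With the example data, `unsupported_taxa([83334, 562], EXAMPLE_RANKS)` flags only 83334, which matches the failure described above.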

lane66 commented 3 months ago

Thank you very much for your help! I have cleared the contents of taxons.txt and retained the contents of accessions.txt, but it still didn't work.

Error message:

Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Creating conda environment workflow/envs/stats_notebook.yml...
Downloading and installing remote packages.
Environment for /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/workflow/envs/stats_notebook.yml created (location: .snakemake/conda/efe391711166ea8c94168e241b824d97)
Creating conda environment workflow/envs/Renv.yml...
Downloading and installing remote packages.
Environment for /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/workflow/envs/Renv.yml created (location: .snakemake/conda/3632384e6b33db068b67efa1292d052b)
Creating conda environment workflow/envs/metadata_notebook.yaml...
Downloading and installing remote packages.
Environment for /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/workflow/envs/metadata_notebook.yaml created (location: .snakemake/conda/172d91faeddf74aca764e0e713a528e2)
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job stats:
job                 count
----------------  -------
all                     1
download_SRAdb          1
platform_stats          1
query_ncbi              1
wrangle_metadata        1
total                   5

Select jobs to execute...
Execute 1 jobs...

[Mon May 20 12:37:56 2024]
localrule download_SRAdb:
    output: Data/SRAmetadb.sqlite
    log: logs/download_db/download_SRAdb.log
    jobid: 3
    reason: Missing output files: Data/SRAmetadb.sqlite
    resources: tmpdir=/scratch/18864336, runtime=60, partition=cpu, ntasks=1, cpus_per_task=1

( wget https://gbnci.cancer.gov/backup/SRAmetadb.sqlite.gz -P Data/ && gzip -d Data/SRAmetadb.sqlite.gz ) >logs/download_db/download_SRAdb.log 2>&1
[Mon May 20 12:54:24 2024]
Finished job 3.
1 of 5 steps (20%) done
Select jobs to execute...
Execute 1 jobs...

[Mon May 20 12:54:24 2024]
localrule query_ncbi:
    input: Data/SRAmetadb.sqlite
    output: results/SRA.feather
    log: logs/query_ncbi/query_ncbi.log
    jobid: 2
    reason: Missing output files: results/SRA.feather; Input files updated by another job: Data/SRAmetadb.sqlite
    resources: tmpdir=/scratch/18864336, runtime=60, partition=cpu, ntasks=1, cpus_per_task=1

    (workflow/scripts/retrieve_NCBI_metadata.R \
        --database Data/SRAmetadb.sqlite \
        --taxon_id_file Data/taxons.txt \
        --accession_file Data/accessions.txt \
        --output results/SRA.feather) >logs/query_ncbi/query_ncbi.log 2>&1

Activating conda environment: .snakemake/conda/3632384e6b33db068b67efa1292d052b_
[Mon May 20 12:56:41 2024]
Finished job 2.
2 of 5 steps (40%) done
Select jobs to execute...
Execute 1 jobs...

[Mon May 20 12:56:41 2024]
localrule wrangle_metadata:
    input: results/SRA.feather
    output: results/metadata.csv, results/clean_tsv.tsv
    log: logs/wrangle_metadata/processed_notebook.ipynb
    jobid: 1
    reason: Missing output files: results/metadata.csv; Input files updated by another job: results/SRA.feather
    resources: tmpdir=/scratch/18864336, runtime=60, partition=cpu, ntasks=1, cpus_per_task=1, mem_mb=8000, mem_mib=7630, max_mb=16000

Activating conda environment: .snakemake/conda/172d91faeddf74aca764e0e713a528e2_
0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
Traceback (most recent call last):
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/bin/jupyter-nbconvert", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/jupyter_core/application.py", line 283, in launch_instance
    super().launch_instance(argv=argv, **kwargs)
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbconvert/nbconvertapp.py", line 420, in start
    self.convert_notebooks()
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbconvert/nbconvertapp.py", line 597, in convert_notebooks
    self.convert_single_notebook(notebook_filename)
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbconvert/nbconvertapp.py", line 563, in convert_single_notebook
    output, resources = self.export_single_notebook(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbconvert/nbconvertapp.py", line 487, in export_single_notebook
    output, resources = self.exporter.from_filename(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbconvert/exporters/exporter.py", line 201, in from_filename
    return self.from_file(f, resources=resources, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbconvert/exporters/exporter.py", line 220, in from_file
    return self.from_notebook_node(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbconvert/exporters/notebook.py", line 36, in from_notebook_node
    nb_copy, resources = super().from_notebook_node(nb, resources, **kw)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbconvert/exporters/exporter.py", line 154, in from_notebook_node
    nb_copy, resources = self._preprocess(nb_copy, resources)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbconvert/exporters/exporter.py", line 353, in _preprocess
    nbc, resc = preprocessor(nbc, resc)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbconvert/preprocessors/base.py", line 48, in __call__
    return self.preprocess(nb, resources)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbconvert/preprocessors/execute.py", line 103, in preprocess
    self.preprocess_cell(cell, resources, index)
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbconvert/preprocessors/execute.py", line 124, in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/jupyter_core/utils/__init__.py", line 165, in wrapped
    return loop.run_until_complete(inner)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbclient/client.py", line 1062, in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
  File "/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/nbclient/client.py", line 918, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:

file_path = '../../results/SRA.feather'

data = feather.read_feather(file_path)

data = feather.read_feather(snakemake.input[0])

metadata_df = pd.DataFrame(data)
metadata_df = metadata_df.convert_dtypes()
metadata_df.set_index('run_accession', inplace=True)

print(f'---Number of rows: {metadata_df.shape[0]}, Number of columns: {metadata_df.shape[1]}---')
metadata_df.head()


KeyError                                  Traceback (most recent call last)
/scratch/18864336/ipykernel_2813461/1824482230.py in ?()
      3 data = feather.read_feather(snakemake.input[0])
      4
      5 metadata_df = pd.DataFrame(data)
      6 metadata_df = metadata_df.convert_dtypes()
----> 7 metadata_df.set_index('run_accession', inplace=True)
      8
      9 print(f'---Number of rows: {metadata_df.shape[0]}, Number of columns: {metadata_df.shape[1]}---')
     10 metadata_df.head()

/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2/lib/python3.12/site-packages/pandas/core/frame.py in ?(self, keys, drop, append, inplace, verify_integrity)
   6118                     if not found:
   6119                         missing.append(col)
   6120
   6121         if missing:
-> 6122             raise KeyError(f"None of {missing} are in the columns")
   6123
   6124         if inplace:
   6125             frame = self

KeyError: "None of ['run_accession'] are in the columns"

RuleException:
CalledProcessError in file /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/workflow/metadata.smk, line 76:
Command 'source /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/envs/panaroo/bin/activate '/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2'; set -euo pipefail; jupyter-nbconvert --log-level ERROR --execute --output /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/logs/wrangle_metadata/processed_notebook.ipynb --to notebook --ExecutePreprocessor.timeout=-1 /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/scripts/tmphyfhr_xy.wrangle_NCBI_metadata.py.ipynb' returned non-zero exit status 1.
[Mon May 20 12:56:48 2024]
Error in rule wrangle_metadata:
    jobid: 1
    input: results/SRA.feather
    output: results/metadata.csv, results/clean_tsv.tsv
    log: logs/wrangle_metadata/processed_notebook.ipynb (check log file(s) for error details)
    conda-env: /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-20T122915.670595.snakemake.log
WorkflowError:
At least one job did not complete successfully.

bayraktar1 commented 3 months ago

Please make an effort to properly format the logs and include only the relevant parts.

lane66 commented 3 months ago

Hi, this is the relevant content in the log:

[Mon May 20 12:54:24 2024]
Finished job 3.
1 of 5 steps (20%) done
Select jobs to execute...
Execute 1 jobs...

[Mon May 20 12:54:24 2024]
localrule query_ncbi:
    input: Data/SRAmetadb.sqlite
    output: results/SRA.feather
    log: logs/query_ncbi/query_ncbi.log
    jobid: 2
    reason: Missing output files: results/SRA.feather; Input files updated by another job: Data/SRAmetadb.sqlite
    resources: tmpdir=/scratch/18864336, runtime=60, partition=cpu, ntasks=1, cpus_per_task=1

    (workflow/scripts/retrieve_NCBI_metadata.R \
        --database Data/SRAmetadb.sqlite \
        --taxon_id_file Data/taxons.txt \
        --accession_file Data/accessions.txt \
        --output results/SRA.feather) >logs/query_ncbi/query_ncbi.log 2>&1

Activating conda environment: .snakemake/conda/3632384e6b33db068b67efa1292d052b_
[Mon May 20 12:56:41 2024]
Finished job 2.
2 of 5 steps (40%) done
Select jobs to execute...
Execute 1 jobs...

[Mon May 20 12:56:41 2024]
localrule wrangle_metadata:
    input: results/SRA.feather
    output: results/metadata.csv, results/clean_tsv.tsv
    log: logs/wrangle_metadata/processed_notebook.ipynb
    jobid: 1
    reason: Missing output files: results/metadata.csv; Input files updated by another job: results/SRA.feather
    resources: tmpdir=/scratch/18864336, runtime=60, partition=cpu, ntasks=1, cpus_per_task=1, mem_mb=8000, mem_mib=7630, max_mb=16000

Activating conda environment: .snakemake/conda/172d91faeddf74aca764e0e713a528e2_
RuleException:
CalledProcessError in file /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/workflow/metadata.smk, line 76:
Command 'source /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/envs/panaroo/bin/activate '/hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2'; set -euo pipefail; jupyter-nbconvert --log-level ERROR --execute --output /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/logs/wrangle_metadata/processed_notebook.ipynb --to notebook --ExecutePreprocessor.timeout=-1 /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/scripts/tmphyfhr_xy.wrangle_NCBI_metadata.py.ipynb' returned non-zero exit status 1.
[Mon May 20 12:56:48 2024]
Error in rule wrangle_metadata:
    jobid: 1
    input: results/SRA.feather
    output: results/metadata.csv, results/clean_tsv.tsv
    log: logs/wrangle_metadata/processed_notebook.ipynb (check log file(s) for error details)
    conda-env: /hpc/local/CentOS7/uu_vet_iras/pliu_anaconda/reconstruct_plasmids_snakemake/.snakemake/conda/172d91faeddf74aca764e0e713a528e2

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-20T122915.670595.snakemake.log
WorkflowError:
At least one job did not complete successfully.

bayraktar1 commented 3 months ago

This error occurs because the SRA.feather file is empty. This means that the database did not contain any of the accessions you submitted.
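One way to make this failure easier to read is to check for an empty result before indexing. The snippet below is only a sketch under assumptions: `index_by_run_accession` is a hypothetical helper, not code from the pipeline's notebook, and it presumes the feather file has already been loaded into a pandas DataFrame.

```python
import pandas as pd


def index_by_run_accession(data: pd.DataFrame) -> pd.DataFrame:
    """Index metadata by run accession, failing early on an empty query result.

    Hypothetical guard for illustration; not part of the pipeline.
    """
    # An empty query result has no 'run_accession' column, which is what
    # triggers the KeyError seen in this thread.
    if data.empty or "run_accession" not in data.columns:
        raise ValueError(
            "The query returned no runs: none of the submitted accessions "
            "or taxon IDs were found in the database."
        )
    return data.convert_dtypes().set_index("run_accession")
```

Raising a descriptive `ValueError` here replaces the pandas `KeyError` with a message that points at the inputs rather than at the notebook.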

I checked the database manually for a couple of samples you provided, and they were not present. They do, however, seem to be findable on the NCBI website. The studies related to the runs seem to be in the database as well. For example, SRR7850007 is part of the SRP071789 study, and that study has 500 runs in the database.

So this seems to be an issue with the NCBI and the SRA database dump, which I cannot do anything about. Maybe the accessions for these studies were updated recently, and the database dump has not been updated by the NCBI yet. If all the accessions are from the same studies, you could try using the study accessions instead.
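A manual check of this kind can be sketched with Python's built-in `sqlite3` module. The table name `sra` and the columns `run_accession` and `study_accession` are assumptions about the layout of SRAmetadb.sqlite and should be verified against the actual file (for example with `.schema` in the sqlite3 shell). Both function names are hypothetical.

```python
import sqlite3


def runs_in_database(conn, run_accessions):
    """Return {run_accession: study_accession} for the runs that are present.

    Assumes a table `sra` with `run_accession` and `study_accession` columns.
    """
    found = {}
    # Query in chunks to stay well below SQLite's bound-parameter limit.
    for start in range(0, len(run_accessions), 500):
        chunk = list(run_accessions[start:start + 500])
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            "SELECT run_accession, study_accession FROM sra "
            f"WHERE run_accession IN ({placeholders})",
            chunk,
        )
        found.update(rows.fetchall())
    return found


def runs_per_study(conn, study_accession):
    """Count the distinct runs the database lists for one study accession."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT run_accession) FROM sra WHERE study_accession = ?",
        (study_accession,),
    ).fetchone()
    return row[0]
```

Opening the dump with `sqlite3.connect("Data/SRAmetadb.sqlite")` and passing the accessions from accessions.txt to `runs_in_database` would show which runs are missing, and `runs_per_study` gives the run count for a study such as SRP071789.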