Jtrachsel / gifrop

Genomic Islands From Roary Pangenomes
GNU General Public License v2.0

ERROR: 'All_islands.fasta' does not exist, or is unreadable #9

Open yl030 opened 1 month ago

yl030 commented 1 month ago

Hello gifrop team, I installed gifrop following the manual installation instructions, and when I run gifrop --get_islands I get the following output:

This is gifrop 0.0.9
command issued: /gss1/App_os7/miniconda3/envs/gifrop/bin/gifrop --get_islands
===== Dependencies check =====
parallel .... good
abricate .... good
Rscript .... good
find .... good
[1] "All required R packages were detected"
/gss2/home_new/xuefeng01/gff/gene_presence_absence.csv exist
found 3299 .gff files
WRANGLING SEQUENCE DATA...
making shortened gffs...
Academic tradition requires you to cite works you base your article on. If you use programs that use GNU Parallel to process data for an article in a scientific publication, please cite:

Tange, O. (2024, May 22). GNU Parallel 20240522 ('Tbilisi'). Zenodo. https://doi.org/10.5281/zenodo.11247979

This helps funding further development; AND IT WON'T COST YOU A CENT. If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

More about funding GNU Parallel and the citation notice: https://www.gnu.org/software/parallel/parallel_design.html#citation-notice

To silence this citation notice: run 'parallel --citation' once.

found 3299 .gff files
extracting fastas from prokka gffs...

DONE WRANGLING SEQUENCE DATA
EXECUTING Rscript 'gifrop_id.R'
[1] "loading packages"
Warning message:
package ‘dplyr’ was built under R version 4.2.3
Warning message:
package ‘tidyr’ was built under R version 4.2.3
Warning message:
package ‘readr’ was built under R version 4.2.3
Warning message:
package ‘purrr’ was built under R version 4.2.3
[1] "done loading packages"
Warning message:
One or more parsing issues, call problems() on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)
[1] "reading in gffs..."
Joining with by = join_by(seqid, locus_tag)
Error in left_join():
! This join would result in more rows than dplyr can handle.
5723840911 rows would be returned. 2147483647 rows is the maximum number allowed.
Double check your join keys. This error commonly occurs due to a missing join key, or an improperly specified join condition.
Backtrace:
▆

  1. ├─... %>% select(genome, seqid, seqid_loc_tags)
  2. ├─dplyr::select(., genome, seqid, seqid_loc_tags)
  3. ├─dplyr::ungroup(.)
  4. ├─tidyr::nest(., seqid_loc_tags = c(locus_tag, loc_tag_order))
  5. ├─dplyr::select(., genome, seqid, locus_tag, loc_tag_order)
  6. ├─dplyr::left_join(., loc_tag_orders)
  7. ├─dplyr:::left_join.data.frame(., loc_tag_orders)
  8. │ └─dplyr:::join_mutate(...)
  9. │ └─dplyr:::join_rows(...)
 10. │ └─dplyr:::dplyr_locate_matches(...)
 11. │ ├─base::withCallingHandlers(...)
 12. │ └─vctrs::vec_locate_matches(...)
 13. ├─vctrs:::stop_matches_overflow(size = 5723840911, call = <env>)
 14. │ └─vctrs:::stop_matches(...)
 15. │ └─vctrs:::stop_vctrs(...)
 16. │ └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)
 17. │ └─rlang:::signal_abort(cnd, .file)
 18. │ └─base::signalCondition(cnd)
 19. └─dplyr (local) <fn>(<vctrs___>)
 20. └─dplyr:::rethrow_error_join_matches_overflow(cnd, error_call)
 21. └─dplyr:::stop_join(...)
 22. └─dplyr:::stop_dplyr(...)
 23. └─rlang::abort(...)
Execution halted
DONE EXECUTING 'gifrop_id.R'
RUNNING ABRICATE ON THE ISLANDS
Using nucl database ncbi: 5386 sequences - 2023-Nov-4
Processing: All_islands.fasta
ERROR: 'All_islands.fasta' does not exist, or is unreadable

Jtrachsel commented 1 month ago

Hello!

It looks like you are working with a pretty large pangenome. Unfortunately, gifrop isn't designed for use on very large pangenomes.

This portion of the error message is the real issue:

Error in left_join():
! This join would result in more rows than dplyr can handle.
5723840911 rows would be returned. 2147483647 rows is the maximum number
allowed.
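
For context, the limit in that message, 2147483647, is 2^31 - 1. A join repeats each left-hand row once for every right-hand row with the same key, so the output can grow much faster than either input. The sketch below illustrates that arithmetic in Python on toy data. It is not gifrop's code: the function name `left_join_row_count` and the example keys are hypothetical, and whether repeated (seqid, locus_tag) keys are what inflates the join in this particular run is an assumption the thread does not confirm.

```python
from collections import Counter

# Row limit quoted in the dplyr error message (2**31 - 1).
MAX_ROWS = 2**31 - 1


def left_join_row_count(left_keys, right_keys):
    """Return the number of rows a left join on these keys would produce.

    Each left row is repeated once per matching right row; a left row
    with no match still contributes one row.
    """
    right_counts = Counter(right_keys)
    return sum(max(right_counts[key], 1) for key in left_keys)


# Hypothetical toy keys: (seqid, locus_tag) pairs, with one key repeated
# on both sides of the join.
left = [("contig_1", "tag_1")] * 3 + [("contig_2", "tag_9")]
right = [("contig_1", "tag_1")] * 4

rows = left_join_row_count(left, right)
print(rows, rows > MAX_ROWS)
```

Counting keys this way before joining gives a rough idea of whether a smaller set of genomes would stay under the limit.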

My recommendation is to reduce the size of the pangenome you are working with, perhaps by focusing on a subset of genomes you are interested in. Otherwise, you may need to consider a different tool designed for very large datasets. I've had good luck with PPanGGOLiN, though you would need to perform some of gifrop's classification steps manually.
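
One way to set up such a subset is to symlink only the annotation files of interest into a fresh directory. The following is a minimal sketch and not part of gifrop: the helper `link_gff_subset`, the directory names, and the genome names are all hypothetical. It also assumes each genome is stored as `<name>.gff`, and that the pangenome (and its gene_presence_absence.csv) would be regenerated for the subset before running gifrop again.

```python
from pathlib import Path


def link_gff_subset(source_dir, dest_dir, genome_names):
    """Symlink the .gff files for the chosen genomes into dest_dir.

    Raises FileNotFoundError if a requested genome has no .gff file,
    so a typo in a name does not silently shrink the subset.
    """
    source = Path(source_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    linked = []
    for name in genome_names:
        gff = source / f"{name}.gff"
        if not gff.is_file():
            raise FileNotFoundError(f"no annotation found for {name}: {gff}")
        target = dest / gff.name
        # Link rather than copy, so the subset costs almost no disk space.
        target.symlink_to(gff.resolve())
        linked.append(target)
    return linked
```

For example, `link_gff_subset("gff", "gff_subset", ["genome_A", "genome_B"])` would create `gff_subset/genome_A.gff` and `gff_subset/genome_B.gff` as links to the originals.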

yl030 commented 1 month ago

Thanks, Jtrachsel!