conchoecia / odp

oxford dot plots
GNU General Public License v3.0

Jobs/rules are running out of sequence and causing input file errors #47

Closed: changsarahl closed this issue 9 months ago

changsarahl commented 1 year ago

Hi, thank you for this software. I was wondering whether there was a way to ensure that all of the rules ran in sequence for species pairs?

I am attempting to run this with a large number of species (approx. 40), and I am running into errors when certain input files are missing. The pipeline is being run on 60 cores, and the snakemake log file is attached.

Currently, I am working around this by specifying rules to be run one by one in snakemake, but please let me know if there is another way.

Command used: snakemake -r -p --cores 60 --snakefile /home/krablab/odp/scripts/odp

2023-07-13T224127.325685.snakemake.txt

conchoecia commented 1 year ago

Hello!

I was wondering whether there was a way to ensure that all of the rules ran in sequence for species pairs?

I'm not completely sure what you mean, but when snakemake reports "100% complete" at the end of a run, or prints "Nothing to be done" when you re-run the same command in the same directory, it means that all of the jobs completed successfully.

Currently, I am working around this by specifying rules to be run one by one in snakemake, but please let me know if there is another way.

I am not sure what you mean, sorry! The command you posted, snakemake -r -p --cores 60 --snakefile /home/krablab/odp/scripts/odp, will perform all of the comparisons of the species in the yaml file.

It looks like the run failed somewhere and didn't complete:

...
Finished job 3623.
2774 of 7817 steps (35%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-07-13T224127.325685.snakemake.log

To figure out what went wrong you'll have to re-run snakemake and paste the errors here. The log file that you pasted only contains the stdout and not the stderr, so I can't see what the errors are. Try again, this time keeping the stderr as well (for example by copying the terminal output or redirecting it with 2>&1), and maybe I can pinpoint what went wrong.

changsarahl commented 1 year ago

Hi, thank you for the quick reply. This seems to be the error: the filtered_D_FET_rbh rule is the last thing that fails to run.

snakemake -r -p --cores 40 --snakefile /mnt/krab1/SLC_data/odp/scripts/odp
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Job stats:
job                   count    min threads    max threads
------------------  -------  -------------  -------------
all                       1              1              1
filtered_D_FET_rbh      741              1              1
total                   742              1              1

Select jobs to execute...

[Mon Jul 24 10:31:44 2023]
rule filtered_D_FET_rbh:
    input: odp/step1-rbh/Ccha_Mamb_reciprocal_best_hits.D.FET.rbh
    output: odp/step1-rbh-filtered/Ccha_Mamb_reciprocal_best_hits.D.FET.filt.rbh
    jobid: 3853
    reason: Missing output files: odp/step1-rbh-filtered/Ccha_Mamb_reciprocal_best_hits.D.FET.filt.rbh
    wildcards: analysis=Ccha_Mamb
    resources: tmpdir=/tmp

RuleException:
TypeError in file /mnt/krab1/SLC_data/odp/scripts/odp, line 1839:
Must provide 'func' or tuples of '(column, aggfunc).
  File "/mnt/krab1/SLC_data/odp/scripts/odp", line 1839, in __rule_filtered_D_FET_rbh
  File "/home/krablab/.local/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 865, in aggregate
  File "/home/krablab/.local/lib/python3.8/site-packages/pandas/core/apply.py", line 1260, in reconstruct_func
  File "/home/krablab/Documents/apps/smcpp/envs/odp/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

changsarahl commented 1 year ago

Also, is the output of filtered_D_FET_rbh required to create the ribbon plots? I am unsure whether the directory that needs to be specified in the ribbon_plot config.yaml file is the step1-rbh directory or the step2-figures/synteny-nocolor directory.

Thanks!

arcosintan commented 1 year ago

I'm having a similar issue, also in rule filtered_D_FET_rbh. It seems to be caused by a missing argument to the .agg() call on line 1839 of the odp script, and I don't know how to modify it. A minimal sketch of the pandas behaviour follows; my error log and config file are below that.
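
For reference, this TypeError comes straight from pandas and can be reproduced outside odp with a minimal groupby. The table, column names, and aggregations in the sketch are made up for illustration and are not the actual code on line 1839 of the odp script:

pandas_agg_example.py (hypothetical):

import pandas as pd

# Toy table standing in for a parsed .rbh file; the real columns differ.
df = pd.DataFrame({
    "analysis": ["Ccha_Mamb", "Ccha_Mamb", "Myes_Pmax"],
    "hits":     [10, 20, 30],
})

grouped = df.groupby("analysis")

# With recent pandas, calling .agg() with no function, or with keyword
# arguments that are not (column, aggfunc) tuples, raises the error from
# the log:
#   TypeError: Must provide 'func' or tuples of '(column, aggfunc).
try:
    grouped.agg()
except TypeError as err:
    print(err)

# Passing an explicit aggregation avoids the error, either as a plain
# function name...
print(grouped.agg("sum"))

# ...or as pandas named aggregation with (column, aggfunc) tuples.
print(grouped.agg(total_hits=("hits", "sum")))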

err:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job stats:
job                   count    min threads    max threads
------------------  -------  -------------  -------------
all                       1              1              1
filtered_D_FET_rbh        1              1              1
total                     2              1              1

Select jobs to execute...

[Thu Jul 27 10:45:54 2023]
rule filtered_D_FET_rbh:
    input: odp/step1-rbh/Myes_Pmax_reciprocal_best_hits.D.FET.rbh
    output: odp/step1-rbh-filtered/Myes_Pmax_reciprocal_best_hits.D.FET.filt.rbh
    jobid: 11
    reason: Missing output files: odp/step1-rbh-filtered/Myes_Pmax_reciprocal_best_hits.D.FET.filt.rbh
    wildcards: analysis=Myes_Pmax
    resources: tmpdir=/tmp

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Select jobs to execute...
[Thu Jul 27 10:45:55 2023]
Error in rule filtered_D_FET_rbh:
    jobid: 0
    input: odp/step1-rbh/Myes_Pmax_reciprocal_best_hits.D.FET.rbh
    output: odp/step1-rbh-filtered/Myes_Pmax_reciprocal_best_hits.D.FET.filt.rbh

RuleException:
TypeError in file /home/bio_soft/odp/scripts/odp, line 1839:
Must provide 'func' or tuples of '(column, aggfunc).
  File "/home/bio_soft/odp/scripts/odp", line 1839, in __rule_filtered_D_FET_rbh
  File "/home/miniconda3/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 1265, in aggregate
  File "/home/miniconda3/lib/python3.10/site-packages/pandas/core/apply.py", line 1198, in reconstruct_func
  File "/home/miniconda3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-07-27T104553.171315.snakemake.log

config.yaml:

#
# This is an example config file for odp/scripts/odp
#
# # To use this software first copy this config file to your analysis directory
# cp odp/example_configs/CONFIG_odp.yaml ./config.yaml
# # Then modify the config file to include your own data
# vim config.yaml
# # Then run the pipeline
# snakemake -r -p --snakefile odp/scripts/odp

ignore_autobreaks: True       # Skip steps to find breaks in synteny blocks
diamond_or_blastp: "diamond"  # "diamond" or "blastp"
duplicate_proteins: "pass"    # currently only "fail" or "best". Fail doesn't allow duplicate names or seqs
plot_LGs: True                # Plot the ALGs based on the installed databases
plot_sp_sp: True              # Plot the synteny between two species, if False just generates .rbh files

species:
  Pmax:
    proteins: Pmax.lfaa
    chrom: Pmax.chrom
    genome: Pmax.fna
    minscafsize: 3000000  # Only plots scaffolds that are 3 Mbp or longer
  Myes:
    proteins: Myes.lfaa
    chrom: Myes.chrom
    genome: Myes.fna
    minscafsize: 3000000  # Only plots scaffolds that are 3 Mbp or longer

Thanks for any help!

conchoecia commented 9 months ago

This should be working with the update https://github.com/conchoecia/odp/commit/ba6c45375c2bf5e75a9ff0779e5d75b5509209db

Please run git pull from within the odp directory to update, try again, and reopen this issue if you have the same problem.