akcorut / kGWASflow

kGWASflow is a Snakemake workflow for performing k-mers-based GWAS.
https://github.com/akcorut/kGWASflow/wiki
MIT License

Error in rule merge_kmers for the test. #27

Open Orz-CQ opened 8 months ago

Orz-CQ commented 8 months ago

Hi @akcorut,

The error occurred while I was testing this workflow with kgwasflow test -t 5 --conda-frontend mamba

The error log from Snakemake is:

[Tue Dec 26 16:25:16 2023]
Job 556: Merging outputs from two KMC k-mers counting results into one list for each sample/individual...
Reason: Missing output files: results/kmers_count/individual_53/kmers_with_strand

Activating conda environment: .snakemake/conda/2b06538935fcbe16d2cb6053889bd233_
Error: flag 00 should be equal to zero.
This is likely due to running the KMC non-canonized with -ci not 1
[Tue Dec 26 16:25:16 2023]
Error in rule merge_kmers:
    jobid: 556
    input: results/kmers_count/individual_53/output_kmc_canon.kmc_suf, results/kmers_count/individual_53/output_kmc_canon.kmc_pre, results/kmers_count/individual_53/output_kmc_all.kmc_suf, results/kmers_count/individual_53/output_kmc_all.kmc_pre, results/kmers_count/individual_53/kmc_canonical.done, results/kmers_count/individual_53/kmc_non-canonical.done, scripts/external/kmers_gwas/bin
    output: results/kmers_count/individual_53/kmers_with_strand, results/kmers_count/individual_53/kmers_add_strand_information.done
    log: logs/count_kmers/kmc/individual_53/add_strand.log.out (check log file(s) for error details)
    conda-env: /public/home/lanlan/software/kGWASflow/test_work/.snakemake/conda/2b06538935fcbe16d2cb6053889bd233_
    shell:

        export LD_LIBRARY_PATH=$CONDA_PREFIX/lib

        scripts/external/kmers_gwas/bin/kmers_add_strand_information -c results/kmers_count/individual_53/output_kmc_canon -n results/kmers_count/individual_53/output_kmc_all -k 25 -o results/kmers_count/individual_53/kmers_with_strand > logs/count_kmers/kmc/individual_53/add_strand.log.out

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Error! The Snakemake workflow aborted.

I also ran the failing command on its own, scripts/external/kmers_gwas/bin/kmers_add_strand_information -c results/kmers_count/individual_53/output_kmc_canon -n results/kmers_count/individual_53/output_kmc_all -k 25 -o results/kmers_count/individual_53/kmers_with_strand, and it returns:

Canonized kmers:        10242
Non-canon kmers:        4064
Non-canon kmers found:  4064
flag    0       count is        6252
flag    1       count is        1930
flag    2       count is        1986
flag    3       count is        74
Error: flag 00 should be equal to zero.
This is likely due to running the KMC non-canonized with -ci not 1
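
As a sanity check on the numbers above: the four flag counts add up exactly to the number of canonized k-mers, and flag 0 accounts for roughly 61% of them, so this is not a marginal failure. A minimal Python sketch of that arithmetic follows. The report text is copied from the output above; parse_report is a hypothetical helper name, and the regular expressions assume the output layout shown here rather than any documented format.

```python
import re

# Output copied from the manual run quoted above (error lines omitted).
REPORT = """\
Canonized kmers:        10242
Non-canon kmers:        4064
Non-canon kmers found:  4064
flag    0       count is        6252
flag    1       count is        1930
flag    2       count is        1986
flag    3       count is        74
"""


def parse_report(text):
    """Return (canonized_total, {flag: count}) parsed from the tool's output."""
    total = int(re.search(r"Canonized kmers:\s+(\d+)", text).group(1))
    pairs = re.findall(r"flag\s+(\d+)\s+count is\s+(\d+)", text)
    flags = {int(flag): int(count) for flag, count in pairs}
    return total, flags


total, flags = parse_report(REPORT)

# Every canonized k-mer lands in exactly one flag bucket.
print("flag counts sum to canonized total:", sum(flags.values()) == total)

# The tool aborts when this bucket is non-empty.
print("share of k-mers with flag 0:", flags[0] / total)
```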

Could you give me some suggestions?

Orz-CQ commented 8 months ago

Hi @akcorut and happy new year!

I have a suggestion: could you add a function that exports all the steps of the workflow into a bash file?

For example, for the same test,

kgwasflow test -t 5 --conda-frontend mamba

it would generate a bash file, and we could then run the whole pipeline in one go with bash XX.sh.

Moreover, could you combine all the required environments into a single conda YAML file?

Lan

brunacama93 commented 6 months ago

Hello @akcorut,

Bumping this thread because I'm experiencing the same issue with a test run on the E. coli dataset. I ran it with kgwasflow test -t 16 --snake-default

The dry run completes without errors.

@Orz-CQ, did you ever solve the issue yourself?

I'm also attaching the full log file: 2024-03-01T094921.431143.snakemake.log

Thanks in advance for your help!

[Fri Mar  1 10:17:55 2024]
Job 587: Merging outputs from two KMC k-mers counting results into one list for each sample/individual...
Reason: Missing output files: results/kmers_count/individual_81/kmers_with_strand

Activating conda environment: .snakemake/conda/c6c38832695d6d6755994dd3624fff4b_
Error: flag 00 should be equal to zero.
This is likely due to running the KMC non-canonized with -ci not 1
Error: flag 00 should be equal to zero.
This is likely due to running the KMC non-canonized with -ci not 1
[Fri Mar  1 10:17:56 2024]
Error in rule merge_kmers:
    jobid: 590
    input: results/kmers_count/individual_84/output_kmc_canon.kmc_suf, results/kmers_count/individual_84/output_kmc_canon.kmc_pre, results/kmers_count/individual_84/output_kmc_all.kmc_suf, results/kmers_count/individual_84/output_kmc_all.kmc_pre, results/kmers_count/individual_84/kmc_canonical.done, results/kmers_count/individual_84/kmc_non-canonical.done, scripts/external/kmers_gwas/bin
    output: results/kmers_count/individual_84/kmers_with_strand, results/kmers_count/individual_84/kmers_add_strand_information.done
    log: logs/count_kmers/kmc/individual_84/add_strand.log.out (check log file(s) for error details)
    conda-env: /scratch/bcama/my_directory/kgwas/.snakemake/conda/c6c38832695d6d6755994dd3624fff4b_
    shell:

        export LD_LIBRARY_PATH=$CONDA_PREFIX/lib

        scripts/external/kmers_gwas/bin/kmers_add_strand_information -c results/kmers_count/individual_84/output_kmc_canon -n results/kmers_count/individual_84/output_kmc_all -k 25 -o results/kmers_count/individual_84/kmers_with_strand > logs/count_kmers/kmc/individual_84/add_strand.log.out

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Fri Mar  1 10:17:56 2024]
Error in rule merge_kmers:
    jobid: 587
    input: results/kmers_count/individual_81/output_kmc_canon.kmc_suf, results/kmers_count/individual_81/output_kmc_canon.kmc_pre, results/kmers_count/individual_81/output_kmc_all.kmc_suf, results/kmers_count/individual_81/output_kmc_all.kmc_pre, results/kmers_count/individual_81/kmc_canonical.done, results/kmers_count/individual_81/kmc_non-canonical.done, scripts/external/kmers_gwas/bin
    output: results/kmers_count/individual_81/kmers_with_strand, results/kmers_count/individual_81/kmers_add_strand_information.done
    log: logs/count_kmers/kmc/individual_81/add_strand.log.out (check log file(s) for error details)
    conda-env: /scratch/bcama/my_directory/kgwas/.snakemake/conda/c6c38832695d6d6755994dd3624fff4b_
    shell:

        export LD_LIBRARY_PATH=$CONDA_PREFIX/lib

        scripts/external/kmers_gwas/bin/kmers_add_strand_information -c results/kmers_count/individual_81/output_kmc_canon -n results/kmers_count/individual_81/output_kmc_all -k 25 -o results/kmers_count/individual_81/kmers_with_strand > logs/count_kmers/kmc/individual_81/add_strand.log.out

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Error! The Snakemake workflow aborted.

yiwenwang25 commented 4 months ago

Hello! I ran into the same issue and was wondering whether anyone has solved it yet?

nikostr commented 3 months ago

The error comes from the kmer GWAS code found here https://github.com/voichek/kmersGWAS/blob/master/src/kmers_add_strand_information.cpp. The kmer GWAS manual clearly says to run the non-canonized count with -ci0, so I'm not sure what's going on with the error message here.

When I look at the log files for the preceding steps, it seems that in the cases where I get these errors, not all reads were processed in the non-canonical step. If I run snakemake --force results/kmers_count/individual_87/kmc_non-canonical.done for the individuals that fail and then run the rest of the pipeline, everything works.
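
To avoid picking the failing individuals out of the log by hand, the targets to force can be collected from the Snakemake log. This is only a sketch: failed_individuals and force_targets are hypothetical helper names, the parsing assumes the log layout quoted earlier in this thread, and the excerpt is abridged from those logs. It prints the commands rather than running them.

```python
import re


def failed_individuals(log_text, rule="merge_kmers"):
    """Return sample names from 'Error in rule <rule>:' blocks, in order seen."""
    names = []
    in_block = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Error in rule "):
            # Only follow error blocks that belong to the requested rule.
            in_block = stripped == f"Error in rule {rule}:"
        elif in_block and stripped.startswith("output:"):
            match = re.search(
                r"results/kmers_count/([^/\s,]+)/kmers_with_strand", stripped
            )
            if match and match.group(1) not in names:
                names.append(match.group(1))
            in_block = False
    return names


def force_targets(names):
    """Build the marker-file targets to pass to snakemake --force."""
    return [f"results/kmers_count/{name}/kmc_non-canonical.done" for name in names]


# Abridged from the log posted above (only the lines the parser reads).
EXCERPT = """\
Error in rule merge_kmers:
    jobid: 590
    output: results/kmers_count/individual_84/kmers_with_strand, results/kmers_count/individual_84/kmers_add_strand_information.done
Error in rule merge_kmers:
    jobid: 587
    output: results/kmers_count/individual_81/kmers_with_strand, results/kmers_count/individual_81/kmers_add_strand_information.done
"""

for target in force_targets(failed_individuals(EXCERPT)):
    print("snakemake --force", target)
```

Forcing the count again only hides the symptom if the input reads themselves are the problem, so it is worth checking the inputs as well.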

To me this is worrying. Why is the non-canonical step marked as complete when not all reads are processed? Could this lead to hard-to-detect errors where enough - but not all - reads are processed?

EDIT: I just went back to my log files and verified that I have cases where not all reads are processed in the canonical step, but the pipeline still runs. You probably want to take a look at this @akcorut.

nikostr commented 3 months ago

It turns out that KMC may not process all reads when the number of threads is limited. I've submitted an issue with KMC here https://github.com/refresh-bio/KMC/issues/235.

marekkokot commented 3 months ago

Hello there :) Thanks for using KMC. I responded in the issue you created. In this specific case (I mean the issue posted on the KMC repo), the cause is an ill-formed input FASTQ file (that is true for at least one file; I have not checked the remaining ones, but I guess it is the same case). I would like to point out that it's not that KMC may not process all reads; it's actually more like "undefined behaviour", so, for example, it is possible that not only are reads missing, but that some other parts of the file are treated as reads, etc. I know it would be nice if KMC had a better mechanism to detect the correctness of input files and possibly exit with an error message, but it's not trivial, and I don't expect to add this in the near future. Anyway, if you have other examples or questions I'm happy to assist, and thank you again for using KMC.
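
Until such a check exists, input files can be screened before counting. The following is a minimal Python sketch, not part of KMC or kGWASflow: check_fastq and open_text are hypothetical helper names, and it assumes plain four-line FASTQ records (no wrapped sequences), which is stricter than what some tools accept.

```python
import gzip
import io


def open_text(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def check_fastq(lines):
    """Return (complete_records, problems) for an iterable of FASTQ lines."""
    problems = []
    records = 0
    line_no = 0
    it = iter(lines)
    for header in it:
        start = line_no + 1
        rest = [next(it, None) for _ in range(3)]
        line_no += 4
        if None in rest:
            problems.append(f"record starting at line {start}: truncated")
            break
        header = header.rstrip("\r\n")
        seq, plus, qual = (s.rstrip("\r\n") for s in rest)
        if not header.startswith("@"):
            problems.append(f"line {start}: header does not start with '@'")
        if not plus.startswith("+"):
            problems.append(f"line {start + 2}: separator does not start with '+'")
        if len(seq) != len(qual):
            problems.append(
                f"record starting at line {start}: sequence length {len(seq)} "
                f"!= quality length {len(qual)}"
            )
        records += 1  # counts complete four-line groups, valid or not
    return records, problems


# Small in-memory demonstration; the second input is deliberately broken.
good = io.StringIO("@r1\nACGT\n+\nIIII\n")
bad = io.StringIO("@r1\nACGT\n+\nIII\n@r2\nAC\n")
good_result = check_fastq(good)
bad_result = check_fastq(bad)
print(good_result)
for problem in bad_result[1]:
    print(problem)
```

For a real file, something like `with open_text(path) as handle: check_fastq(handle)` would apply the same check. A file that passes is not guaranteed to be correct, but one that fails should be downloaded or converted again before running the workflow.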

Best Marek

nikostr commented 3 months ago

Thanks a ton for looking closer at this, @marekkokot! I should have looked at the test data before submitting an issue with KMC. I just ran the pipeline using the SRA data for the E. coli example and did not encounter this issue. However, I had some trouble getting the data to download correctly from SRA. Maybe that was the reason @brunacama93 got this error with the E. coli dataset?