akcorut / kGWASflow

kGWASflow is a Snakemake workflow for performing k-mers-based GWAS.
https://github.com/akcorut/kGWASflow/wiki
MIT License
kmer counting issue Segmentation fault #3

Closed: Kxpark closed this issue 1 year ago

Kxpark commented 1 year ago

Howdy, the pipeline runs all of the quality control steps, but it fails when it gets to k-mer counting. I am using the new version that you released yesterday, v1.1.

Thank you for the help!

akcorut commented 1 year ago

Hi @Kxpark,

This seems to be a memory issue. You are most likely giving this step more memory than you actually have. You might want to reconsider the number of threads and the amount of memory you are using. kmc shouldn't need too much memory. Are you running this on a cluster or on a local computer?

Best, Kivanc

Kxpark commented 1 year ago

I am running on a local computer. I thought I was being conservative with my allocation of cores, but I can try it even lower.

Kxpark commented 1 year ago

Even with 1 core, it still seems to have an issue. I have plenty of RAM, so that shouldn't be the limit for this step.

akcorut commented 1 year ago

You can also specify how much memory you want kmc to use with the -m parameter. To do that, add the parameter to the kmc extra params section of the config file:

https://github.com/akcorut/kGWASflow/blob/a8ad46421b7b64c54f7401fa5e95ad9324eee890/config/config.yaml#L144-L157

akcorut commented 1 year ago

@Kxpark Can you also share the content of one of the kmc_canon.log files to see if there is more information there?

Kxpark commented 1 year ago

Let me know if I am doing it wrong.

I appreciate the help

Kxpark commented 1 year ago

Here is one; however, most of them are empty.

K-Mer Counter (KMC) ver. 3.1.1 (2019-05-19)
Usage:
  kmc [options]
  kmc [options] <@input_file_names>
Parameters:
  input_file_name - single file in specified (-f switch) format (gziped or not)
  @input_file_names - file name with list of input files in specified (-f switch) format (gziped or not)
Options:
  -v - verbose mode (shows all parameter settings); default: false
  -k - k-mer length (k from 1 to 256; default: 25)
  -m - max amount of RAM in GB (from 1 to 1024); default: 12
  -sm - use strict memory mode (memory limit from -m switch will not be exceeded)
  -p - signature length (5, 6, 7, 8, 9, 10, 11); default: 9
  -f<a/q/m/bam> - input in FASTA format (-fa), FASTQ format (-fq), multi FASTA (-fm) or BAM (-fbam); default: FASTQ
  -ci - exclude k-mers occurring less than times (default: 2)
  -cs - maximal value of a counter (default: 255)
  -cx - exclude k-mers occurring more of than times (default: 1e9)
  -b - turn off transformation of k-mers into canonical form
  -r - turn on RAM-only mode
  -n - number of bins
  -t - total number of threads (default: no. of CPU cores)
  -sf - number of FASTQ reading threads
  -sp - number of splitting threads
  -sr - number of threads for 2nd stage
  -j - file name with execution summary in JSON format
  -w - without output
Example:
  kmc -k27 -m24 NA19238.fastq NA.res /data/kmc_tmp_dir/
  kmc -k27 -m24 @files.lst NA.res /data/kmc_tmp_dir/
kmc_canon.log (END)
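For anyone triaging many kmc_canon.log files at once, the two cases reported here (an empty log, or a log holding only kmc's help text) can be told apart with a small script. The following is a minimal sketch, not part of kGWASflow: the function name `classify_kmc_log` is hypothetical, and reading a help-only log as "kmc printed its usage instead of counting" is an interpretation of this thread rather than documented behavior.

```python
def classify_kmc_log(text: str) -> str:
    """Roughly classify the contents of a kmc log file.

    Returns one of:
      "empty" - nothing was written to the log
      "usage" - the log looks like kmc's help text
      "other" - anything else (for example, normal run statistics)
    """
    stripped = text.strip()
    if not stripped:
        return "empty"
    # The help text pasted in this thread contains both of these headings.
    # Treating their presence as "kmc printed usage" is an assumption.
    if "Usage:" in stripped and "Options:" in stripped:
        return "usage"
    return "other"
```

To use it, read each log with `pathlib.Path(path).read_text()` and pass the result in; a run of "usage" results would point toward a problem with the arguments passed to kmc, while "empty" logs say nothing either way.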

akcorut commented 1 year ago

kmc expects the value of the -m parameter in GB, so you need to specify it in GB:

-m - max amount of RAM in GB (from 1 to 1024); default: 12

Kxpark commented 1 year ago

I see. After changing it to "-m 2", I got this output:

image

akcorut commented 1 year ago

Can you share the content of logs/count_kmers/kmc/Syn276/kmc_canon.log?

Kxpark commented 1 year ago

K-Mer Counter (KMC) ver. 3.1.1 (2019-05-19)
Usage:
  kmc [options]
  kmc [options] <@input_file_names>
Parameters:
  input_file_name - single file in specified (-f switch) format (gziped or not)
  @input_file_names - file name with list of input files in specified (-f switch) format (gziped or not)
Options:
  -v - verbose mode (shows all parameter settings); default: false
  -k - k-mer length (k from 1 to 256; default: 25)
  -m - max amount of RAM in GB (from 1 to 1024); default: 12
  -sm - use strict memory mode (memory limit from -m switch will not be exceeded)
  -p - signature length (5, 6, 7, 8, 9, 10, 11); default: 9
  -f<a/q/m/bam> - input in FASTA format (-fa), FASTQ format (-fq), multi FASTA (-fm) or BAM (-fbam); default: FASTQ
  -ci - exclude k-mers occurring less than times (default: 2)
  -cs - maximal value of a counter (default: 255)
  -cx - exclude k-mers occurring more of than times (default: 1e9)
  -b - turn off transformation of k-mers into canonical form
  -r - turn on RAM-only mode
  -n - number of bins
  -t - total number of threads (default: no. of CPU cores)
  -sf - number of FASTQ reading threads
  -sp - number of splitting threads
  -sr - number of threads for 2nd stage
  -j - file name with execution summary in JSON format
  -w - without output
Example:
  kmc -k27 -m24 NA19238.fastq NA.res /data/kmc_tmp_dir/
  kmc -k27 -m24 @files.lst NA.res /data/kmc_tmp_dir/
kmc_canon.log (END)

akcorut commented 1 year ago

Can you try extra: "-m2" instead of "-m 2"?
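The two constraints discussed so far (the value is in GB within the 1 to 1024 range shown in the help text, and it is written attached to the flag as in kmc's own `-m24` example) can be checked before the string goes into the config. This is a minimal sketch; the helper name `kmc_memory_flag` is hypothetical and is not part of kmc or kGWASflow.

```python
def kmc_memory_flag(gigabytes: int) -> str:
    """Build a kmc memory option such as "-m2".

    The range check follows the help text quoted in this thread:
    "max amount of RAM in GB (from 1 to 1024)".
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(gigabytes, bool) or not isinstance(gigabytes, int):
        raise TypeError("memory must be a whole number of GB")
    if not 1 <= gigabytes <= 1024:
        raise ValueError("kmc accepts -m values from 1 to 1024 (GB)")
    # No space between the flag and the value, matching kmc's own example.
    return f"-m{gigabytes}"
```

For example, `kmc_memory_flag(2)` gives the string suggested above, while a value typed in MB, such as 2000, is rejected instead of being passed on to kmc.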

Kxpark commented 1 year ago

Here I tried "-m1", and I am running it on a single core.

image

Kxpark commented 1 year ago

I am continuing to get the same "Segmentation fault" even if I increase my cores to 8 and leave the "-m2" as is.

I am open to creative ideas if you have any.

I should have plenty of resources for this step.

akcorut commented 1 year ago

kmc will most likely need more than "-m1", depending on your data. How big is your data? Below is an example from the kmc developers to give you an idea of how much memory you might need:

In our experiments, KMC was able to count 28-mers in human genome sequencing data of gzipped size 614GB (>736 Gbases) in only 33GB of RAM.

https://github.com/refresh-bio/KMC/issues/174#issuecomment-962220455

Kxpark commented 1 year ago

Does it look at each individual alone or at the whole population? Either way, each individual is a few MB and the whole population of raw reads is maybe 120 GB.

Kxpark commented 1 year ago

I even increased it to "-m20" and am still getting similar results. I have 21 GB of free RAM, so it shouldn't be a problem.

Kxpark commented 1 year ago

I FIXED IT!

In /workflow/envs/kmc.yaml, I changed the kmc dependency from 3.1.1 to 3.2.1, the most recent release, and it is running now.
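The same version change can be scripted if the environment file needs patching on several machines. The sketch below works on the file's text only and assumes the dependency is written as a list entry like `- kmc =3.1.1`; the exact contents of kmc.yaml are not shown in this thread, so both that layout and the function name `bump_kmc_pin` are assumptions.

```python
import re


def bump_kmc_pin(env_yaml: str, new_version: str) -> str:
    """Replace the pinned kmc version in conda environment YAML text.

    Handles entries such as "- kmc =3.1.1", "- kmc=3.1.1" and "- kmc==3.1.1".
    Entries without a version pin, and other packages whose names merely
    start with "kmc", are left untouched.
    """
    pattern = re.compile(
        r"^([ \t]*-[ \t]*kmc[ \t]*=+[ \t]*)[\w.]+[ \t]*$",
        re.MULTILINE,
    )
    # Keep the original prefix (indentation, dash, name, '=') and swap
    # only the version string that follows it.
    return pattern.sub(lambda m: m.group(1) + new_version, env_yaml)
```

Calling `bump_kmc_pin(text, "3.2.1")` on the file's contents returns the updated text, which can then be written back. Snakemake should rebuild the conda environment on the next run because the environment file has changed.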

Thank you for the help with troubleshooting.

akcorut commented 1 year ago

I'm glad it is running now. It is odd, because I have never had a problem with kmc 3.1.1 before. I will look into this and maybe change the default version to 3.2.1. Thanks again for reporting.

Best, Kivanc