bacpop / ska.rust

Split k-mer analysis – version 2
https://docs.rs/ska/latest/ska/
Apache License 2.0

add back ska distance and ska annotate? #25

Closed (cammo0p closed 1 year ago)

cammo0p commented 1 year ago

ska distance (from version 1) was helpful for clustering: it created a cluster .tsv file and a .dot file for uploading to Microreact, which was quite handy for visualisation. Would you consider adding that function back? (Correct me if I'm wrong, but as far as I can tell pp-sketch can only calculate distances from fasta/fastq input, not from skf files.)

Also, I'd like to ask whether bedtools is meant to replace ska annotate. If I'm using a reference-free approach, how am I supposed to convert skf files into a bedtools-compatible format?

Appreciate your reply and help.

johnlees commented 1 year ago

I've also had another request for the distance method to be added, so that meets my current threshold for doing this! Tracking it here: https://github.com/bacpop/ska.rust/issues/26. I will endeavour to add it in the next month.

Annotation is only possible using a reference (as in v1 of ska). I would probably use ska map to generate a VCF, which you can then annotate for function with many different tools. If you wanted to annotate all the split k-mers, you could use ska nk --full-info to print out all of the k-mers, map these to reference positions with something like bwa fastmap, then use bedtools intersect on the coordinates. Currently I feel this is too specialised a use case to add direct support for.

cammo0p commented 1 year ago

Thank you for your prompt response and for considering the addition of the distance method.

I appreciate your effort in addressing the requests from the community. As for the annotation, I understand that it's a specialized use case, and I appreciate your suggestions on how to achieve this using other tools in conjunction with SKA. I will explore the suggested workflow using ska map, bwa fastmap, and bedtools intersect to annotate the split k-mers.

I also wanted to mention that I've been encountering memory issues while using ska merge with a larger number of samples (90 in my case). Despite dividing the samples into smaller groups and merging them sequentially, I still face issues when merging certain groups. I was wondering if you could consider optimizing memory usage for the ska merge function to handle larger sample sizes.

johnlees commented 1 year ago

> I also wanted to mention that I've been encountering memory issues while using ska merge with a larger number of samples (90 in my case). Despite dividing the samples into smaller groups and merging them sequentially, I still face issues when merging certain groups. I was wondering if you could consider optimizing memory usage for the ska merge function to handle larger sample sizes.

Would you be able to provide some more information here? How much memory do you have, i.e. at what point are you running out? How many k-mers do you have in the files (ska nk)? If the samples are divergent, the number of k-mers will increase a lot when merged, so it may be one or two samples causing the issue.

I do have potential plans to offset some memory use to disk at the expense of increased runtime (#17), but this is more involved so can't guarantee I'll have time to do it.

cammo0p commented 1 year ago

I am currently working with a dataset consisting of 90 samples, and each sample has around 19 million k-mers, totaling approximately 1.7 billion k-mers across all samples. My system has 1 TB of RAM, which should be sufficient for most computational tasks. However, given the large number of samples and the k-mer count in each, I encountered a "killed" message during the analysis, possibly due to running out of memory.

I would appreciate any suggestions on how to optimize the workflow or use alternative approaches that can handle the large dataset efficiently. I am open to using more memory or distributed computing resources if needed to successfully complete the analysis.

Looking forward to your guidance and suggestions.

johnlees commented 1 year ago

An estimate for RAM use is k-mers * (samples + 8), so even if these k-mers are all unshared I'd guess around 200-400 GB. Can you give the command you ran? Using one thread would be sensible.
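As a rough illustration, the estimate above can be worked through with the numbers reported earlier in the thread (90 samples, about 19 million k-mers each, worst case of no sharing). This is only a sketch: the function name is made up for the example, and reading the result as bytes is an assumption, since the thread does not state the unit.

```python
def estimate_ram(n_kmers: int, n_samples: int) -> int:
    """Rule of thumb from the thread: k-mers * (samples + 8).

    Assumption: the result is interpreted as bytes (not stated in the thread).
    """
    return n_kmers * (n_samples + 8)


samples = 90
kmers_per_sample = 19_000_000

# Worst case: no k-mer is shared, so the merged file holds every k-mer.
worst_case_kmers = samples * kmers_per_sample
worst_case = estimate_ram(worst_case_kmers, samples)
print(f"worst case: about {worst_case / 1e9:.0f} GB")

# Best case: every k-mer is shared, so the merged file is no larger than one sample.
best_case = estimate_ram(kmers_per_sample, samples)
print(f"best case: about {best_case / 1e9:.1f} GB")
```

Under these assumptions the worst case comes out in the same order of magnitude as the guess above, while the fully shared case is roughly a hundred times smaller, which is why the number of k-mers in the merged file matters more than the number of samples.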

Memory use should be lower than this though, as where ska works well most of the k-mers will be shared between the samples. Can you try running build on a few smaller subsets of your samples (e.g. 2 samples, 4 samples, 8, 16, 32) and 1) check the memory use with /usr/bin/time -v (or gtime on OS X), which is reported under "Maximum resident set size" at the end; 2) see how many k-mers are in the final .skf files by running ska nk on them, and/or by adding the verbose flag -v to your run.
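If it helps to collect the numbers for several subset sizes, the peak memory figure can be pulled out of the /usr/bin/time -v report with a small script. The following is a minimal sketch, not part of ska: the helper names are hypothetical, it assumes GNU time is installed at /usr/bin/time, and the command to measure is whatever build or merge command you actually ran.

```python
import re
import subprocess
from typing import List, Optional


def max_rss_kbytes(time_report: str) -> Optional[int]:
    """Extract 'Maximum resident set size (kbytes)' from GNU time -v output."""
    match = re.search(r"Maximum resident set size \(kbytes\):\s*(\d+)", time_report)
    return int(match.group(1)) if match else None


def measure_peak_memory(command: List[str]) -> Optional[int]:
    """Run a command under /usr/bin/time -v and return its peak RSS in kbytes.

    GNU time writes its report to stderr, after anything the command itself
    printed there, so the whole of stderr is searched.
    """
    result = subprocess.run(
        ["/usr/bin/time", "-v", *command],
        capture_output=True,
        text=True,
    )
    return max_rss_kbytes(result.stderr)


# Parsing a fragment shaped like a GNU time report (the value is invented).
sample_report = (
    "\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:12.34\n"
    "\tMaximum resident set size (kbytes): 1048576\n"
    "\tExit status: 0\n"
)
peak = max_rss_kbytes(sample_report)
print(f"{peak / 1024**2:.1f} GiB")
```

Calling measure_peak_memory once per subset (2, 4, 8, 16, 32 samples) and noting the ska nk k-mer count next to each result would show whether memory grows with the number of samples or jumps when a particular sample is added.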