BigDataBiology / SemiBin

SemiBin: metagenomics binning with self-supervised deep learning
https://semibin.rtfd.io/

How to run "generate_sequence_features_single" with UNSORTED bam #169

Open GuoYang-qd opened 2 months ago

GuoYang-qd commented 2 months ago

Thank you for developing SemiBin2, an excellent tool that performs exceptionally well and generates a large number of high-quality MAGs.

We are therefore interested in applying SemiBin2 to our large datasets. Since analysing large datasets is usually very time-consuming, we hope to streamline the pipeline as much as possible.

Sorting BAM files often consumes a significant amount of computational and storage resources (e.g., the temporary files created during sorting are usually hundreds of GB per BAM in our case). However, it seems that SemiBin2 does not support unsorted BAM as input, as an error occurs when running the "generate_sequence_features_single" module:

Input error: Chromosome k127_4971567 found in non-sequential lines. This suggests that the input file is not sorted correctly.

Are there any alternative tools or ways to generate "data.csv" and "data_split.csv" from unsorted BAM files? Or is it possible to make simple modifications to the "generate_sequence_features_single" module to adapt it to unsorted BAM?

luispedro commented 2 months ago

Unfortunately, it's not trivial to use non-sorted files. It's conceptually possible (we do so in NGLess), but not in a way that fits SemiBin.
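
To illustrate why sort order is not conceptually required for a mean-coverage value, here is a minimal sketch that accumulates aligned bases per contig from records in arbitrary order. It is not how NGLess or SemiBin implements this. The function name `mean_depth_unsorted` and the `(contig, start, end)` input format are assumptions made for the example, and it ignores clipping, indels, and any alignment filtering.

```python
from collections import defaultdict


def mean_depth_unsorted(alignments, contig_lengths):
    """Mean per-base depth per contig from alignments in arbitrary order.

    alignments: iterable of (contig, start, end) tuples, 0-based half-open
        reference intervals (hypothetical input format for this sketch).
    contig_lengths: dict mapping contig name to its length in bp.
    """
    aligned_bases = defaultdict(int)
    for contig, start, end in alignments:
        # A dictionary lookup replaces the guarantee a sorted file gives,
        # namely that all records of one contig are adjacent.
        aligned_bases[contig] += end - start
    # Sum of aligned bases divided by contig length is the mean depth.
    return {name: aligned_bases[name] / length
            for name, length in contig_lengths.items()}


records = [
    ("k127_2", 0, 100),
    ("k127_1", 50, 150),
    ("k127_2", 100, 200),  # k127_2 appears again, non-sequentially
    ("k127_1", 0, 100),
]
depth = mean_depth_unsorted(records, {"k127_1": 200, "k127_2": 400})
```

The cost is memory: one counter per contig for the mean, or a full per-base array per contig if any per-position statistic is needed, which is the main thing sorted input lets a tool avoid.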

GuoYang-qd commented 1 month ago

Thanks for the reply. I can now generate the tetramer frequencies in "data.csv". The abundance calculated by NGLess seems to follow a similar trend to the abundance that SemiBin generates with Bedtools. Can the NGLess abundance therefore replace the Bedtools abundance?

Additionally, I noticed that "data_split.csv" appears to sample contigs from "data.csv" and then split each contig's abundance and tetramer frequencies into two values (the average of these two values seems to be the number in "data.csv"). How is this done? Could you briefly explain the logic behind it?

Thanks!
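
One plausible reading of the "data_split.csv" observation above, sketched under the assumption that each sampled contig is cut at its midpoint and features are computed separately for each half. Nothing in this thread confirms that SemiBin does exactly this, which contigs it selects, or how it normalizes. The names `tetramer_freqs` and `split_features`, and the per-base depth list used as input, are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


# Merging each tetramer with its reverse complement leaves 136 classes.
CANONICAL_4MERS = sorted(
    {canonical("".join(p)) for p in product("ACGT", repeat=4)}
)


def tetramer_freqs(seq):
    """Relative frequency of each canonical tetramer in seq."""
    counts = dict.fromkeys(CANONICAL_4MERS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        key = canonical(seq[i:i + 4])
        if key in counts:  # windows containing N or other symbols are skipped
            counts[key] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in CANONICAL_4MERS]


def split_features(seq, depth):
    """Features for the two halves of a contig.

    depth: per-base depth values, same length as seq (assumed input).
    """
    mid = len(seq) // 2
    halves = []
    for lo, hi in ((0, mid), (mid, len(seq))):
        halves.append({
            "kmer": tetramer_freqs(seq[lo:hi]),
            "depth": sum(depth[lo:hi]) / (hi - lo),
        })
    return halves


contig = "ACGTACGTAAAATTTT"
per_base_depth = [2] * 8 + [4] * 8
first, second = split_features(contig, per_base_depth)
whole_depth = sum(per_base_depth) / len(per_base_depth)
```

Under this assumption the averaging pattern follows directly: when the two halves have equal length, the mean of their depths equals the depth of the whole contig. The tetramer frequencies only average approximately, because windows spanning the cut point are counted in the whole contig but in neither half.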