TimoLassmann / kalign

A fast multiple sequence alignment program.
GNU General Public License v3.0

Memory consumption usage #42

Status: Closed (Tang-pro closed this issue 5 months ago)

Tang-pro commented 6 months ago

Hi, @TimoLassmann

Great software. I am using version 2.04 and have 180,000 transcript sequences. How much memory will I need?

Best!

TimoLassmann commented 6 months ago

The amount of memory needed cannot be predicted in advance. It depends on the sequence lengths, their similarity and alignment parameters. I am curious about your specific task: why do you need to align 180K transcripts?

Tang-pro commented 6 months ago

Hi, @TimoLassmann I built the full-length isoforms of two species from PacBio data. I want to compare the isoforms of these two species and use PhastCons to evaluate their conservation. In fact, this is just an attempt and I don’t know how to do it. Is it reasonable? Also, when I run the software, I get an error message caused by insufficient memory.

TimoLassmann commented 5 months ago

Hi, from your description it sounds like you are attempting to align all transcripts from different genes at the same time. You should consider aligning the transcripts to reference genomes, collecting the transcripts that map to homologous regions, and performing your analysis on a gene-by-gene basis. If no reference genome is available, you could look into high-throughput unsupervised clustering approaches to group similar sequences first, then perform your analysis on the clusters one by one.
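
As a rough illustration of the gene-by-gene grouping suggested above (a minimal sketch, not part of kalign): the code below splits one large FASTA file into one file per gene, so that each file can be aligned separately. The `transcript_to_gene` mapping is an assumption; in practice it would come from the transcript-to-genome alignments or from a clustering step. The function names and the demo sequences are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the transcript ID.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def group_by_gene(records, transcript_to_gene):
    """Bucket records by gene ID; transcripts with no assigned gene are skipped."""
    groups = defaultdict(list)
    for name, seq in records:
        gene = transcript_to_gene.get(name)
        if gene is not None:
            groups[gene].append((name, seq))
    return dict(groups)


def write_groups(groups, out_dir):
    """Write one FASTA file per gene and return the list of paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for gene, members in sorted(groups.items()):
        path = out_dir / f"{gene}.fa"
        with open(path, "w") as fh:
            for name, seq in members:
                fh.write(f">{name}\n{seq}\n")
        paths.append(path)
    return paths


# Hypothetical demo data: three transcripts assigned to two genes.
demo_fasta = ">t1 speciesA\nACGT\nAC\n>t2 speciesB\nACGA\n>t3 speciesA\nTTTT\n"
demo_map = {"t1": "geneX", "t2": "geneX", "t3": "geneY"}
groups = group_by_gene(read_fasta(demo_fasta), demo_map)
```

Each file returned by `write_groups(groups, "per_gene")` then holds only the isoforms of one gene, which keeps every individual alignment small compared with aligning all 180,000 sequences at once.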

Tang-pro commented 5 months ago

Hi, @TimoLassmann Sorry, maybe my previous statement was unclear. I have a reference genome, and I have already compared the transcripts to it. But my current goal is to evaluate the conservation of these isoforms. My 180,000 isoforms can be divided into seven categories, with the largest category containing 80,000 to 90,000 sequences. Can Kalign support this?