TimoLassmann / kalign

A fast multiple sequence alignment program.
GNU General Public License v3.0
128 stars, 28 forks

Segmentation fault (core dumped) - 10 million sequence dataset #34

Closed. Shubhangi1397 closed this issue 2 years ago.

Shubhangi1397 commented 2 years ago

Hi TimoLassmann/kalign, I hope this email finds you well. As mentioned earlier, I am working on large datasets of SARS-CoV-2 sequences. My current dataset has ~10 million sequences, and I am getting a segmentation fault again during alignment. It ran fine with ~1 million sequences.

[2022-10-27 10:45:32] : LOG : reading fasta
[2022-10-27 10:56:22] : LOG : Detected protein sequences.
[2022-10-27 10:57:53] : LOG : CPU Time: 932.92u 00:15:32.91 Elapsed: 00:15:33.00
[2022-10-27 10:57:53] : LOG : Detected: 10842878 sequences.
[2022-10-27 10:57:56] : LOG : Calculating pairwise distances
[2022-10-27 11:23:25] : LOG : CPU Time: 3737.16u 01:02:17.15 Elapsed: 00:25:29.00
[2022-10-27 11:23:25] : LOG : 32 anchors
[2022-10-27 11:23:25] : LOG : Building guide tree.
[2022-10-27 11:56:22] : LOG : CPU Time: 14340.20u 03:59:00.19 Elapsed: 00:32:57.00
[2022-10-27 11:58:47] : LOG : Aligning
Segmentation fault (core dumped)

TimoLassmann commented 2 years ago

Hi, I never aligned that many sequences! Would it be possible for you to share the input file with me? Thanks, T

Shubhangi1397 commented 2 years ago

Hi Timo, I was wondering if you can please share your email ID with me? I will share the input file link with you :) My contact email ID is @.***

Many thanks, Shubhangi

--

Shubhangi Kandwal

PhD Student in Biochemistry,

School of Biochemistry and Immunology,

Trinity Biomedical Sciences Institute (TBSI),

Trinity College Dublin, University of Dublin

email: @.***

Shubhangi1397 commented 2 years ago

Hi Timo, I hope this email finds you well. I can't share the dataset (~10 million sequences) here on GitHub because of the database's data sharing policy, but I can send the input file to your email ID as a Google link. If I post the link in this thread, I think it will also appear publicly on GitHub. Hope to hear from you soon. Best regards, Shubhangi

TimoLassmann commented 2 years ago

Great - send me the link.

TimoLassmann commented 2 years ago

The input file contained a handful of sequence entries without a sequence. To address this, kalign now runs some basic checks on the input before the alignment steps.
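For anyone who wants to screen a large FASTA file for this problem before running an alignment, here is a minimal sketch of such a check. It is not kalign's actual implementation (kalign is written in C and its checks are not shown in this thread); the function name `find_empty_fasta_entries` is hypothetical, and the sketch assumes plain FASTA input where each record starts with a `>` header line.

```python
from typing import Iterable, List


def find_empty_fasta_entries(lines: Iterable[str]) -> List[str]:
    """Return the headers of FASTA records that contain no sequence residues.

    Hypothetical helper for illustration only; not part of kalign.
    """
    empty = []
    header = None  # header of the record currently being read
    length = 0     # number of residues seen so far for that record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no residues
        if line.startswith(">"):
            # A new record starts: report the previous one if it was empty.
            if header is not None and length == 0:
                empty.append(header)
            header = line[1:]
            length = 0
        elif header is not None:
            length += len(line)
    # The last record has no following header, so check it separately.
    if header is not None and length == 0:
        empty.append(header)
    return empty


# Small in-memory example: records "b" and "d" have a header but no sequence.
example = [">a", "ACGT", ">b", ">c", "", "GG", ">d"]
print(find_empty_fasta_entries(example))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a file with millions of records is scanned in one pass without loading it into memory.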