dentearl / mafTools

Bioinformatics tools for dealing with Multiple Alignment Format (MAF) files.

mafSorter Seg fault for large files #14

Closed 4ureliek closed 7 years ago

4ureliek commented 7 years ago

Hi, I used mafSorter on 120 maf files (outputs of mafStrander), and it is failing for the 29 largest ones with this error: Segmentation fault (core dumped)

This makes me think there is a memory allocation issue. I watched one of the jobs, though, and I did not see the memory usage go up much (and I am on a machine with 2 TB of memory). Do you have any suggestions for solving this? Thank you!

The smallest file that fails is 2423694954 bytes, and the largest file that does not fail is 2388039004 bytes.

dentearl commented 7 years ago

So 2.42G vs 2.39G, hmm. Are the files so large because they have lots of blocks or because there are a few blocks that are enormous?

One way to really get at this would be to compile with debugging on and run it through gdb. That would give us the location of the failure at least.

4ureliek commented 7 years ago

Thank you for the quick reply! Following your question about block numbers, I counted the blocks in all files, and the largest files are in fact largest because they have more blocks (size and block count correlate, R² = 0.995). The counts range from 1081237 to 3982663 blocks for the files that failed, and go up to 1038319 blocks for the files that did not fail. This does not exclude the possibility that there are very large blocks in the files that failed, though. I do not know how to compile with debugging on; if you send me a command line I will check. I could also filter out small blocks and see what happens. Is this an option in one of the mafTools?
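For anyone wanting to answer the "many blocks or a few enormous ones" question on their own files, here is a minimal sketch of one way to gather both numbers in a single streaming pass. It is not part of mafTools: the function name `maf_block_stats` and the sample data are hypothetical, and it assumes the usual MAF layout in which each block starts with an `a` line and each sequence row is an `s` line whose seventh field is the aligned text.

```python
import io


def maf_block_stats(handle):
    """Return (block_count, max_rows_in_a_block, max_alignment_columns).

    Streams the file line by line, so memory use does not grow with file size.
    """
    blocks = 0
    rows = 0
    max_rows = 0
    max_cols = 0
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # blank line between blocks
        if fields[0] == "a":
            # An 'a' line opens a new alignment block.
            blocks += 1
            rows = 0
        elif fields[0] == "s":
            # 's' line: src, start, size, strand, srcSize, text
            rows += 1
            max_rows = max(max_rows, rows)
            if len(fields) >= 7:
                max_cols = max(max_cols, len(fields[6]))
    return blocks, max_rows, max_cols


# Hypothetical two-block example, for illustration only.
SAMPLE = """##maf version=1
a score=10
s hg18.chr1 0 5 + 100 ACGTA
s mm9.chr2 3 4 + 200 AC-TA

a score=5
s hg18.chr1 10 3 + 100 GGG
"""

if __name__ == "__main__":
    print(maf_block_stats(io.StringIO(SAMPLE)))
```

On a real file, the function can be called with an open file handle, for example `with open("superscaffold8.strand.maf") as fh: print(maf_block_stats(fh))`.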

diekhans commented 7 years ago

With gcc, you can compile with both optimization and debugging, so there is no real reason to not have debugging always turned on (unless you care about making your executable smaller).


dentearl commented 7 years ago

@4ureliek I don't think there's a specific mafTool to filter blocks by size. If there were, it'd live in mafFilter. One thing: does the maf validate? When you run it through mafValidator, does it come out clean? I bet it does, but it'd be an easier fix for me if it doesn't validate ;) Compiling with debugging is as easy as passing -g -O0 to gcc, iirc.

@diekhans yeah, I guess we could leave the -g and -O0 flags on all the time but eh. :)
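Since no mafTool filters blocks by size, a small standalone script is one way to try the "filter out small blocks" experiment. The sketch below is only an illustration and is not part of mafTools or mafFilter: the names `filter_maf` and `block_shape`, the thresholds, and the sample data are all hypothetical. It assumes blocks begin with an `a` line and end at a blank line, the next `a` line, or end of file.

```python
import io


def block_shape(block):
    """Return (number of 's' rows, alignment length) for one block's lines."""
    rows = [ln.split() for ln in block if ln.split()[:1] == ["s"]]
    cols = max((len(f[6]) for f in rows if len(f) >= 7), default=0)
    return len(rows), cols


def filter_maf(src, dst, min_rows=2, min_cols=1):
    """Copy src to dst, keeping only blocks that meet both thresholds.

    Returns the number of blocks kept. Lines outside any block
    (for example the '##maf' header) are copied through unchanged.
    """
    kept = 0
    block = []

    def flush():
        nonlocal kept
        if block:
            n_rows, n_cols = block_shape(block)
            if n_rows >= min_rows and n_cols >= min_cols:
                dst.writelines(block)
                dst.write("\n")  # blank line terminates the block
                kept += 1
            block.clear()

    for line in src:
        if not line.endswith("\n"):
            line += "\n"  # tolerate a missing final newline
        if line.split()[:1] == ["a"]:
            flush()
            block.append(line)
        elif not line.strip():
            flush()
        elif block:
            block.append(line)
        else:
            dst.write(line)
    flush()
    return kept


# Hypothetical three-block example, for illustration only.
SAMPLE = """##maf version=1
a score=10
s hg18.chr1 0 5 + 100 ACGTA
s mm9.chr2 3 4 + 200 AC-TA

a score=5
s hg18.chr1 10 3 + 100 GGG

a score=7
s hg18.chr1 20 2 + 100 AC
s mm9.chr2 30 2 + 200 AC
"""

if __name__ == "__main__":
    out = io.StringIO()
    print(filter_maf(io.StringIO(SAMPLE), out, min_rows=2, min_cols=3))
    print(out.getvalue(), end="")
```

Only one block is held in memory at a time, so the approach should scale to multi-gigabyte files as long as no single block is itself enormous.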

4ureliek commented 7 years ago

I ran this: python ~/mafTools/bin/mafValidator.py --maf=superscaffold8.strand.maf

It said 'done', but I did not see anything else in stderr or stdout. Does that mean it passed, or am I missing something?

Thanks!

diekhans commented 7 years ago

> @diekhans yeah, I guess we could leave the -g and -O0 flags on all the time but eh. :)

@dentearl, gcc lets you do -g -O3; it can be a little weird to step through in the debugger when code is reordered, but you can still get good stack traces, etc., on optimized code. This was a groundbreaking feature at one time ...

dentearl commented 7 years ago

@4ureliek Yeah, iirc 'done' is the best you can hope for. Hm. Can you put the failing maf somewhere publicly available where I can download it and experiment? I can't promise much because of competing priorities but I'm curious enough to spend some more time on this.

@diekhans huh, #til.

4ureliek commented 7 years ago

Hi, I completely forgot to share with you one of the files... I need an email to share the folder where it's at, could you email me at 4urelie.k (at) gmail ? Thanks!

4ureliek commented 7 years ago

I ended up using the code from here: https://raw.githubusercontent.com/UCSantaCruzComputationalGenomicsLab/last/master/scripts/maf-sort.sh and it seems to be doing the trick on these big files.