ddarriba / ForeSeqs

Forecasting Sequences for multigene alignments
GNU General Public License v2.0

Segmentation fault #2

Closed carloliveros closed 7 years ago

carloliveros commented 7 years ago

Hi Diego,

I was able to compile foreseqs in a CentOS machine with 64GB of memory. The binary runs fine with the example dataset but I encounter a segmentation fault before any text is written on the screen with a dataset with 221 taxa, 2.4 million bp, and 4060 partitions. If you wish to get a copy of the input files, I can send it to you privately.

Cheers Carl

ddarriba commented 7 years ago

Hi Carl, that would be great. That way I can easily check what happened.

stamatak commented 7 years ago

possibilities of what could be the cause:

  1. out of memory
  2. not casting to size_t in some mallocs
  3. insufficient integer range somewhere, i.e., int instead of unsigned long or size_t in some products for calculating required bytes of memory etc.

those were the typical causes for such errors in RAxML,

Alexis


-- Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson

www.exelixis-lab.org
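Causes 2 and 3 in the list above can be illustrated with a small sketch. It is not taken from ForeSeqs, PLL, or RAxML; the helper names `checked_bytes` and `alloc_matrix` are hypothetical. With the dataset from this thread, 221 taxa × 2,400,000 sites × 8 bytes per double is 4,243,200,000 bytes, which exceeds the 2,147,483,647 limit of a 32-bit `int`, so a product computed in `int` would overflow before `malloc` ever sees it.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: multiply three counts into a byte count in size_t.
   Returns 1 and stores the product in *out, or returns 0 if it would overflow. */
int checked_bytes(size_t a, size_t b, size_t c, size_t *out)
{
    if (b != 0 && a > SIZE_MAX / b) return 0;   /* a * b would overflow */
    size_t ab = a * b;
    if (c != 0 && ab > SIZE_MAX / c) return 0;  /* ab * c would overflow */
    *out = ab * c;
    return 1;
}

/* Hypothetical helper: allocate a taxa x sites matrix of doubles.
   Returns NULL if the size overflows size_t or the allocation fails. */
double *alloc_matrix(size_t taxa, size_t sites)
{
    size_t bytes;
    if (!checked_bytes(taxa, sites, sizeof(double), &bytes)) return NULL;
    return malloc(bytes);
}
```

Doing the arithmetic in `size_t` from the first operand onward, and checking the `malloc` result, turns a silent wrap-around into an explicit failure that can be reported.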

carloliveros commented 7 years ago

Hi Diego and Alexis,

I just sent Diego a Dropbox link to the data files. Let me know what you find out or if you need additional info about the dataset.

Cheers Carl

ddarriba commented 7 years ago

Hi Carlos,

The error happens because the number of partitions in PLL is hardcoded. There is a definition in pll.h as follows:

#define PLL_NUM_BRANCHES 2900

Since your data contains 4,060 partitions, you need to set that constant to 4060 or higher. Then recompile the library and that should do the trick.

Cheers, Diego.
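A guard of the following kind would turn this failure into a readable error instead of a segmentation fault. This is only a sketch: the function `partitions_fit` is hypothetical and is not part of PLL or ForeSeqs, and the fallback `#define` simply mirrors the value quoted above (in a real build it would come from `pll.h`).

```c
#include <stdio.h>

/* Assumption: mirrors the limit quoted in this thread; normally provided by pll.h. */
#ifndef PLL_NUM_BRANCHES
#define PLL_NUM_BRANCHES 2900
#endif

/* Hypothetical check: returns 1 if the partition count fits the compile-time
   limit, otherwise prints a diagnostic to stderr and returns 0. */
int partitions_fit(int num_partitions)
{
    if (num_partitions > PLL_NUM_BRANCHES) {
        fprintf(stderr,
                "error: %d partitions exceed PLL_NUM_BRANCHES (%d); "
                "raise the limit in pll.h and rebuild PLL\n",
                num_partitions, PLL_NUM_BRANCHES);
        return 0;
    }
    return 1;
}
```

With the limit at 2900, `partitions_fit(4060)` returns 0 and prints the message, while any count up to 2900 passes.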

carloliveros commented 7 years ago

Hi Diego,

I recompiled PLL and foreseqs and that got rid of the segmentation fault. ForeSeqs ran until it said: "Seting fixed topology; Loading alignment; Initializing model." The process was then killed because it ran out of memory in my 64GB-memory machine. Does ForeSeqs need at least the same amount of memory as RAxML? The RAxML binary conversion function says that I need at least 48 GB of memory. Do I need at least twice that amount too?

Carl

stamatak commented 7 years ago

On 07.04.2017 21:23, Carl Oliveros wrote:

> Does ForeSeqs need at least the same amount of memory as RAxML?

yes, approximately ...

> The RAxML binary conversion function says that I need at least 48 GB of memory. Do I need at least twice that amount too?

I am afraid so,

alexis
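For a rough sense of scale, the memory for the conditional likelihood vectors can be estimated with the rule of thumb commonly quoted for RAxML: (taxa - 2) × distinct site patterns × states × rate categories × 8 bytes. The sketch below is an assumption-laden illustration, not ForeSeqs code: the function name `clv_bytes` is hypothetical, 4 states and 4 rate categories assume DNA data under a GAMMA model, and the number of distinct site patterns is not given in this thread, so the full 2.4 million sites serve as an upper bound.

```c
#include <stdint.h>

/* Hypothetical estimator: bytes needed for conditional likelihood vectors,
   one vector per inner node of an unrooted binary tree (taxa - 2 of them). */
uint64_t clv_bytes(uint64_t taxa, uint64_t patterns,
                   uint64_t states, uint64_t rate_cats)
{
    if (taxa < 2) return 0;  /* no inner nodes to store */
    return (taxa - 2) * patterns * states * rate_cats * (uint64_t)sizeof(double);
}
```

Under those assumptions, `clv_bytes(221, 2400000, 4, 4)` gives 67,276,800,000 bytes, roughly 63 GiB. The real figure depends on how many distinct patterns the alignment has and on what else the program allocates.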


carloliveros commented 7 years ago

Hi Alexis and Diego,

Do you have any plans on coming up with a version of this software that can run in parallel in multi-core/multi-node environments? It's been running for 48 hours now on the "Optimizing per-gene branch lengths / model parameters" step on my dataset on a single core (using up 68.1GB of memory in case you are curious).

Cheers Carl

carloliveros commented 7 years ago

Hi Alexis and Diego,

Does the branch length of the reference tree have to be optimized only with the partitions without missing data? Your paper defines the reference tree as such. My concern with using only the partitions without missing data to optimize the branch length is that I only have 15 such partitions out of 4060, and so they may not have enough information for branch length optimization. Is it ok to use the full dataset? Although this of course will have the long branches for tips with a lot of missing data. Any advice?

Also my previous run did not complete because of wall time constraints. Do you think I can do this by dividing up my dataset into smaller groups of partitions and running the analysis on individual groups?

Cheers Carl

stamatak commented 7 years ago

Dear Carl,

> Does the branch length of the reference tree have to be optimized only with the partitions without missing data?

No, the branch lengths are optimized on all partitions. Subsequently the algorithm steals branch lengths from partitions without missing data if I am not mistaken (Diego please confirm).

> Your paper defines the reference tree as such. My concern with using only the partitions without missing data to optimize the branch length is that I only have 15 such partitions out of 4060, and so they may not have enough information for branch length optimization.

The branch lengths are not optimized but stolen from these 15 partitions. In my original implementation, branch lengths for a specific bipartition of the tree were stolen from all partitions that have data on both sides of that bipartition, but I can't remember how Diego implemented this.

> Is it ok to use the full dataset? Although this of course will have the long branches for tips with a lot of missing data.

Those will not be taken into account for branch length stealing.

> Any advice?

> Also my previous run did not complete because of wall time constraints. Do you think I can do this by dividing up my dataset into smaller groups of partitions and running the analysis on individual groups?

Maybe, as long as all 15 complete partitions are contained in these smaller datasets.

Alexis
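The constraint in the last answer (every subset must contain all complete partitions) can be sketched as a simple assignment. This is a hypothetical helper, not a ForeSeqs feature: `assign_groups` and its arguments are invented for illustration, and dealing the incomplete partitions round-robin is just one possible choice.

```c
#include <stddef.h>

/* Hypothetical helper: decide which group each partition goes into.
   is_complete[p] is nonzero if partition p has no missing data.
   group_of[p] is set to -1 for complete partitions (copy into every group),
   otherwise to a group id in [0, num_groups). Requires num_groups > 0. */
void assign_groups(const unsigned char *is_complete, size_t num_partitions,
                   int num_groups, int *group_of)
{
    int next = 0;  /* next group to receive an incomplete partition */
    for (size_t p = 0; p < num_partitions; ++p) {
        if (is_complete[p]) {
            group_of[p] = -1;
        } else {
            group_of[p] = next;
            next = (next + 1) % num_groups;
        }
    }
}
```

For the dataset discussed here, 4 groups would each receive the 15 complete partitions plus about a quarter of the remaining 4,045, that is roughly 1,026 partitions per run.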


ddarriba commented 7 years ago

That is right. The default behavior is as you described, taking the average of all the 'existing' branch lengths.
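As a sketch of what "taking the average of all the 'existing' branch lengths" could look like, the function below averages one branch over the partitions flagged as having a usable length for it. This is not the ForeSeqs implementation: the function name, the flat row-major layout, and the `has_data` flags are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical sketch: average the length of one branch over all partitions
   that have a usable ("existing") length for it.
   bl and has_data are num_partitions x num_branches arrays in row-major order.
   Returns -1.0 if no partition provides a length for this branch. */
double stolen_branch_length(const double *bl, const unsigned char *has_data,
                            size_t num_partitions, size_t num_branches,
                            size_t branch)
{
    double sum = 0.0;
    size_t count = 0;
    for (size_t p = 0; p < num_partitions; ++p) {
        size_t idx = p * num_branches + branch;
        if (has_data[idx]) {
            sum += bl[idx];
            ++count;
        }
    }
    return count ? sum / (double)count : -1.0;
}
```

Partitions whose flag is zero for a branch, for example because the taxa on one side of it are missing, contribute nothing to that branch's average.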

carloliveros commented 7 years ago

I am able to run the program after dividing my dataset into 4 parts each with ~ 1000 partitions. ForeSeqs runs fine with 3 of the 4 parts but I am encountering another error on one of the parts. Opening a new issue.