EvolBioInf / andi

♥ Efficient Estimation of Evolutionary Distances
GNU General Public License v3.0

distance matrix name length #7

Closed kloetzl closed 7 years ago

kloetzl commented 8 years ago

The PHYLIP format for distance matrices only allows identifiers of up to 9 (or 10?) characters. Unfortunately, this means that names are sometimes cut off, making them indistinguishable. A --name-length option might help.
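To illustrate how truncation makes names indistinguishable, here is a minimal Python sketch. It assumes a 10-character name field (the post above is unsure whether the limit is 9 or 10), and the isolate names are hypothetical. It is not andi's code, only a way to check a set of names for collisions before a run.

```python
from collections import defaultdict

# Assumption: the name field is 10 characters wide; adjust if your tool differs.
PHYLIP_NAME_WIDTH = 10


def truncated_collisions(names, width=PHYLIP_NAME_WIDTH):
    """Group names that become identical once cut to `width` characters."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    # Keep only the truncated names shared by more than one full name.
    return {short: full for short, full in groups.items() if len(full) > 1}


# Hypothetical sequence names, for illustration only.
names = ["Escherichia_coli_K12", "Escherichia_coli_O157", "Salmonella_LT2"]
print(truncated_collisions(names))
```

An empty result means every name stays unique after truncation.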

schultzm commented 7 years ago

Hi,

I'm trying to run andi on 1843 bacterial isolate genomes. I pass andi the list of isolates for analysis from a file using a pipe and xargs: cat filenames.txt | xargs andi -j -m JC -t 64 > andi_JCdist.mat

filenames.txt contains a space-delimited list of file paths. The total length of the string in filenames.txt is 164389 characters.

The analysis runs without errors. But the output matrix contains two phylip matrices. The first matrix is 1468x1468 and the second is 375x375.
I would expect that if the second matrix was a continuation of the first, they should both have 1843 columns. It appears that the analysis has been broken down into two blocks, so an all vs all comparison for the 1843 isolates has not been performed. Only an all vs all comparison within each of the blocks has been performed. That is, I was expecting 1843^2 cell values, but only got 1468^2+375^2 values.
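The gap described above can be quantified with a short Python sketch. It assumes that every matrix starts with a line holding only the number of sequences and that each matrix row sits on a single line; both helper names are hypothetical.

```python
def matrix_sizes(text):
    """Return the dimension of each matrix in concatenated PHYLIP-style output.

    Assumes a header line with the sequence count n, followed by n rows.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    sizes, i = [], 0
    while i < len(lines):
        n = int(lines[i].strip())  # header line: number of sequences
        sizes.append(n)
        i += n + 1                 # skip the header and its n rows
    return sizes


def missing_cells(sizes):
    """Cells absent when blocks are compared separately instead of all vs all."""
    total = sum(sizes)
    return total * total - sum(n * n for n in sizes)


# Tiny made-up output holding a 2x2 and a 1x1 matrix.
example = "2\nA 0.0 0.1\nB 0.1 0.0\n1\nC 0.0\n"
print(matrix_sizes(example))       # [2, 1]

# Block sizes reported in this thread.
print(missing_cells([1468, 375]))  # 1101000
```

For the two blocks reported here, 1843^2 = 3396649 cells were expected but only 1468^2 + 375^2 = 2295649 were produced, so the 2 * 1468 * 375 = 1101000 between-block comparisons are missing.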

Thanks,

Mark

kloetzl commented 7 years ago

Hi Mark,

Thanks for your interest in andi. The problem you are facing is of a different nature, though. From the xargs man page:

The command line for command is built up until it reaches a system-defined limit (…). The specified command will be invoked as many times as necessary to use up the list of input items.

So the command you are trying to build either has too many files, or the combined path lengths exceed the command buffer. You can check the limits for your system via xargs --show-limits. For my system, the output is:

Your environment variables take up 1359 bytes
POSIX upper limit on argument length (this system): 2093745
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2092386
Size of command buffer we are actually using: 131072
Maximum parallelism (--max-procs must be no greater): 2147483647

You might want to try increasing your system's limits. If that doesn't work, I could supply you with a custom version of andi that can read filenames from a list.
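The splitting behaviour quoted from the man page can be modelled with a small Python sketch. This is a simplified model, not xargs itself: real xargs also accounts for environment size and other overhead, and the paths below are hypothetical. The buffer size of 131072 is the value from the --show-limits output above.

```python
def batch_arguments(args, command, buffer_size):
    """Greedily pack arguments into command lines of at most buffer_size characters.

    Simplified model: each argument costs its length plus one separator.
    """
    base = len(command) + 1
    batches, current, used = [], [], base
    for arg in args:
        cost = len(arg) + 1
        if current and used + cost > buffer_size:
            batches.append(current)  # buffer full: this batch is one invocation
            current, used = [], base
        current.append(arg)
        used += cost
    if current:
        batches.append(current)
    return batches


# Hypothetical paths: 1843 files with long directory names.
paths = [f"/data/{'d' * 70}/isolate_{i:04d}.fa" for i in range(1843)]
batches = batch_arguments(paths, "andi -j -m JC -t 64", buffer_size=131072)
print([len(b) for b in batches])
```

Each batch corresponds to one separate invocation of the command, so a list of 164389 characters cannot fit into a single 131072-character command line, and every invocation prints its own matrix.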

Hope this helps, Fabian

ps. For future reference: Please open a new Issue to start a new thread of discussion.

schultzm commented 7 years ago

Hi Fabian,

Thanks for the reply. Apologies for posting in an old issue.

My xargs limits are the same as yours. In the user manual it states that andi accepts filenames from stdin but I could not get it to do that, which is why I used xargs. Yes, please send me the version that accepts a file of file names. Will you also publicly release this version? That would be useful to a lot of people I’m sure.

Cheers,

Mark

tseemann commented 7 years ago

The next problem you will face is that by default most processes are limited to 1024 open files; see ulimit -a.

Does andi open all the input files at once?
Or one at a time and close them as it goes?
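For reference, the same limit that ulimit reports can be read from Python via the standard resource module (Unix only). This is a minimal sketch; the helper name and the default of three reserved descriptors (stdin, stdout, stderr) are assumptions for illustration.

```python
import resource

# Soft and hard limits on open file descriptors for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")


def exceeds_open_file_limit(n_files, reserved=3):
    """True if holding n_files open at once would pass the soft limit.

    `reserved` covers descriptors already in use (assumed: stdin, stdout, stderr).
    """
    limit, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    if limit == resource.RLIM_INFINITY:
        return False
    return n_files + reserved > limit


print(exceeds_open_file_limit(1843))
```

This only matters for a program that keeps all input files open simultaneously.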

kloetzl commented 7 years ago

Yes, please send me the version that accepts a file of file names. Will you also publicly release this version?

I am already working on it and it will get into the next official release.

Or one at a time and close them as it goes?

One after the other.

kloetzl commented 7 years ago

I pushed some commits that fix the problems in this issue. A preliminary version supporting the new --file-of-filenames parameter can be downloaded from here: https://kloetzl.info/downloads/andi-0.11-beta.tar.gz You will need to follow the instructions for installing from a “source package” in the manual. The “fof” file should contain exactly one path per line, and the last path must also be followed by a line break; otherwise, you will receive weird error messages. I will work on making the code more robust after lunch. :smiley:

Tip: If you run andi with the option --verbose, it will output the number of sequences it compares early on. That way you can abort if the numbers don't match.
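A file in the format described above (one path per line, with a line break after the last path) could be produced and checked with a sketch like the following. The helper names and the example paths are hypothetical, and nothing here calls andi itself.

```python
import tempfile
from pathlib import Path


def write_file_of_filenames(paths, fof_path):
    """Write one path per line, ending every line (including the last) with a newline."""
    lines = [str(p) for p in paths]
    if any(not ln or "\n" in ln for ln in lines):
        raise ValueError("paths must be non-empty and must not contain line breaks")
    Path(fof_path).write_text("".join(ln + "\n" for ln in lines))
    return len(lines)


def read_file_of_filenames(fof_path):
    """Return the listed paths, rejecting a file whose last line lacks a line break."""
    text = Path(fof_path).read_text()
    if text and not text.endswith("\n"):
        raise ValueError("the last path must be followed by a line break")
    return text.splitlines()


# Demonstration with hypothetical file names in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fof = Path(tmp) / "genomes.fof"
    count = write_file_of_filenames(["a.fasta", "b.fasta"], fof)
    raw = fof.read_text()
    listed = read_file_of_filenames(fof)

print(count, listed)
```

The returned count is the number to compare against the sequence count that andi reports with --verbose.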

tseemann commented 7 years ago

Thank you for your help @kloetzl

kloetzl commented 7 years ago

Let me know if andi successfully completes the big run. Then I can close this issue.

kloetzl commented 7 years ago

I am closing this issue. Both problems should be fixed in the current master and thus in the upcoming release.