Open mikolmogorov opened 10 years ago
Hi Mikhail,
Which options are you using? In particular, are you specifying --maxRefGap
0?
thanks -Glenn
On Fri, Mar 7, 2014 at 11:15 AM, Mikhail Kolmogorov < notifications@github.com> wrote:
hal2mafMP.py duplicates aligment blocks when running in multiple processes. i.e. for arbitary maf: ... a s name1 aa bb .. s name2 cc dd ..
a s name1 aa bb .. s name2 cc dd.. ... problem does not appear in ordinary hal2maf executable, so it is probably something wrong with splitting reference. How to reproduce:
hal2maf hal maf hal2mafMP.py hal maf_mp --numProc 4 ls -l
Result: maf_mp is longer
Reply to this email directly or view it on GitHubhttps://github.com/glennhickey/hal/issues/5 .
Hi Glenn, thanks for your reply. Here is the exact command line:
hal2mafMP.py helicobacter.hal --maxRefGap 100 --numProc 4 helicobacter.maf --noAncestors --refGenome sjm180
With this file: https://drive.google.com/file/d/0B1pUguR1yn7TSmtzRlNFMnVPa3M/edit?usp=sharing Result - some of the alignment blocks produced are overlapping (which doesn't appear in non-parallel version)
If I remove --maxRefGap option, the program even crashes:
Traceback (most recent call last):
File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/bin/hal2mafMP.py", line 311, in <module>
sys.exit(main())
File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/bin/hal2mafMP.py", line 308, in main
runParallelSlices(args)
File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/bin/hal2mafMP.py", line 170, in runParallelSlices
runParallelShellCommands(sliceCmds, options.numProc)
File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/stats/halStats.py", line 41, in runParallelShellCommands
result.get(sys.maxint)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 554, in get
raise self._value
RuntimeError: Command: hal2maf helicobacter/out_complete/helicobacter.hal helicobacter/out_complete/helicobacter_parallel2_hal2mafTempHAKDI_4.maf --unique --noAncestors --refGenome sjm180 --start 1658048 --length 3 exited with non-zero status -11
Thanks, that made it very easy to reproduce the problem.
hal2mafMP.py seg fault on your data when maxRefGap not set:
should be fixed now
hal2mafMP.py creating duplicate blocks when maxRefGap > 0
this is a known issue that should be better documented. Only workaround is to use single process or set maxRefGap to 0 (if you still get dupe blocks under these conditions please let me know). If anyone has efficient code to filter out duplicate MAF blocks, it would be ideal to incorporate it into the merge step of hal2mafMP but I never had time to write this.
The underlying issue is that the parallel slicing occurs along the reference genome. The maxRefGap option allows the tool to follow indel events within the graph, leading to blocks that have no coordinates in the reference. But there's no system in place to canonically map such blocks to their leftmost insertion point in the reference, so I can't check if they belong in a given slice.
-Glenn
On Wed, Mar 12, 2014 at 8:34 AM, Mikhail Kolmogorov < notifications@github.com> wrote:
Hi Glenn, thanks for your reply. Here is the example command line:
hal2mafMP.py helicobacter.hal --maxRefGap 100 --numProc 4 helicobacter.maf --noAncestors --refGenome sjm180
With this file: https://drive.google.com/file/d/0B1pUguR1yn7TSmtzRlNFMnVPa3M/edit?usp=sharing
If I remove --maxRefGap option, the program crashes:
Traceback (most recent call last): File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/bin/hal2mafMP.py", line 311, in
sys.exit(main()) File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/bin/hal2mafMP.py", line 308, in main runParallelSlices(args) File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/bin/hal2mafMP.py", line 170, in runParallelSlices runParallelShellCommands(sliceCmds, options.numProc) File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/stats/halStats.py", line 41, in runParallelShellCommands result.get(sys.maxint) File "/usr/lib/python2.7/multiprocessing/pool.py", line 554, in get raise self._value RuntimeError: Command: hal2maf helicobacter/out_complete/helicobacter.hal helicobacter/out_complete/helicobacter_parallel2_hal2mafTempHAKDI_4.maf --unique --noAncestors --refGenome sjm180 --start 1658048 --length 3 exited with non-zero status -11 Reply to this email directly or view it on GitHubhttps://github.com/glennhickey/hal/issues/5#issuecomment-37403044 .
Sorry for late response. I updated from development branch and now hal2mafMP.py works fine without --maxRefGap. Thanks!
I was a bit confused because I've got overlapping blocks on one particular dataset, but the reason was that there were sequences with same names in different genomes. Do you think user should get a kind of warning in this case? (or advice to use --ucscNames option)
I agree that there should be a warning or error when there are duplicate sequence names. Or maybe we should just make --ucscNames the default. I've run into the same problem before, where I forgot to use --ucscNames on a large alignment and ended up with a pretty useless MAF.
I'll try to get this done sometime soon.
Yes, we do I think throw an exception is a single genome contains a repeat name, but we don't look between samples.
On Wed, Mar 19, 2014 at 11:08 AM, Joel Armstrong notifications@github.comwrote:
I agree that there should be a warning or error when there are duplicate sequence names. Or maybe we should just make --ucscNames the default. I've run into the same problem before, where I forgot to use --ucscNames on a large alignment and ended up with a pretty useless MAF.
I'll try to get this done sometime soon.
Reply to this email directly or view it on GitHubhttps://github.com/glennhickey/hal/issues/5#issuecomment-38086414 .
hal2mafMP.py duplicates aligment blocks when running in multiple processes. i.e. for arbitary maf: ... a s name1 aa bb .. s name2 cc dd ..
a s name1 aa bb .. s name2 cc dd.. ... problem does not appear in ordinary hal2maf executable, so it is probably something wrong with splitting reference. How to reproduce:
hal2maf hal maf hal2mafMP.py hal maf_mp --numProc 4 ls -l
Result: maf_mp is longer