ComparativeGenomicsToolkit / hal

Hierarchical Alignment Format
Other
160 stars 38 forks source link

hal2mafMP.py duplicates alignment blocks #5

Open mikolmogorov opened 10 years ago

mikolmogorov commented 10 years ago

hal2mafMP.py duplicates aligment blocks when running in multiple processes. i.e. for arbitary maf: ... a s name1 aa bb .. s name2 cc dd ..

a s name1 aa bb .. s name2 cc dd.. ... problem does not appear in ordinary hal2maf executable, so it is probably something wrong with splitting reference. How to reproduce:

hal2maf hal maf hal2mafMP.py hal maf_mp --numProc 4 ls -l

Result: maf_mp is longer

glennhickey commented 10 years ago

Hi Mikhail,

Which options are you using? In particular, are you specifying --maxRefGap

0?

thanks -Glenn

On Fri, Mar 7, 2014 at 11:15 AM, Mikhail Kolmogorov < notifications@github.com> wrote:

hal2mafMP.py duplicates aligment blocks when running in multiple processes. i.e. for arbitary maf: ... a s name1 aa bb .. s name2 cc dd ..

a s name1 aa bb .. s name2 cc dd.. ... problem does not appear in ordinary hal2maf executable, so it is probably something wrong with splitting reference. How to reproduce:

hal2maf hal maf hal2mafMP.py hal maf_mp --numProc 4 ls -l

Result: maf_mp is longer

Reply to this email directly or view it on GitHubhttps://github.com/glennhickey/hal/issues/5 .

mikolmogorov commented 10 years ago

Hi Glenn, thanks for your reply. Here is the exact command line:

hal2mafMP.py helicobacter.hal --maxRefGap 100 --numProc 4 helicobacter.maf --noAncestors --refGenome sjm180

With this file: https://drive.google.com/file/d/0B1pUguR1yn7TSmtzRlNFMnVPa3M/edit?usp=sharing Result - some of the alignment blocks produced are overlapping (which doesn't appear in non-parallel version)

If I remove --maxRefGap option, the program even crashes:

Traceback (most recent call last):
  File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/bin/hal2mafMP.py", line 311, in <module>
    sys.exit(main())
  File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/bin/hal2mafMP.py", line 308, in main
    runParallelSlices(args)
  File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/bin/hal2mafMP.py", line 170, in runParallelSlices
    runParallelShellCommands(sliceCmds, options.numProc)
  File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/stats/halStats.py", line 41, in runParallelShellCommands
    result.get(sys.maxint)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
RuntimeError: Command: hal2maf helicobacter/out_complete/helicobacter.hal helicobacter/out_complete/helicobacter_parallel2_hal2mafTempHAKDI_4.maf --unique --noAncestors --refGenome sjm180 --start 1658048 --length 3 exited with non-zero status -11
glennhickey commented 10 years ago

Thanks, that made it very easy to reproduce the problem.

hal2mafMP.py seg fault on your data when maxRefGap not set:

should be fixed now

hal2mafMP.py creating duplicate blocks when maxRefGap > 0

this is a known issue that should be better documented. Only workaround is to use single process or set maxRefGap to 0 (if you still get dupe blocks under these conditions please let me know). If anyone has efficient code to filter out duplicate MAF blocks, it would be ideal to incorporate it into the merge step of hal2mafMP but I never had time to write this.

The underlying issue is that the parallel slicing occurs along the reference genome. The maxRefGap option allows the tool to follow indel events within the graph, leading to blocks that have no coordinates in the reference. But there's no system in place to canonically map such blocks to their leftmost insertion point in the reference, so I can't check if they belong in a given slice.

-Glenn

On Wed, Mar 12, 2014 at 8:34 AM, Mikhail Kolmogorov < notifications@github.com> wrote:

Hi Glenn, thanks for your reply. Here is the example command line:

hal2mafMP.py helicobacter.hal --maxRefGap 100 --numProc 4 helicobacter.maf --noAncestors --refGenome sjm180

With this file: https://drive.google.com/file/d/0B1pUguR1yn7TSmtzRlNFMnVPa3M/edit?usp=sharing

If I remove --maxRefGap option, the program crashes:

Traceback (most recent call last): File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/bin/hal2mafMP.py", line 311, in sys.exit(main()) File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/bin/hal2mafMP.py", line 308, in main runParallelSlices(args) File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/bin/hal2mafMP.py", line 170, in runParallelSlices runParallelShellCommands(sliceCmds, options.numProc) File "/home/volrath/Bioinf/tools/progressiveCactus/submodules/hal/stats/halStats.py", line 41, in runParallelShellCommands result.get(sys.maxint) File "/usr/lib/python2.7/multiprocessing/pool.py", line 554, in get raise self._value RuntimeError: Command: hal2maf helicobacter/out_complete/helicobacter.hal helicobacter/out_complete/helicobacter_parallel2_hal2mafTempHAKDI_4.maf --unique --noAncestors --refGenome sjm180 --start 1658048 --length 3 exited with non-zero status -11

Reply to this email directly or view it on GitHubhttps://github.com/glennhickey/hal/issues/5#issuecomment-37403044 .

mikolmogorov commented 10 years ago

Sorry for late response. I updated from development branch and now hal2mafMP.py works fine without --maxRefGap. Thanks!

I was a bit confused because I've got overlapping blocks on one particular dataset, but the reason was that there were sequences with same names in different genomes. Do you think user should get a kind of warning in this case? (or advice to use --ucscNames option)

joelarmstrong commented 10 years ago

I agree that there should be a warning or error when there are duplicate sequence names. Or maybe we should just make --ucscNames the default. I've run into the same problem before, where I forgot to use --ucscNames on a large alignment and ended up with a pretty useless MAF.

I'll try to get this done sometime soon.

benedictpaten commented 10 years ago

Yes, we do I think throw an exception is a single genome contains a repeat name, but we don't look between samples.

On Wed, Mar 19, 2014 at 11:08 AM, Joel Armstrong notifications@github.comwrote:

I agree that there should be a warning or error when there are duplicate sequence names. Or maybe we should just make --ucscNames the default. I've run into the same problem before, where I forgot to use --ucscNames on a large alignment and ended up with a pretty useless MAF.

I'll try to get this done sometime soon.

Reply to this email directly or view it on GitHubhttps://github.com/glennhickey/hal/issues/5#issuecomment-38086414 .