ComparativeGenomicsToolkit / hal

Hierarchical Alignment Format

Support large numbers of children #212

Closed: glennhickey closed this issue 4 months ago

glennhickey commented 3 years ago

When calling this in hdf5Genome.cpp:

  _bottomArray.create(&_group, bottomArrayName, Hdf5BottomSegment::dataType(numChildren), numBottomSegments + 1, &botDC,  _numChunksInArrayBuffer); 

The compound datatype returned by Hdf5BottomSegment::dataType grows with the number of children. Apparently the size of the datatype cannot exceed 64 KB, which in practice seems to cap the number of children at about 545.
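For intuition, a back-of-the-envelope sketch of where the ~545 cap could come from. Assuming the 64 KB ceiling applies to the serialized description of the compound datatype (member names plus metadata), and assuming roughly 120 bytes of description per child (both figures are assumptions for illustration, not measured values from HAL):

```python
# Rough estimate of the child cap implied by a 64 KiB limit on the size
# of the serialized compound-datatype description.
# ASSUMPTION: ~120 bytes of description (member names + metadata) per child;
# the real per-child cost depends on field naming and HDF5 internals.
DATATYPE_SIZE_LIMIT = 64 * 1024   # 64 KiB cap on the datatype description
BYTES_PER_CHILD = 120             # assumed serialized cost per child

max_children = DATATYPE_SIZE_LIMIT // BYTES_PER_CHILD
print(max_children)  # ~546, in the same ballpark as the observed cap of ~545
```

This is only meant to show that a fixed byte budget divided by a per-child cost lands in the right range; the exact constants inside HDF5 differ.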

This has come up as an issue because someone was crazy enough to try the Cactus Pangenome Pipeline, which uses a star tree, on thousands of bacterial genomes. As they mention, there seems to be hope for a workaround in the form of H5Pset_attr_phase_change().

There is some documentation on it (and another possible workaround) here. It seems simple enough to be worth a try, and since the number of children is (I think) always known a priori, the toggle can be used only when absolutely necessary, preserving backwards compatibility.

The more immediate workaround is to use --format mmap, but I'm not sure how robust that will be (it produced a corrupt file with my small maf2hal test)

diekhans commented 3 years ago

Glenn Hickey @.***> writes:

The more immediate workaround is to use --format mmap, but I'm not sure how robust that will be (it produced a corrupt file with my small maf2hal test)

When we implemented the mmap format, maf2hal didn't really work reliably with HDF5, so it was probably never tested with mmap.

glennhickey commented 3 years ago

When I add

    hid_t cparms_id = cparms.getId();
    herr_t ret = H5Pset_attr_phase_change(cparms_id, 0, 0);
    assert(ret >= 0);

above _dataSet = _file->createDataSet(_path, _dataType, _dataSpace, cparms); in hdf5ExternalArray::create()

I get the same error as before. @jrvalverde I think your best bet for a workaround is adding --format mmap to the command that is creating your HAL file. This bypasses hdf5 entirely in favour of a custom format. It's much faster but since it's not compressed, also much bigger. All hal tools should work natively on it and it won't be subject to this particular limitation (still can't guarantee it'll work though).

jrvalverde commented 3 years ago

Thanks, I will try that later, I do not really worry much about space, I think I have plenty, so if it works that's great for me.

It'll just take some time. I've got to go get the COVID vaccine today, correct a paper and answer a number of student requests, so today's plenty of work. I think I'll launch the run now and see if I can monitor the results later. Yes, that'll be the best. I'll let you know how it goes. Then later I'll also have a go at the hal code as well.

Thank you so very much, your help and support are excellent!

            j


-- Scientific Computing Service Centro Nacional de Biotecnología, CSIC. c/Darwin, 3. 28049 Madrid +34 91 585 45 05 +34 659 978 577

jrvalverde commented 3 years ago


Thank you so very much. I have tried manually running halAppendCactusSubtree with the --format mmap option and it did complete, generating the hal file; so I have modified the source code in cactus_progressive.py and in cactus_constructFromIntermediates.py to add the '--format' and 'mmap' strings to the start of the argument list and am running cactus again. For now it seems to work; I'm leaving it running and will see what happened tomorrow.

If it works, I'll send you back a full list of all the changes I made to the source code to make it run.

Crossing my fingers,

            j


hhuili commented 2 years ago

Hello, I have encountered the same problem. Could you please provide the script you modified? Thank you very much🥳! This is my email: ahhhoh@163.com

glennhickey commented 2 years ago

You'd want to change this line

cactus_call(parameters=["halAppendCactusSubtree"] + args)

to

cactus_call(parameters=["halAppendCactusSubtree"] + args + ['--format', 'mmap'])

(making sure this happens in your virtualenv, or you rerun pip install -U . after making the change). This should get around the child limit, at the cost of making a much bigger HAL file.

This is an issue (along with some other HDF5-related bottlenecks) that remains on our radar and I hope to fix in the next few months. It's a big project though.

hhuili commented 2 years ago


Thank you very much! I modified the script as you said, but another error occurred:

[2021-12-08T19:40:38+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
[2021-12-08T19:40:38+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-JobFunctionWrappingJob/instance-mvvjisx5/file-d37c3cd725bb48e19fe0ad48af1c75fd/Anc0_experiment.xml' to path '/tmp/b6383c7845f0557386da6b815d4b0c5e/2f51/7e66/tmp5cw6ux__.tmp'
[2021-12-08T19:40:38+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-f0bf8c828de04348822d21edae12c5ae/config.xml' to path '/tmp/b6383c7845f0557386da6b815d4b0c5e/2f51/7e66/tmp4bs5x19w.tmp'
[2021-12-08T19:40:38+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-CactusConsolidated/instance-y685hzgx/file-b8a1aa658d564e79bf19966de73f6996/tmppad1_0gh.tmp' to path '/tmp/b6383c7845f0557386da6b815d4b0c5e/2f51/7e66/tmp192jqjq9.tmp'
[2021-12-08T19:40:38+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-CactusConsolidated/instance-y685hzgx/file-3d6b927ed7ca4532b42ae6c21fc1ac2b/tmpil1oarxx.tmp' to path '/tmp/b6383c7845f0557386da6b815d4b0c5e/2f51/7e66/tmpkvkxoaqn.tmp'
Traceback (most recent call last):
  File "/public/zpmiao/software/cactus-bin-v2.0.4/venv/lib/python3.8/site-packages/toil/worker.py", line 393, in workerScript
    job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
  File "/public/zpmiao/software/cactus-bin-v2.0.4/venv/lib/python3.8/site-packages/toil/job.py", line 2360, in _runner
    returnValues = self._run(jobGraph=None, fileStore=fileStore)
  File "/public/zpmiao/software/cactus-bin-v2.0.4/venv/lib/python3.8/site-packages/toil/job.py", line 2281, in _run
    return self.run(fileStore)
  File "/public/zpmiao/software/cactus-bin-v2.0.4/venv/lib/python3.8/site-packages/toil/job.py", line 2504, in run
    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
  File "/public/zpmiao/software/cactus-bin-v2.0.4/venv/lib/python3.8/site-packages/cactus/progressive/cactus_progressive.py", line 335, in exportHal
    cactus_call(parameters=["halSetMetadata", HALPath, "CACTUS_COMMIT", cactus_commit])
  File "/public/zpmiao/software/cactus-bin-v2.0.4/venv/lib/python3.8/site-packages/cactus/shared/common.py", line 866, in cactus_call
    raise RuntimeError("Command {} exited {}: {}".format(call, process.returncode, out))
RuntimeError: Command ['docker', 'run', '--interactive', '--net=host', '--log-driver=none', '-u', '1004:1005', '-v', '/tmp/b6383c7845f0557386da6b815d4b0c5e/2f51/7e66:/data', '--entrypoint', '/opt/cactus/wrapper.sh', '--name', '84221427-ebf1-4f92-871b-1d45928864cb', '--rm', 'quay.io/comparative-genomics-toolkit/cactus:eca7219f3943465b73f240dd86b5e8e228162144', 'halSetMetadata', 'tmp_alignment.hal', 'CACTUS_COMMIT', 'eca7219f3943465b73f240dd86b5e8e228162144'] exited 1: stdout=None, stderr=Running command catchsegv 'halSetMetadata' 'tmp_alignment.hal' 'CACTUS_COMMIT' 'eca7219f3943465b73f240dd86b5e8e228162144'
terminate called after throwing an instance of 'hal_exception'
  what():  tmp_alignment.hal: file is marked as dirty, most likely an inconsistent state.
Aborted (core dumped)

[2021-12-08T19:40:38+0800] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host localhost.localdomain

<=========

glennhickey commented 2 years ago

Oh, you may need to add --format mmap to all subsequent hal commands in the method or, even safer, just comment them out.

diekhans commented 2 years ago

--format mmap only needs to be specified when the hal is created. Subsequent commands recognize the format by examining the header of the file.

The above exception indicates that file was not successfully closed during the last write operation.
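The header check described above can be illustrated generically. HDF5 files begin with the standard 8-byte superblock signature `\x89HDF\r\n\x1a\n`; a file that lacks it would be treated as the other format. This is only a sketch of the idea, not HAL's actual detection code, and the mmap branch here is an assumption:

```python
import tempfile

# Standard HDF5 superblock signature (first 8 bytes of any HDF5 file).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def guess_hal_format(path):
    """Sketch of header-based format detection (NOT HAL's real code):
    files starting with the HDF5 signature are HDF5; otherwise assume mmap."""
    with open(path, "rb") as f:
        magic = f.read(len(HDF5_SIGNATURE))
    return "hdf5" if magic == HDF5_SIGNATURE else "mmap"

# Demo on two throwaway files.
with tempfile.NamedTemporaryFile(delete=False) as t:
    t.write(HDF5_SIGNATURE + b"...rest of file...")
print(guess_hal_format(t.name))   # hdf5

with tempfile.NamedTemporaryFile(delete=False) as t2:
    t2.write(b"something else entirely")
print(guess_hal_format(t2.name))  # mmap
```

This is why only the creating command needs --format: every later tool can sniff the header and pick the right backend itself.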

diekhans commented 2 years ago

This is because we never put in any kind of locking in hal mmap, assuming only one accessor for writing
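The "dirty" error follows a simple pattern: a flag is set when a write begins and cleared only on a clean close, so a crash or an unsynchronized second writer leaves it set. A minimal toy model of that pattern, assuming nothing about HAL's actual on-disk layout:

```python
class ToyMmapHeader:
    """Toy model of a dirty-flag guard (NOT HAL's real on-disk format).
    The flag is set when a write starts and cleared only on a clean close."""

    def __init__(self):
        self.dirty = False

    def open_for_write(self):
        if self.dirty:
            # corresponds to: "file is marked as dirty, most likely
            # an inconsistent state."
            raise RuntimeError("file is marked as dirty")
        self.dirty = True

    def close(self):
        self.dirty = False  # a crash before this line leaves the flag set


hdr = ToyMmapHeader()
hdr.open_for_write()
# Simulate a crash: close() never runs, so the flag stays set and any
# later open refuses to proceed.
try:
    hdr.open_for_write()
except RuntimeError as e:
    print("refused:", e)
```

With no locking, two concurrent writers hit exactly this: the second one either sees the first one's dirty flag or silently corrupts the file.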

hhuili commented 2 years ago


Thank you very much for your help! It worked! But when I use hal2maf --format mmap to convert it to a maf file, I get the same error; it seems that I can't convert the hal file to other formats. Is there another way to convert this hal file? Actually, what I really need is a maf or fasta file of the whole-genome multiple alignment.

hal exception caught: out.hal: file is marked as dirty, most likely an inconsistent state.
hhuili commented 2 years ago


As you said, I only specified --format mmap when the hal file was created (I commented out the subsequent hal commands), and I successfully got an output hal file, but when I converted it to other formats, I got the same error. How should I solve it? I would very much appreciate it if you can help me!😄😄

diekhans commented 2 years ago

The error means that the HAL file was being written and that the write didn't complete successfully. Since the mmap format is not transactional, there is no way to recover from it. I don't believe an HDF5 file can be recovered either if a write operation doesn't complete.

hhuili commented 2 years ago

Thank you, one last question😂. I would like to know if cactus-align produces an intermediate alignment file, such as fasta or maf, before it is converted into a hal file?