althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License

Load more than 1024 HMM files #48

Closed chtsai0105 closed 1 year ago

chtsai0105 commented 1 year ago

Hi - I have a related question about loading multiple HMM files mentioned in #24.

I'm working on a tool that uses the pyhmmer hmmsearch module to find orthologs among several sequences against the BUSCO dataset. It worked great when searching against the fungi_odb10 dataset (which contains 758 markers). But when I tested with the mammalia_odb10 (9,226 markers) and vertebrata_odb10 (3,354 markers) marker sets, a file-not-found error occurred even though the file does exist.

I ran a bunch of tests and found that the file-not-found error always occurred on the 1020th marker. Later I realized it might be related to a system constraint: many systems limit a user to 1024 open files at the same time (according to the command ulimit -a).
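
For reference, the per-process limit can also be read from Python. This is a minimal sketch assuming a Unix-like system (the standard-library `resource` module is not available on Windows); the helper name `open_file_limits` is made up for illustration. The soft limit it reports is the same value shown by `ulimit -n`.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    # RLIMIT_NOFILE is the maximum number of file descriptors the process may hold.
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
```

A few descriptors are already taken by stdin, stdout, stderr and whatever else the process has open, which would be consistent with the failure showing up slightly before the 1024th file.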

Although this might not be directly related to your package, do you have any suggestions for opening more than 1024 files through the context manager? Or is it possible to keep the HMM information after closing the file?

althonos commented 1 year ago

Hi, you can indeed pre-load all your HMMs in memory first (into a list) so that you don't have to keep all the files open; hmmsearch can be given any iterable of HMM, not just an HMMFile. You could also change the implementation from #24 into an iterator that opens and closes each file in turn instead, e.g.

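import typing
from pyhmmer.plan7 import HMM, HMMFile

class HMMFiles(typing.Iterable[HMM]):
    """Iterate over the HMMs of several files, with one file open at a time."""

    def __init__(self, files):
        self.files = list(files)

    def __iter__(self):
        for file in self.files:
            # Each file is closed before the next one is opened.
            with HMMFile(file) as hmm_file:
                yield from hmm_file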
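
The first suggestion (pre-loading everything into a list) could look roughly like the sketch below. The helper `preload_records` and its `opener` parameter are hypothetical names, not part of pyhmmer; the function is written generically so that any context manager yielding records, such as `HMMFile`, can be passed in. The commented usage assumes `hmm_paths` and `sequences` are your own inputs.

```python
from typing import Callable, ContextManager, Iterable, List, TypeVar

T = TypeVar("T")


def preload_records(
    paths: Iterable[str],
    opener: Callable[[str], ContextManager[Iterable[T]]],
) -> List[T]:
    """Collect every record from every file, one open file at a time."""
    records: List[T] = []
    for path in paths:
        # The `with` block closes each file before the next one is opened,
        # so the number of open descriptors does not grow with len(paths).
        with opener(path) as handle:
            records.extend(handle)
    return records


# Hypothetical usage with pyhmmer (not run here; `hmm_paths` and `sequences`
# are placeholders for your own inputs):
#
#     from pyhmmer import hmmsearch
#     from pyhmmer.plan7 import HMMFile
#
#     hmms = preload_records(hmm_paths, HMMFile)
#     for hits in hmmsearch(hmms, sequences):
#         ...
```

The trade-off is memory: the list holds every HMM at once, whereas the `HMMFiles` iterator above reads them lazily. The list is more convenient if the same HMMs are searched more than once.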

chtsai0105 commented 1 year ago

Thank you! I almost forgot that hmmsearch can also take an iterable of HMMs as input.