Closed: cebel closed this issue 7 years ago
I found an article that recommends setting collect_ids=False, which I did:
from lxml import etree

parser = etree.XMLParser(collect_ids=False)
entries = etree.fromstringlist(entries_xml, parser)
But even this does not help.
I found a workaround: reload lxml's etree module for every chunk of XML entries. Crude, but it works, and memory usage is dramatically reduced.
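A minimal sketch of that workaround, assuming the entries arrive as lists of XML string fragments (`parse_chunk` and `entries_xml` are illustrative names, not pyuniprot API):

```python
import importlib
import lxml.etree


def parse_chunk(entries_xml):
    # Reload lxml.etree before parsing each chunk, so that any module-level
    # state accumulated during previous chunks is rebuilt from scratch.
    etree = importlib.reload(lxml.etree)
    parser = etree.XMLParser(collect_ids=False)
    return etree.fromstringlist(entries_xml, parser)


chunk = ["<entries>", "<entry id='1'/>", "</entries>"]
root = parse_chunk(chunk)
```

Reloading a module per chunk is expensive and clearly a stopgap, but it bounds the growth observed above.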
import pyuniprot
pyuniprot.update(taxids=[9606,10090,10116]) # human, mouse, rat
After ~42k entries in the database, the update process consumes ~12 GB of memory. I assume the problem is in lxml; I found several articles describing similar problems with big XML files. But here I avoid loading the whole document into the iterparser (I tested that, and it starts directly at 5 GB of memory consumption and then constantly increases). If this problem can't be solved, it doesn't seem feasible to load the whole of UniProt.
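For reference, the usual way to keep `iterparse` memory flat is to clear each element after processing and drop already-processed siblings from the parent; without this, lxml keeps the whole tree in memory even when iterating. This is a general sketch, not pyuniprot's actual code, and the tag name is an assumption:

```python
from io import BytesIO
from lxml import etree


def iter_entries(source, tag="entry"):
    # Stream over matching elements; clear each one after it has been
    # consumed so the in-memory tree does not grow with the file.
    for _, elem in etree.iterparse(source, tag=tag):
        yield elem
        elem.clear()
        # The parent still references processed siblings; delete them too.
        while elem.getprevious() is not None:
            del elem.getparent()[0]


# Usage with an in-memory document; a real run would pass a file path.
xml = b"<root><entry id='1'/><entry id='2'/></root>"
ids = [e.get("id") for e in iter_entries(BytesIO(xml))]
```

If pyuniprot's parser already does this and memory still grows, the leak is more likely in objects retained per entry (e.g. ID dictionaries) than in the tree itself.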