DessimozLab / OMArk

omark_contextualize.py ERRORS: Max retries exceeded / Too many open files #25

Open nam-hoang opened 8 months ago

nam-hoang commented 8 months ago

Dear OMArk team,

I am testing omark_contextualize.py using the provided example data as well as my real data, and ran into an error that seems related to the API connection: "Max retries exceeded" / "too many open files".

16%|█████████████▋ | 412/2657 [04:17<22:35, 1.66it/s]
requests.exceptions.SSLError: HTTPSConnectionPool(host='omabrowser.org', port=443): Max retries exceeded with url: /api/protein/12907256/ (Caused by SSLError(OSError(24, 'Too many open files')))

The omark_contextualize.py fragment and omark_contextualize.py missing runs completed with the example data (fewer sequences), but those with my data (more sequences) stopped midway. The same error also occurred when I tried omark_contextualize.py assembly, for both the example data and my data.

I wonder if you would be able to advise me here? What could be the cause of this, and is there anything I could change to make it work? I would like to use this tool to improve my genome annotation.

Thank you very much, and I look forward to hearing from you.

Best regards, Nam

alpae commented 8 months ago

Hi @nam-hoang ,

the error looks a lot like a temporary problem connecting to the OMA browser API. Could you give us some more detail on how and when you run that script?

Thanks Adrian

nam-hoang commented 8 months ago

Thanks Adrian @alpae for your reply,

I ran the above commands separately on a Linux server (Ubuntu). Basically, the command was just like this:

python omark_contextualize.py fragment -m example_data/UP000005640_9606.omamer -o example_data/omark_output/ -f example_data/omark_contextualize_fragment.fa

Then, to troubleshoot, I also tested the Jupyter notebook Contextualize_OMA.ipynb. Because I had some issues connecting my local browser to the Jupyter notebook on the server, I ran the script directly within Python on the server. Everything went smoothly until the following step, where I encountered the same 'Too many open files' error. As a result, I did not get the final fasta file.

Extract uniq HOGs

uniq_HOGs = list(possible_fragments['HOG'].unique())
hog_to_medseqlen = {k: v for k, v in zip(possible_fragments['HOG'], possible_fragments['subfamily_medianseqlen'])}
hog_genes = {}
for hog, seq in zip(possible_fragments['HOG'], possible_fragments['gene']):
    # collect the genes belonging to each HOG
    glist = hog_genes.get(hog, [])
    glist.append(seq)
    hog_genes[hog] = glist
print(f'{len(hog_genes)} different HOGs')

I later found out that it could only get through a bit more than 1,000 HOGs before hitting that error, while I have 2,657 uniq_HOGs in total. So the way I worked around this was to split the uniq_HOGs set into 3 subsets, run each one to write out a fasta file, and finally concatenate the 3 fasta files into one for the miniprot mapping step. Each subset had to be run in a fresh Python session, or else it would throw the same error as above; a sketch of the splitting step follows.
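For reference, here is a minimal sketch of that splitting step (uniq_HOGs is the list built in the cell above; process_chunk is only a hypothetical placeholder for the fetch-and-write-fasta code):

def split_into_chunks(items, n_chunks=3):
    # Split a list into n_chunks roughly equal consecutive slices.
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

for idx, chunk in enumerate(split_into_chunks(uniq_HOGs, 3)):
    print(f'chunk {idx}: {len(chunk)} HOGs')
    # process_chunk(chunk, out_fasta=f'fragments_{idx}.fa')  # hypothetical helper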

Please let me know if you need any further information. Thank you very much. Nam

alpae commented 8 months ago

Hi @nam-hoang

indeed, it seems that the API client creates too many fresh sockets without properly cleaning them up. Fixing this requires a bit more time, but as a workaround you can simply increase the limit on open files. You can do this with ulimit -n in the shell before starting the Python code. On Linux systems the default is usually 1024, so maybe just set it to 16000:

ulimit -n 16000
python utils/omark_contextualize.py ... 
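If the code runs inside a Python session instead (as in your notebook case above), the same soft limit can usually be raised from within the process using the standard resource module; a minimal sketch, Unix only:

import resource

# Raise the soft limit on open file descriptors, capped at the hard limit;
# no extra privileges are needed as long as we stay below the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(16000, hard), hard))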

We will try to fix this properly in the omadb package in the future.
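For the curious, a rough sketch of the kind of fix meant here: reuse a single HTTP session with keep-alive and a retry policy instead of opening a fresh connection per request. The /api/protein/<id>/ endpoint is taken from the traceback above; the rest is illustrative, not the actual omadb code.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# One session = one connection pool: sockets are kept alive and reused
# instead of a new one being opened (and potentially leaked) per request.
session = requests.Session()
retries = Retry(total=5, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503])
session.mount('https://', HTTPAdapter(max_retries=retries, pool_maxsize=10))

def fetch_protein(entry_id):
    r = session.get(f'https://omabrowser.org/api/protein/{entry_id}/', timeout=30)
    r.raise_for_status()
    return r.json()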

nam-hoang commented 8 months ago

Thank you very much! Happy Holidays~ @alpae