Closed: mahdizar closed this issue 6 years ago
That's a lot of errors :)
I'm in the middle of feature development and some code is broken... Anyway, don't use the one from the requirements; use the latest, or maybe this commit: 61ca3dacf2de8ec1f6b6bf59a20a5ac54be47a6d . I'll try to fix it ASAP
AssertionError: Expecting 2 collection _ids, got: ['mygene_20171102_6qat8yzn']: the hub tried to diff two collections automatically, but you currently only have one. If you build a new one, that should work
'INDEXER_CATEGORY' is not defined: check whether it works with the biothings commit I gave you
stat: can't specify None for path argument: you're missing something in the config file; have you defined DATA_ARCHIVE_ROOT or something like that?
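For what it's worth, a minimal hub config fragment would look something like this (the paths are placeholders, and LOG_FOLDER is only an illustrative second setting, not something confirmed in this thread):

```python
# config.py for the hub -- illustrative sketch; adjust paths to your machine.
# DATA_ARCHIVE_ROOT is where dumpers download and archive source files;
# leaving it undefined can surface as "stat: can't specify None for path argument".
DATA_ARCHIVE_ROOT = "/home/mahdi/mygene.info.tmp/data"
LOG_FOLDER = "/home/mahdi/mygene.info.tmp/logs"  # assumed setting name, check your config template
```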
fordist_cleaned_exac_nonTCGA_z_pli_rec_null_data.txt: No such file or directory: FTP error 550, there's something wrong with the ExAC FTP server. I've seen that error before; re-running the command sometimes helps, but something weird is happening there
No such builder for 'mygene': you must have a "mygene" src_build_config document to run this.
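As a sketch, such a document could look like the one below (the field names match the configuration quoted later in this thread; the source names and the pymongo connection details are illustrative assumptions):

```python
# Hypothetical sketch of a "mygene" build configuration document, as stored
# in the hub's internal MongoDB (src_build_config collection).
src_build_config = {
    "_id": "mygene",            # the builder name the merge command looks up
    "name": "mygene",
    "doc_type": "gene",
    "sources": ["entrez_gene", "cpdb"],  # sources to merge; illustrative list
    "root": ["entrez_gene"],             # root source(s) providing base documents
}

# With pymongo one could insert it like this (database name assumed):
#   from pymongo import MongoClient
#   MongoClient().biothings_src.src_build_config.insert_one(src_build_config)
```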
Other errors mostly come from the fact that you need to run entrez first, as other resources depend on Entrez's data.
HTH
Hi again, I updated the mygene repo to include the missing ref_microbe_taxids.pyobj file. It's in the data/ directory; have a look at the README for instructions. Let me know if you need more help.
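For example, copying it into place could look like this (all paths are hypothetical, and the directories and file are created in the snippet only to keep it self-contained; in a real clone the .pyobj file ships with the repo, as does the authoritative location in the README):

```shell
# Hypothetical layout -- adjust REPO and DATA to your clone and your DATA_ARCHIVE_ROOT.
REPO=./mygene.info
DATA=./mygene_data/entrez
mkdir -p "$REPO/src/data" "$DATA"
touch "$REPO/src/data/ref_microbe_taxids.pyobj"  # stand-in; the real file comes with the repo
# Place the file where the entrez uploader expects it (one level above the dated dump dir):
cp "$REPO/src/data/ref_microbe_taxids.pyobj" "$DATA/"
```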
Hi sirloon,
Thank you very much for answering my comment.
I switched to commit 61ca3dacf2de8ec1f6b6bf59a20a5ac54be47a6d of the biothings library, but I ran into issues with the indexmanager definition in hub.py and kept getting an error after a manual index, so I went back to using the master version of the library
Regarding the error stat: can't specify None for path argument: yes, DATA_ARCHIVE_ROOT is defined in the hub's configuration file
Regarding the order of execution of the sources: is it possible to set the order of the scheduled tasks, so that the hub schedules the entrez tasks before any other source?
Massive thanks for updating the repo; I will certainly get back to you if I need more help.
Kind Regards
You're right, better to use the master HEAD now; I fixed a few things, so you should be good to go. Regarding the schedule: each dumper has its own definition, like this: https://github.com/biothings/mygene.info/blob/master/src/hub/dataload/sources/pharmgkb/dump.py#L19 It's a crontab-like notation. We've set those values to build our own mygene.info, but you can change them if you want. It's not ideal though (the schedule lives in the code); eventually this will be taken from a database.
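As a sketch, such a dumper definition looks roughly like the class below (the actual class derives from a biothings dumper base class, omitted here so the snippet stands alone, and the schedule value is illustrative):

```python
# Illustrative sketch of a biothings dumper with a crontab-like SCHEDULE.
# In the real code the class derives from a biothings dumper base class
# (e.g. an HTTP dumper); that inheritance is omitted for self-containment.
class PharmgkbDumper:
    SRC_NAME = "pharmgkb"
    # crontab fields: minute hour day-of-month month day-of-week
    # e.g. run every Monday at 06:00 -- edit this to retime/reorder scheduled dumps
    SCHEDULE = "0 6 * * 1"
```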
Best
Hello sirloon,
Yes, I modified the value of the "SCHEDULE" parameter in the dumper and it works. Thank you very much.
For now, I am trying to schedule an index operation, but I have a problem when running index_manager.index repeatedly.
File "/home/mahdi/opt/hub.mygene.info/lib/python3.5/site-packages/biothings/hub/dataindex/indexer.py", line 374, in index
    raise IndexerException("Index already '%s' exists, (use mode='purge' to auto-delete it or mode='resume' to add more documents)" % index_name)
biothings.hub.dataindex.indexer.IndexerException: Index already 'bao_current' exists, (use mode='purge' to auto-delete it or mode='resume' to add more documents)
Second, I get the following errors if I try the mode "purge" or "resume":
ERROR:asyncio:Exception in callback IndexerManager.index.
and the index is left empty in Elasticsearch.
ERROR:index_bao_current_bao_current_batch_1:search() got an unexpected keyword argument 'fields'
Traceback (most recent call last):
  File "/home/mahdi/opt/hub.mygene.info/lib/python3.5/site-packages/biothings/hub/dataindex/indexer.py", line 567, in indexer_worker
    es_ids = idxr.mexists(ids)
  File "/home/mahdi/opt/hub.mygene.info/lib/python3.5/site-packages/biothings/utils/es.py", line 60, in outter_fn
    return func(*args, **kwargs)
  File "/home/mahdi/opt/hub.mygene.info/lib/python3.5/site-packages/biothings/utils/es.py", line 117, in mexists
    res = self._es.search(index=self._index, doc_type=self._doc_type, body=q, fields=None, size=len(bid_list))
  File "/home/mahdi/opt/hub.mygene.info/lib/python3.5/site-packages/elasticsearch/client/utils.py", line 73, in _wrapped
    return func(*args, params=params, **kwargs)
TypeError: search() got an unexpected keyword argument 'fields'
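For context (this is my reading, not something stated in the thread): elasticsearch-py clients from 5.x onward dropped the fields keyword of search() in favor of stored_fields, which would explain this TypeError when the hub's 2.x-era call runs against a newer client. A hedged sketch of the kind of kwarg rewrite that bridges the two:

```python
def adapt_search_kwargs(kwargs, client_major_version):
    """Rewrite the pre-5.x 'fields' search kwarg for newer elasticsearch-py clients.

    Sketch only: elasticsearch-py 5.x renamed search()'s 'fields' parameter
    to 'stored_fields'; passing the old name raises the TypeError seen above.
    """
    kwargs = dict(kwargs)  # copy so the caller's dict is untouched
    if client_major_version >= 5 and "fields" in kwargs:
        kwargs["stored_fields"] = kwargs.pop("fields")
    return kwargs

# e.g. adapt_search_kwargs({"fields": None, "size": 10}, 5)
#      -> {"size": 10, "stored_fields": None}
```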
Running with the "resume" mode fails on the master HEAD commit in both cases: with an already-created index and with an empty Elasticsearch.
Looking forward to hearing from you soon.
Thanks in advance!
Hi, recently we've been working on providing standalone instances, i.e. Docker images you can download and use to set up your own BioThings API (typically to run your own mygene.info API on your own hardware, which seems to be what you'd like to do). See here for more: http://docs.biothings.io/en/latest/doc/standalone.html. These instances keep data up-to-date by directly downloading the data we release each week for mygene.info. There are no dumpers/uploaders/... it's just the final "compiled"/merged data.
Though it's still under development, it may be interesting for what you're trying to achieve. Let me know if you're interested. You can reach us at help@mygene.info to discuss this further.
I'll close the issue for now.
Hello everyone!
Lately I have been trying to run the project on Ubuntu 16.04 and I am facing the following issues:
If I run the hub with the biothings library pinned in the requirements, i.e.
git+https://github.com/biothings/biothings.api.git@1c3227c397250daaeef387be743d9e60c26d9bdd#egg=biothings
I get the following error: ImportError: cannot import name 'ThrottledESJsonDiffSyncer', so I had to use the master branch of the biothings library.
During a manual merge of the following configuration: { "_id" : "mygene", "name" : "mygene", "sources" : [ "cpdb" ], "root" : [ "cpdb" ], "doc_type" : "gene" }, I get the following error: AssertionError: Expecting 2 collection _ids, got: ['mygene_20171102_6qat8yzn']
During a manual index of the following build: mygene_20171102_6qat8yzn, I get the following error: NameError: name 'INDEXER_CATEGORY' is not defined
During a scheduled dump of pharmgkb resource, I get the following error: ERROR:pharmgkb_dump:Error while dumping source: stat: can't specify None for path argument
During a scheduled dump of homologene resource, I get the following errors:
During a scheduled dump of exac resource, I get the following errors:
During a scheduled dump of entrez resource, I get the following error: ERROR:entrez.entrez_genomic_pos_batch_1:[Errno 2] No such file or directory: '/home/mahdi/mygene.info.tmp/data/entrez/20171028/../ref_microbe_taxids.pyobj'
During a scheduled dump of refseq resource, I get the following error: ERROR:refseq_dump:Error while dumping source: EOFError
During a scheduled dump of generif resource, I get the following error: ERROR:generif_dump:Error while dumping source: EOFError
During a scheduled dump of ensembl resource, I get the following errors:
During a scheduled merge, I get the following error: ERROR:asyncio:Exception in callback Cron.set_result(<_GatheringFu... 'mygene'",)]>) handle: <Handle Cron.set_result(<_GatheringFu... 'mygene'",)]>)>: No such builder for 'mygene'
Finally, could you please explain how the hub schedules and performs the index operation automatically?
You can find a more detailed trace of these issues in the attached file: Errors of mygene.info.txt
Looking forward to hearing from you soon.
Thanks in advance!