NCATS-Gamma / robokop

Master UI for ROBOKOP
MIT License

Unable to build robokop-kg from scratch #493

Closed: Prakash2403 closed this issue 4 years ago

Prakash2403 commented 4 years ago

I wanted to see how data is ingested into neo4j, so I started with an empty neo4j database and ran docker exec $(docker ps -f name=interfaces -q) bash -c "source robokop-interfaces/deploy/setenv.sh && robokop-interfaces/initialize_type_graph.sh" to set up the initial database. Everything went fine up to this point.

Now, let's say I want to add the chembio data (the corresponding service is available and is present in greent/core.py). How do I do that?

Edit: Upon inspection, I found two scripts which can do this:

  1. runthemall.sh: throws an error; I opened issue #492 for that.
  2. crawl_all.py: located at robokop-interfaces/crawler/crawl_all.py. I ran python crawl_all.py -sv chembio. It seems to download a lot of data, but nothing gets pushed to neo4j.

YaphetKG commented 4 years ago

@Prakash2403, I posted some changes for builder/writer.py. I think the issue was that the threads created by the writer process upon message arrival from the broker queue end up holding fewer messages than the writer expects; since subsequent messages are delivered to different threads, slow queue deliveries make this a problem. Thanks for bringing this up.
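The failure mode described above can be sketched in a few lines. This is a hypothetical simulation, not ROBOKOP's actual writer code: messages are spread round-robin across per-thread buffers, each of which only flushes once it reaches a threshold, so spreading the same workload over more threads can mean no buffer ever flushes.

```python
# Hypothetical sketch of the failure mode: each worker thread has its own
# write buffer that only flushes to the database once it reaches the
# threshold; FLUSH_THRESHOLD and distribute() are illustrative names.
FLUSH_THRESHOLD = 100

def distribute(messages, n_workers):
    """Round-robin messages across per-worker buffers; count flushes."""
    buffers = [[] for _ in range(n_workers)]
    flushes = 0
    for i, msg in enumerate(messages):
        buf = buffers[i % n_workers]
        buf.append(msg)
        if len(buf) >= FLUSH_THRESHOLD:
            buf.clear()  # pretend we batch-wrote to neo4j
            flushes += 1
    return flushes, [len(b) for b in buffers]

# 300 messages through one worker flush three times, but the same 300
# spread across 4 workers leave every buffer below the threshold,
# so nothing is ever written.
single_flushes, _ = distribute(range(300), 1)
multi_flushes, leftovers = distribute(range(300), 4)
```

With four workers each buffer only ever holds 75 messages, which never reaches the threshold of 100; the data sits in memory and never appears in the database.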

Prakash2403 commented 4 years ago

@YaphetKG I think the problem still exists. I ran python crawl_all.py -sv chembio two hours ago, and there is still no entry in neo4j. Also, any updates on runthemall.sh?

Thanks for the help.

Edit: Looked at the logs. Found a lot of lines like this:

Failed to get response from https://onto.renci.org/synonyms/ENSEMBL:ENSG00000101440. Status code 500

Is this an issue?
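One way such 500s could be treated as non-fatal is to skip the failed lookup and move on. This is a minimal sketch, not ROBOKOP's actual code; `safe_synonyms` and `fake_fetch` are hypothetical names, and the fetch function is injected so the behaviour can be exercised without the network:

```python
# Hypothetical sketch: treat synonym-service failures (like the 500 in the
# log line above) as non-fatal by returning an empty synonym list.
def safe_synonyms(curie, fetch):
    """Return the synonym list for `curie`, or [] if the service errors."""
    try:
        status, body = fetch(f"https://onto.renci.org/synonyms/{curie}")
    except OSError:
        return []
    if status != 200:
        return []  # e.g. the 500s seen in the crawler logs
    return body

def fake_fetch(url):
    # Stand-in for the real HTTP call; this identifier returned 500.
    if "ENSG00000101440" in url:
        return 500, None
    return 200, ["GHRH", "somatoliberin"]
```

Whether the real crawler already does something like this (or whether the 500s cause the missing data) would need confirmation from the maintainers.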

Prakash2403 commented 4 years ago

@YaphetKG

Tried to build with python crawl_all.py -sv biolink.

Since I just wanted to build a PoC, I did some changes to the code.

  1. Added a break statement below this line
  2. Added a break statement after every lids.append in this function.

What I thought would happen:

As soon as it finds one valid data point for a given input_type, it will process it and insert it into neo4j.

What actually happened:

As per the logs, processing went fine, but no data shows up in neo4j.

Also, can you kindly tell me the steps to build and populate the DB from scratch? (Apart from the ones mentioned in the README.)

EDIT: Looked at the code for the BufferedWriter class inside robokop-interfaces/greent/export.py. It seems that it only writes to the database once the buffer reaches node_buffer_size (100). Am I right?
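If that reading is right, the behaviour can be illustrated with a small sketch. This is a hypothetical stand-in for BufferedWriter (the class name, `sink`, and method names here are illustrative, not ROBOKOP's API), showing why a small PoC run never reaches the flush threshold:

```python
# Hypothetical sketch of the flush-threshold behaviour: nodes accumulate
# in a buffer and are only written out once the buffer reaches
# node_buffer_size, or when flush() is called explicitly.
class BufferedWriterSketch:
    def __init__(self, sink, node_buffer_size=100):
        self.sink = sink                   # stands in for the neo4j session
        self.node_buffer_size = node_buffer_size
        self.buffer = []

    def write_node(self, node):
        self.buffer.append(node)
        if len(self.buffer) >= self.node_buffer_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.extend(self.buffer)  # batch write to the database
            self.buffer.clear()

written = []
w = BufferedWriterSketch(written, node_buffer_size=100)
for i in range(40):                        # small PoC run: only 40 nodes
    w.write_node(i)
# `written` is still empty here -- the buffer never hit 100. Lowering
# node_buffer_size, or flushing at shutdown, gets the data into the sink.
```

This matches the symptom: a small test crawl produces fewer nodes than the buffer size, so nothing ever appears in neo4j even though processing "went fine".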

EDIT 2: It worked for biolink :). I removed all the break statements and reduced the buffer size to 5. Worked like a charm. Not sure about chembio though.