Issues with custom data

Rothamsted / knetbuilder

KnetBuilder data integration platform for building knowledge graphs. Previously known as ondex.

https://knetminer.com

MIT License


Issues with custom data #35

Closed man4ish closed 3 years ago

man4ish commented 3 years ago

I have created poplar data based on the potato tutorial data format. I am able to create the OXL file using:

./ondex-mini/runme.sh /var/www/html/knet/poplar_data/workflow.xml "baseDir=/var/www/html/knet/poplar_data"

All inputs and output (kg_final.oxl) are hosted at http://ec2-18-225-37-206.us-east-2.compute.amazonaws.com/knet/ temporarily.

How can I verify that kg_final.oxl is OK?

I tried to query the network using KnetMiner and it did not work. Attached is the log file (ws.log) for the run.

marco-brandizi commented 3 years ago

That's not easy to do. We have plans to develop a small Ondex plugin which would be able to invoke Groovy scripts to run automated tests against the OXL graph.

For the moment, we do a number of things:

We use the Ondex Desktop application to load the graph and do things like searching nodes and relations. This works, but requires the usual large amount of RAM.
We write integration tests like this. As you can see, this is based on Java and the OXL/Ondex interfaces that are used everywhere in our software.
I convert to RDF, reload into a triple store (e.g., Jena Fuseki) and issue SPARQL queries. The same can be done with Neo4j/Cypher. We have converters for that on the Download page.
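As a lighter-weight first check than any of the options above, one could simply count the XML elements in the file. The sketch below is not part of KnetBuilder; it assumes that an OXL file is XML, optionally gzip-compressed, and the function name `count_oxl_elements` is made up for illustration. It says nothing about whether the graph is semantically correct, only whether it is well-formed and non-empty.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def count_oxl_elements(path):
    """Count XML elements by local tag name in a (possibly gzipped) XML file.

    Assumption: the OXL file is XML, optionally gzip-compressed.
    """
    # Detect gzip by its two magic bytes rather than trusting the extension.
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open

    counts = Counter()
    with opener(path, "rb") as fh:
        # iterparse streams the document instead of loading it all at once.
        for _event, elem in ET.iterparse(fh, events=("end",)):
            local_name = elem.tag.rsplit("}", 1)[-1]  # drop "{namespace}" prefix
            counts[local_name] += 1
            elem.clear()  # discard the element's children once it is counted
    return counts
```

Calling `count_oxl_elements("kg_final.oxl").most_common(10)` would list the ten most frequent element names. A parse error, or unexpectedly low counts for the elements that represent concepts and relations, would suggest that the build went wrong before KnetMiner ever loaded the file.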

KeywanHP commented 3 years ago

The problem is a taxid mismatch between what you have set in your workflow.xml (fastagff parser --> taxid:4113) and what is defined in the knetminer poplar dataset (taxid:3694).

I think the poplar taxid is 3694, so it would be best to correct your workflow.xml and rebuild the KG. Also note that the poplar KnetMiner dataset is fairly old, and we may need to review the semantic motifs to ensure correctness, but it should be OK as a proof of concept.
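A mismatch like this can be caught before rebuilding with a small script. The following is only a sketch: it assumes the taxid is stored as the text of an element whose `name` attribute contains "taxid", and the XML fragment in the example is hypothetical, not the real workflow.xml schema. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def find_taxid_args(workflow_xml_text):
    """Return (name attribute, text) for every element whose name mentions 'taxid'."""
    root = ET.fromstring(workflow_xml_text)
    found = []
    for elem in root.iter():
        name = elem.get("name", "")
        if "taxid" in name.lower():
            found.append((name, (elem.text or "").strip()))
    return found


def mismatched_taxids(workflow_xml_text, expected_taxid):
    """List the taxid values in the workflow that differ from the expected one."""
    return [
        (name, value)
        for name, value in find_taxid_args(workflow_xml_text)
        if value != expected_taxid
    ]


# Hypothetical fragment, for illustration only (not the actual workflow schema).
example = (
    '<Workflow><Parser name="fastagff">'
    '<Arg name="TaxId">4113</Arg>'
    "</Parser></Workflow>"
)
print(mismatched_taxids(example, "3694"))  # prints [('TaxId', '4113')]
```

An empty list would mean every taxid found in the workflow agrees with the one the KnetMiner dataset expects.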

man4ish commented 3 years ago

Can you please explain the fields in compara.txt, as there is no header info?

First file from tutorial_data/compara.txt (https://knetminer.com/tutorial/knetbuilder/tutorial-data.zip)

ATMG00030 ATMG00030.1 arabidopsis_thaliana 58.8785 ortholog_one2one PGSC0003DMG400019855 PGSC0003DMT400051118 solanum_tuberosum 42.2819 NULL NULL NULL 0.00 0 114308117

Also, I found that the above format is different from what is explained here: https://github.com/Rothamsted/knetbuilder/wiki/Building-Knowledge-Networks#ensembl-compara-data

KeywanHP commented 3 years ago

I have updated the wiki to match the tutorial data. The compara-config.xml file describes the columns that are parsed and transformed into a (Protein)-[:ortho]->(Protein) graph. Which species are you interested in building a knowledge graph for? We may be able to help.
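For orientation, the 15 fields of the sample line can be given names. The column names below are an assumption: they follow the header used in Ensembl Compara homology dumps as far as I know, and nothing in this thread confirms them; compara-config.xml remains the authoritative description of what is actually parsed. The sketch splits on whitespace, which would misalign columns if a field were ever empty.

```python
# Assumed column names (from the Ensembl Compara homology dump header);
# not confirmed by the tutorial data, which ships without a header.
COMPARA_COLUMNS = [
    "gene_stable_id",
    "protein_stable_id",
    "species",
    "identity",
    "homology_type",
    "homology_gene_stable_id",
    "homology_protein_stable_id",
    "homology_species",
    "homology_identity",
    "dn",
    "ds",
    "goc_score",
    "wga_coverage",
    "is_high_confidence",
    "homology_id",
]


def parse_compara_line(line):
    """Split one compara.txt line and label each field with its assumed name."""
    fields = line.split()
    if len(fields) != len(COMPARA_COLUMNS):
        raise ValueError(
            f"expected {len(COMPARA_COLUMNS)} fields, got {len(fields)}"
        )
    return dict(zip(COMPARA_COLUMNS, fields))


# The sample line quoted earlier in this thread.
row = parse_compara_line(
    "ATMG00030 ATMG00030.1 arabidopsis_thaliana 58.8785 ortholog_one2one "
    "PGSC0003DMG400019855 PGSC0003DMT400051118 solanum_tuberosum 42.2819 "
    "NULL NULL NULL 0.00 0 114308117"
)
print(row["identity"], row["homology_identity"])  # prints 58.8785 42.2819
```

Under this reading, columns 4 and 9 are the percent identity values for the two proteins in the pair, and columns 2 and 7 are the protein identifiers that would form the two ends of the ortholog relation.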

man4ish commented 3 years ago

I am interested in building a knowledge graph for Populus trichocarpa v3.1 from Phytozome. Right now we are preparing the data.

man4ish commented 3 years ago

May I know what the criteria are for choosing orthologs? Is it based on % identity (col 4 or col 9)?

Above is a plot of % identity (column 4) for compara.txt (potato data), but the distribution shows no clear cutoff for predicting orthologs.

KeywanHP commented 3 years ago

The compara data comes from a sophisticated Ensembl pipeline which is beyond my expertise. But my understanding is that all relations in the compara file are predicted orthologs (i.e., there is no need for further filtering). The sequence identity is something we show to the user as additional evidence, but we don't use it as a filter or in our KnetScore.

Were you able to retrieve similar homology data for poplar from Phytozome? We could have a call to discuss your project requirements if you like.

man4ish commented 3 years ago

It would be great to have a Zoom meeting to discuss our project goals and the bottlenecks we are facing. Please let us know what time works for you and where to send the Zoom meeting details. My email is: mkumar10@utk.edu

marco-brandizi commented 3 years ago

I'm closing this; let's use the other channels to keep discussing the developments mentioned.