CollectiWise / owlab

The first repository that we will track with CollectiWise

How to do the initial data transfer of the base data (the first database, before users add more) #8

Open jac2130 opened 4 years ago

jac2130 commented 4 years ago

Here I will make some suggestions and statements about what definitely should be the case for the first issue in integrating the two major parts of CollectiWise with each other. @romanradin There are two different elements to the integration between the ConceptNet part of CollectiWise and the OpenCog part: the initial data before there are any users, and then the part where users enter new data, which is really different!

For the first part there isn't any need for Implications; only Concepts, Relations, Relationships, Predicates and Attributes, and it should be in this order. The second part, when users are adding data, has to be in real time because users must get immediate feedback on what they are doing. One of the most relevant functions is lookup_cases, but also the fact that when a user makes a new statement the points for making that statement must be immediately deducted, and points are handled within the OpenCog part of the system.
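To make the real-time requirement concrete, here is a minimal sketch of what that path could look like; lookup_cases is the only name taken from this issue, and the view, deduct_points and the request fields are hypothetical:

```python
# Rough sketch only (assumed names, not our actual API): the user submits a
# statement, the points are deducted right away on the OpenCog side, and
# lookup_cases supplies the immediate feedback.
from django.http import JsonResponse

def deduct_points(user_id, cost):
    """Placeholder for the point handling that lives in the OpenCog part."""
    return 100 - cost

def lookup_cases(statement):
    """Placeholder for the lookup_cases function mentioned above."""
    return []

def submit_statement(request):
    statement = request.POST["statement"]
    remaining = deduct_points(request.POST["user_id"], cost=1)
    return JsonResponse({"remaining_points": remaining, "cases": lookup_cases(statement)})
```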

Computer scientists call the initial data the Base Ontology, so I will refer to the initial data as the base ontology and to what the users put in as user-defined statements. I hope that this won't cause confusion!

It is also very important that the Base Ontology can be a different data set, such as this one, which interests the UN, or another completely different one, for example this one; the tables are going to be the same, though.

So for sure, everything should be done on Google Cloud, and we should use whatever Google Cloud services are best suited for each task. The first task is to run the scripts on some Google service that is built specifically for transferring massive data (note that the base ontology we are using is small compared to what it potentially could be!) from a relational database to the AtomSpace via our API. This should definitely be done using the most cutting-edge tools that Google provides, not something that works but takes forever; that is precisely the point of using Google services, and we must make use of their best stuff!

To begin with, both APIs should be in Docker containers, and they must definitely be on a Kubernetes orchestration system; that was part of Scope 1 of this project and it will most definitely be necessary for the system to be scalable, which it must be. No serious customer like the UN, which I'm trying to get, will forgive it not being scalable, being slow in any way, or failing because it isn't fault tolerant; this is an absolute must for this particular system.

Then we need to use some method that Google has built to do the first integration task, which is the baseline ontology transfer from the ConceptNet database, which should be in Cloud Spanner, to the AtomSpace, which should also be persisted in Cloud Spanner (Cloud Spanner is definitely the right database service; I've studied this issue for some time now and it is optimized for the use cases that we need). We definitely need to run these scripts in parallel, and I think, but I'm not sure, that PySpark and Dataproc are the right solution. Another possible solution involves Dataflow ...there might be other Google Cloud services that are better suited for this task, but it must be built for massive data transfers using parallel API calls against the API on Kubernetes.
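To make the parallel-transfer idea a bit more concrete, here is a very rough PySpark sketch of the kind of Dataproc job I have in mind; the JDBC URL, the table name and the API endpoint are placeholders, not decided pieces of the system:

```python
# Rough PySpark sketch (one possible Dataproc approach, not a decided design):
# read a base-ontology table from the relational database over JDBC and push
# each partition to the AtomSpace API in parallel.
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("base-ontology-transfer").getOrCreate()

concepts = (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://<host>/<conceptnet-db>")  # placeholder URL
            .option("dbtable", "concepts")                              # assumed table name
            .option("user", "<user>").option("password", "<password>")
            .load())

def push_partition(rows):
    # One API call per row; each Spark executor works through its own partition.
    for row in rows:
        requests.post("https://<atomspace-api>/concepts", json=row.asDict())  # placeholder endpoint

concepts.foreachPartition(push_partition)
```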

romanradin commented 4 years ago

will discuss this with the dev a little later

jac2130 commented 4 years ago

This might be relevant!

jac2130 commented 4 years ago

Also this: https://github.com/gregsramblings/google-cloud-4-words/blob/master/DarkBrochure.pdf

romanradin commented 4 years ago

@jac2130 our initial plan was to use the native Django tools as an interlayer between the SQL DB and the mechanism responsible for sending queries. These tools are very cool and powerful, and they are optimized to work with big databases. If we use the Django tools, all we will need is a usual virtual machine with a powerful processor.

Now we're told to use clusters. Google clusters use the Hadoop system, which doesn't interact with the operating system; you interact with Hadoop only using tools such as Apache Spark. Apache Spark is a big framework for working with big data. Hadoop and Apache Spark are not tools that our developer uses every day, which is why he needs time to learn them. So the situation is: the functions that we created before won't work with clusters, so they would need to be created almost from scratch.
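For illustration, the Django-tools variant would look roughly like this (just a sketch; the model and the endpoint are assumed names, not our actual code):

```python
# Sketch of the Django / virtual-machine variant (assumed names): stream the
# rows of one ConceptNet table through the ORM and send them to the AtomSpace
# API one by one.
import requests
from myapp.models import Concept  # assumed model mapped onto the ConceptNet table

def transfer_concepts(batch_size=1000):
    # iterator() streams rows in chunks so a big table doesn't fill memory.
    for concept in Concept.objects.all().iterator(chunk_size=batch_size):
        requests.post(
            "https://<atomspace-api>/concepts",  # placeholder endpoint
            json={"name": concept.name, "uri": concept.uri},
        )
```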

Question 1: is it really worth using clusters for a one-time job?

To use the virtual machine variant we just need 3 things:

  • set up AtomSpace

  • set up a virtual machine in the Google Cloud environment

  • run it and wait until its job is done.

Question 2: why not deal with this AtomSpace problem that we've been having for 2-3 weeks?

romanradin commented 4 years ago

For the first part there isn't any need for Implications; only Concepts, Relations, Relationships, Predicates and Attributes and it should be in this order.

We have the functions to move this data already, but if you insist on using clusters then we have to create them almost from scratch.

romanradin commented 4 years ago

The second part, when users are adding data, has to be in real time because users must get immediate feedback on what they are doing. One of the most relevant functions is lookup_cases, but also the fact that when a user makes a new statement the points for making that statement must be immediately deducted, and points are handled within the OpenCog part of the system.

This still can't be done, as we have problems with OpenCog.

jac2130 commented 4 years ago

Question 1: is it really worth using clusters for a one-time job?

Well, one of the important features is the merging of ontologies, as in this tool, named pronto, but also with JSON-LD, not just with OWL, OBO and XML, with bigger data, and including OpenCog.

This means that it really isn't a one-time thing: it must be possible to merge ontologies (meaning that it must be possible to feed in more big data later on), and this ought to go pretty fast. If you play around a bit with pronto you'll see what I mean.
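Just to illustrate the kind of merging I mean, here is a toy sketch with pronto (the file names are placeholders, and a real merge would have to reconcile relationships too, not just term IDs):

```python
# Toy sketch of combining two ontologies loaded with pronto (placeholder
# files); here "merging" just means taking the union of the terms by ID.
import pronto

base = pronto.Ontology("base_ontology.obo")
extra = pronto.Ontology("extra_ontology.obo")

combined = {term.id: term for term in base.terms()}
combined.update({term.id: term for term in extra.terms()})
print(f"{len(combined)} distinct terms after the merge")
```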

Remember that the problem is that it will take 25,000 hours (maybe a bit less with Google Compute Engine, but not much less) in the way that you've built it? That would be a serious problem.

The other thing I can do is that you make it work now with Django somehow and Google Compute Engine, and then I later hire someone to rewrite everything in PySpark. Or I do it myself with Scala Spark, which is actually better (because Spark is native to Scala and Scala is fully functional), but no one else knows Scala; in that case I'll just follow your scripts and copy them into Scala Spark, which is what I did at eBay anyway. So I could do that. But then your solution must somehow run faster than 25,000 hours?

To use the virtual machine variant we just need 3 things:

  • set up AtomSpace

  • set up a virtual machine in the Google Cloud environment

  • run it and wait until its job is done.

Question 2: why not deal with this AtomSpace problem that we've been having for 2-3 weeks?

I haven't had a chance yet because I'm trying to get as much money as possible, as fast as possible. I have applied for a loan, for which I need to do a lot of work on budget forecasting etc., and that is still not done. I have already applied for the grant, but I won't hear anything back on that until sometime in May, and I'm trying to get money faster than that, like within two or three weeks' time ...so I still have work to do there. Unfortunately I don't have anyone doing any of that for me yet. There are also some contracts that I'd like to get (one with the UN and some others with the US government), but for those I need time and also a demonstrable system ...but first I need time to call them a bunch of times, etc. So in short, I haven't had the time to solve that AtomSpace problem and I don't yet have the cash to have someone else solve it. For some reason I haven't heard anything back from Nil and the OpenCog people in a while either.

romanradin commented 4 years ago

Remember that the problem is that it will take 25000 hours in the way that you've built it?

Seems like you didn't understand us. 2,500 hours was the approximate ETA for the case in which we use a local laptop instead of cloud services. Our current idea is to use a virtual machine in the Google Cloud environment.

jac2130 commented 4 years ago

Remember that the problem is that it will take 25000 hours in the way that you've built it?

Seems like you didn't understand us. 2,500 hours was the approximate ETA for the case in which we use a local laptop instead of cloud services. Our current idea is to use a virtual machine in the Google Cloud environment.

OK can you tell me how long that would take, approximately?

romanradin commented 4 years ago

OK can you tell me how long that would take, approximately?

The scheme that includes Google Compute Engine plus our Python scripts will finish this data transfer job in no more than a week. This is a very approximate ETA.

romanradin commented 4 years ago

I haven't had the time to solve that AtomSpace problem and I don't yet have the cash to have someone else solve it.

Anyway, you have to remember that your system won't be a fully working system without a fully working AtomSpace.

jac2130 commented 4 years ago

A week is still pretty long, but sure ...let's try this! In the meantime, could you send me those scripts? I'll play around with them whenever I have time and see whether it would take me very long to rewrite them in Spark.

romanradin commented 4 years ago

ok. will do

romanradin commented 4 years ago

so, this becomes a priority: [screenshot]

since you decided that we can work according to our scheme

jac2130 commented 4 years ago

Yes indeed ...if it won't take me too long to rewrite your script, it might make sense for me to do those two things in one go, though.

jac2130 commented 4 years ago

it'd be better to run this all in DataProc and see if we can do this quickly (run the whole thing in under one hour)

romanradin commented 4 years ago

it'd be better to run this all in DataProc and see if we can do this quickly (run the whole thing in under one hour)

have you seen this? [screenshot]

so, you told us that we won't use clusters for this job

jac2130 commented 4 years ago

Yes, but if I could do it quickly, maybe I should at least try real quick ...if it looks prohibitively complex then I agree that we should do it this way first.

jac2130 commented 4 years ago

If it takes one week to run, then maybe spending 5 hours on it to get it to run in under one hour is worth it?

jac2130 commented 4 years ago

...just a thought

romanradin commented 4 years ago

let me remind you what the current AtomSpace problem is:

jac2130 commented 4 years ago

One week means that I can't demo it that quickly ...I'd like to show the working system to some people

jac2130 commented 4 years ago

One week means that I can't demo it that quickly ...I'd like to show the working system to some people

Definitely no matter how we run this thing, we must first solve the AtomSpace issue! I agree with you on that! 100%

romanradin commented 4 years ago

remaking our functions to work with clusters may take more than one week

jac2130 commented 4 years ago

All I'm saying is that if it is super easy for me to rewrite the code in Spark I should do both, fix the AtomSpace issue AND rewrite the thing

jac2130 commented 4 years ago

and then run the thing

jac2130 commented 4 years ago

remaking our functions to work with clusters may take more than one week

Maybe, and if it looks that way I will not do it now; but I'm pretty fast at Scala coding.

romanradin commented 4 years ago

https://www.dropbox.com/s/li4gxihdh8lp6bs/commands.rar?dl=0 here are our functions

jac2130 commented 4 years ago

Thank you! OK, so today I have a call at 3 PM about that loan; depending on how much work I have to do on it, I'll start working on the AtomSpace issue right away ...then I'll check quickly whether it looks like a lot or a little work to rewrite this code in Scala, and I'll let you know. In the meantime, it would help to work on the real-time user interactions with the OpenCog system, and for that, regular Python (Django) code will definitely work; that way we can work in parallel?

jac2130 commented 4 years ago

Maybe it would help to discuss the next step, which involves users calling the API from the front-end?

romanradin commented 4 years ago

Maybe it would help to discuss the next step, which involves users calling the API from the front-end?

It's possible that something may break on our side if we try to do the next steps before the interaction between the scripts and AtomSpace works well.

jac2130 commented 4 years ago

From our Slack discussion, I don't understand this:

  1. Run such command using the previously copied path to the container https://prnt.sc/rva6ag

Does it have to do with attaching volumes? How do you connect the container to a database, and what database? I know that when you set things up you have to reach out of the container to a persistent database; where are you doing that and how? Note: I'm not using this container (https://prnt.sc/rva6ag); I'm using the one I made and that I know worked, and then I'm cloning the owlwise git repo. I'm following the last of the options you told me about: using the image that I've built myself, so that I can go back and rebuild and I know it works (gcr.io/collectiwise/flask-oc@sha256:24be2ed9ad1bb11e846894c3a1c7bcd51e1cb317313932c13bf53b8d632f603c):

Clone the repo to the environment with a working OpenCog system and run these commands:

    cd owlwise
    pip install -r requirements.txt
    python manage.py runserver

So, I've tried to reproduce the database Scheme error and I didn't get that error at all. But I don't know how to probe this thing ...I can't tell whether it is attached to a persistent database or not.

[screenshot: DatabaseAttachingNoErrors]
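In case it helps, this is the generic way I would probe which database the Django project is actually attached to (nothing owlwise-specific, just standard Django):

```python
# Run inside `python manage.py shell`: print the configured database and make
# one trivial query, which fails loudly if no database is actually reachable.
from django.conf import settings
from django.db import connection

print(settings.DATABASES["default"])
with connection.cursor() as cursor:
    cursor.execute("SELECT 1")
    print(cursor.fetchone())
```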

jac2130 commented 4 years ago

Do I need to set up a volume ...how do I connect this sort of database to the docker container?

[screenshot: CloudSpanner]

jac2130 commented 4 years ago

I don't really know how this Django server works or how to probe it to see if I can get the same error you were getting: [screenshot: djangoManage]

jac2130 commented 4 years ago

So now I'm getting this error, which I shall eliminate since it happens in my code: [screenshot: alphabet_cycle_error]

jac2130 commented 4 years ago

So now it's been on the following screen for some time! [screenshot: waitingToConnect]

jac2130 commented 4 years ago

Okay so now I have un-commented the problem code and I still didn't get the error; I got some other strange behavior: [screenshot: RelationExistsSkipping]

jac2130 commented 4 years ago

@romanradin do you see the last message above? What am I not doing? I'm not getting the error that you got!

romanradin commented 4 years ago

Okay so now I have un-commented the problem code and I still didn't get the error; I got some other strange behavior:

The key issue here is that the OpenCog functions don't let the Django thread start.

romanradin commented 4 years ago

Okay so now I have un-commented the problem code and I still didn't get the error; I got some other strange behavior:

@jac2130 the key issue here is that the OpenCog functions don't let the Django thread start: after the things we see in your last screenshot we should see some Django logs, but we don't see them.

jac2130 commented 4 years ago

So how do I see the error message for that, @romanradin? I need to see where and how it fails so that I can go to the offending code and fix the problem! This should not fail quietly ...there must be an informative error message?

jac2130 commented 4 years ago

OK, so there is interference between Django and the SQL done in Scheme. What database are we trying to connect to? And are we trying to connect to the database using Django?! That would certainly make things slow! @romanradin

jac2130 commented 4 years ago

This is the picture we really need: basically, we can follow the example of the script that Vitaliy created in Scope 1 and mount a volume for OpenCog the same way it is done for ConceptNet in the Kubernetes code. Then we have a Kubernetes cluster with K worker nodes for the OpenCog API, which are all mounted to the same volume database, which should simply be a Google Cloud Spanner instance.
[diagram: CNtoOP]
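As a sketch of what that shared store looks like from each worker's point of view (the project, instance, database and table names here are placeholders):

```python
# Every OpenCog API worker in the cluster would use the same Cloud Spanner
# client configuration, so they all read and write one shared store.
from google.cloud import spanner

client = spanner.Client(project="collectiwise")                     # placeholder project id
database = client.instance("atomspace").database("base-ontology")  # placeholder names

with database.snapshot() as snapshot:
    rows = snapshot.execute_sql("SELECT COUNT(*) FROM Concepts")    # placeholder table
    print(list(rows))
```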

jac2130 commented 4 years ago

The most important part, @romanradin, is that a container should talk to the frontend and to the source data via a Django API, which is perfectly fine, but it should never talk to its sink database using Django! It should instead use volume mounting: https://docs.docker.com/storage/volumes/ ...this is very important and might be the reason the code is currently breaking ...there should be no problem with using SQL otherwise.

romanradin commented 4 years ago

@jac2130 I need to discuss all your comments from this card with Vitaliy & our backend developer

jac2130 commented 4 years ago

OK, sounds great; let's get the ball rolling!

jac2130 commented 4 years ago

Here is an old system diagram of how things should work:

[diagram: SystemDiagram]