Integrating Continuous Learning Approach to the A.I ecosystem.

reaganrewop commented 4 years ago

The objective of the CL is to continuously learn and improve the experience of the summaries based on the set of meetings that goes through a channel/group.

The Enrichment process happens for 2 components.

Channel Model.
Channel Mind.

Channel model contains files that are GPT related (helps to represent information based on the channel context) and Channel mind contains a set of files that keeps track of the importance of discussions that are discussed in the channel.

Each Channel has its own customized mind, whereas the mind can be common to more than one channel.

The Channel Mind contains the below artifacts

keyphrase entity graph: A structured representation of all the key phrases and entities that are important to the channel.
Entity Features: A dictionary of important entities and it's respective feature vector.
Entity community mapping: Discussions are clubbed together based on the entity's context.
Community global context ranking: Importance of community, based on all the meetings that have happened in the respective channel.
Community local context ranking: Importance of community, based on recent (5) meetings that have happened in the respective channel.

The Algorithm involves the below:

Extract groups from the meeting segments.
For each highlight, extract key phrases and extend it to entities using the keyphrase entity graph.
Using the extended entites, find the most similar entities w.r.t the highlights.
find which communities the entities belong to, for clubbing other entities based on similar discussion.
Find Community Agreeableness for a particular highlight based on the selected community mapping.
Check if the community is present in Local context ranking, if it does, use it's ranking else check in global context ranking. Local context ranking (recency) would always be given higher priority than the global context.
Update the local context, global context with freq of communities picked during that call and enrich keyphrase entity graph with entities and keyphrases.

To Do:

A new service needs to be created which would update all the artifacts after the respective call ends.
Segment Analyser service needs to be updated with gc, lc changes.
Managing S3 artifacts need to be done via API service. ie. copying artifacts when there is a new mind selected from a group.
Structure of segment analyser service response that is given to API service needs to be changed.
API service needs to call new lambda service to handle mind artifacts updation.

reaganrewop commented 4 years ago

There are 2 changes that need to be done from @etherlabsio/platform

Managing Mind Artifacts. Triggering the Artifacts_updater lambda after it received a response from segment_analyser lambda.

+ Managing Mind Artifacts.

As explained above, the Mind Artifacts gets updated after every call with respect to their group (context_id) and the mind that was selected (mind_id). When a new call is created with a particular mind attached, A.P.I service will check mind artifacts are already present for that particular combination of context_id and mind_id. If it is not present, it has to copy all the necessary default artifacts to the new location.

S3 mind artifacts location: artifacts/{env}/contexts/{context_id}/{mind_id}/ S3 default artifacts location: artifacts/{env}/minds/{mind_id}/ (Have to copy everything except model.bin)

This check needs to happen for every call (or, for every new mind selected in an existing group. )

+ Triggering the Artifacts_updater lambda

The new Artifacts_updater Lambda needs to be called with the following request format.

{ "Groups": Group, "mindId": String, "contextId": String }

Where Group is the response sent by the Segment_analyser service.

The response from the Artifacts_updater lambda will be Success/Failed.

karthikmuralidharan commented 4 years ago

@reaganrewop Just to clarify:

New Context Created (New Group Created)

User creates a new meeting
User specifies a domain mind id
The request is sent to the back-end.
the back-end needs to check on-demand if the context_id + mind_id combo has a path in s3 and copy the files before the first meeting starts ?
How big in size would the combined set of artifacts be during copy ?

Question on this approach:

Are we going to write to the S3 objects once per summary generation ? Or would it be continuously written ?
Based on the structure you've mentioned, we will be creating a new mind_id when the copy is done correct ?

S3 mind artifacts location: artifacts/{env}/contexts/{context_id}/{mind_id}/
S3 default artifacts location: artifacts/{env}/minds/{mind_id}/ (Have to copy everything except model.bin)

eg: we'd have to create a new mind_id in artifacts/{env}/minds to avoid clashes.

If a new entity is created, should it be linked with the parent domain mind ? @shashankpr I think this information might be needed to make useful connections in the graph in the future ?
What is the artifact_updater exactly doing here ? The naming is vague a little bit.
@shashankpr @reaganrewop, we'll be dynamically creating lambda configurations with the new mind_id as the identifier and passing the new mind_id that was created. Is that acceptable ?

vdpappu commented 4 years ago

Few clarifications:

Are we going to write to the S3 objects once per summary generation? Or would it be continuously written?

S3 artefacts (located in ././{context_id}/{mind_id}) are updated after every summary generation.

we'd have to create a new mind_id in artifacts/{env}/minds to avoid clashes.

@reaganrewop can you confirm this. Per our discussion, contextid/mind_id is the unique path and we need not create new mind_ids unless we have a new domain?

If a new entity is created, should it be linked with the parent domain mind?

New entities are specific to the context and shouldn't be populated back to the parent domain. For example, New entity DeLorean from Ether group shouldn't be populated back to Software Engineering Mind

I think this information might be needed to make useful connections in the graph in the future?

Regarding new entity, nothing changes in the current dgraph population approach. @shashankpr can confirm

What is the artifact_updater exactly doing here?

We can name it mind_updater to be specific- This service updates artefacts in the {context_id}/{mind_id} path for continuous learning after every Ether Call

We'll be dynamically creating lambda configurations with the new mind_id as the identifier and passing the new mind_id that was created. Is that acceptable?

Is this about the feature extractor lambda? If so, Per the current approach, we don't update the models - ie the model associated with each context is just a copy of the domain model and will not be updated. We are considering model updation for later stages. If we want to keep that ready, we can have lambda configuration one per each mind_id created.

@reaganrewop @shashankpr please add if I have missed any.

shashankpr commented 4 years ago

If a new entity is created, should it be linked with the parent domain mind ? @shashankpr I think this information might be needed to make useful connections in the graph in the future ?

This is a useful information but as @vdpappu mentioned the new entity is specific to the domain mind and may/may not be a lingo in the parent mind. In either cases, for now, no changes are required from the dgraph side. Eventually, if a new entity is created in the domain mind and if it gets reinforced with continuous learning, this new entity will start appearing in the summaries. When they appear in summaries, they will get populated to the dgraph (which has association of minds to parent mind). So, indirectly, we will have this information.

shashankpr commented 4 years ago

Also, this entire process for CL sounds as a good case for step functions. Maybe using it will give us a better control/view of the workflow. Just a thought...

vdpappu commented 4 years ago

Also, this entire process for CL sounds as a good case for step functions. Maybe using it will give us a better control/view of the workflow. Just a thought...

With two lambdas, Summary generator and mind_updater in sequence? We should be able to route the output from the first lambda to the platform and invoke the mind_updater in parallel.

shashankpr commented 4 years ago

We can do all of that using step functions. We can specify run-state as "ParallelExecution" or "Sequential" as required

vdpappu commented 4 years ago

It's sequential but the output from the first one should be redirected to API along with the second lambda. It should be possible. On the flip side, why we thought API should invoke the mind_update lambda is that it can listen to any failures. With step function, there is no such listener by default.

reaganrewop commented 4 years ago

the back-end needs to check on-demand if the context_id + mind_id combo has a path in s3 and copy the files before the first meeting starts ?

The meeting can still go on @karthikmuralidharan , Ideally, we expect the mind artifacts to be present before the end of the call, more specifically, before the A.P.I service issues a request to the segment_analyser service for computing groups.

How big in size would the combined set of artifacts be during copy ?

There are 2 files which vary around 30-50 Mb each, other than that, all of them are in Kbs'. Therefore, the total size might be around 100 Mb.

Are we going to write to the S3 objects once per summary generation ? Or would it be continuously written ?

For every meeting, we would upload and update the artifacts, once or twice at the maximum after the end of the call.

Based on the structure you've mentioned, we will be creating a new mind_id when the copy is done correct ?

No, @karthikmuralidharan . As @vdpappu mentioned, Initially, We aren't diverging from the parent mind and we don't have any pipeline right now that takes the decision to automatically train the model based on the data. Therefore the Model artifacts won't be copied, hence no need to create a new mindId.

For ex: If there is a new group created and have selected S.E mind, The following would be things we would do.

Check if the context_id is present in the S3 (in the below location.), if it is not present, create context_id and mind_id in the following location. artifacts/{env}/contexts/{context_id}/ and artifacts/{env}/contexts/{context_id}/{mind_id}
If the context_id is already present, Check for the presence of the mind_id in the following location. artifacts/{env}/contexts/{context_id}/{mind_id}
If it's not present, use the same mind_id which the user attached to the call and create the s3 location. Then copy all the files from the below S3 location which is the default domain mind location to the new location, which was described above. artifacts/{env}/minds/{mind_id}/ to artifacts/{env}/contexts/{context_id}/{mind_id}/ Note: No Change in MindId

What is the artifact_updater exactly doing here ? The naming is vague a little bit.

The artifacts, in this case, are the mind files. The task of artifacts_updater is to enrich/update all the artifacts based on the meeting information and the group information which was returned by Segment_analyser service.

@shashankpr @reaganrewop, we'll be dynamically creating lambda configurations with the new mind_id as the identifier and passing the new mind_id that was created. Is that acceptable ?

Can you explain a bit more on this @karthikmuralidharan . The back-end service would create (or update) a lambda for every request or for every new meeting call? Because we have to consider the warm-up time of these and also the model loading time. A warmed-up lambda can re-use the model without downloading it.

karthikmuralidharan commented 4 years ago

The artifacts, in this case, are the mind files. The task of artifacts_updater is to enrich/update all the artifacts based on the meeting information and the group information which was returned by Segment_analyser service.

In that case, mind_enricher would be an appropriate name. Artifact updating is simply the side effect. because of the design.

Note: No Change in MindId The same mind_id should not behave differently under different contexts. A mind_id in isolation should behave more predictably for the same input at a particular point in time. But with the approach you mention, that is not being guaranteed.

For the platform, the definition of mind is different. It's supposed to be a unique set of parameters (model_id + memory_id). So when we detect that the domain mind has diverged, we'll create a new mind_id internally, adding a connection to its parent domain mind_id and marking it as derived. Having a unique identifier per combination allows for certain optimizations I'll mention below.

Can you explain a bit more on this @karthikmuralidharan . The back-end service would create (or update) a lambda for every request or for every new meeting call? Because we have to consider the warm-up time of these and also the model loading time. A warmed-up lambda can re-use the model without downloading it.

A lambda config much like mind is immutable. We cannot per request change the same lambda identifier's configuration as they are designed to be stateless.

say, domain_mind SE has an id called A, we create a corresponding lambda for that called A.

Lambda A is associated with the SE model, SE memory files and other artifacts.

If there are two meetings for different contexts but SE mind happening simultaneously, we cannot alter the lambda configuration per request level. It's the same analogy as the same class instance being called by two different threads without using a shared mutex.

Instead, if we created a new mind_id say D for the context_id, we can create a new lambda function called D, that takes the configuration of the parent_mind and updates the env vars of certain paths but still have it point to parent's model artifact location.

So meeting 1 with ctx1 will call a lambda1 while mtg2 with ctx 2 will call lambda2.

reaganrewop commented 4 years ago

@karthikmuralidharan The files that need to be copied to the new S3 location are:-

entity_community_map.pkl
kp_entity_graph.pkl
entity.pkl
gc.pkl
lc.pkl
label_dict.pkl
sent_dict.pkl

Also, the new S3 location which we agreed on is Bucket: io.etherlabs.{env}.contexts path: {context_id}/minds/{mind_id}/

karthikmuralidharan commented 4 years ago

@etherlabsio/ml we have started work on this. @reaganrewop could you work with @trishanth on setting up the enrich_mind lambda?

reaganrewop commented 4 years ago

sure, @karthikmuralidharan .

etherlabsio / ai-engine

Integrating Continuous Learning Approach to the A.I ecosystem. #187