ctrl-space-labs / gendox-core

Gendox: "Generate. Train. Evolve."
GNU Affero General Public License v3.0

When the training job is executed, do not create embeddings for already-trained sections that have not been updated. #351

Open myrtp opened 2 weeks ago

myrtp commented 2 weeks ago

Description

When re-triggering the jobs for a previous period, don't request embeddings for sections that already have them, unless stated otherwise.

Hints

Add a DocumentInstanceSectionCriteria field and a corresponding Predicate function that selects only updated sections with no embedding. An embedding could exist, but for a different or previous version of the section. E.g.:

Select sections with subquery:

select *
from gendox_core.document_instance_sections dis
where dis.id in (select dis1.id
              from gendox_core.document_instance_sections dis1
                       inner join gendox_core.document_instance di on di.id = dis1.document_instance_id
                       inner join gendox_core.project_documents pd on di.id = pd.document_id
                       inner join gendox_core.project_agent pa on pd.project_id = pa.project_id
                       inner join gendox_core.embedding_group eg on dis1.id = eg.section_id
              where pa.semantic_search_model_id = eg.semantic_search_model_id and
                  dis1.updated_at > eg.updated_at);

The criterion needs to be added to DocumentInstanceSectionPredicates.

For an example of how to add a subquery to a predicate, see DocumentInstanceSectionPredicates#project. When added to the query, it selects all sections of projects that have autoTraining=true.

Similarly, we want a subquery that selects all sections whose embedding is older than the respective section (updated_at).
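
To make the intended selection rule concrete, here is a minimal in-memory sketch in plain Java. It is not the QueryDSL predicate itself: the `StaleSectionFilter`, `Section`, and `EmbeddingGroup` classes are hypothetical stand-ins for rows of `document_instance_sections` and `embedding_group`, and only mirror the join and comparison from the SQL above.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class StaleSectionFilter {

    // Hypothetical stand-in for a row of document_instance_sections.
    static final class Section {
        final String id;
        final Instant updatedAt;

        Section(String id, Instant updatedAt) {
            this.id = id;
            this.updatedAt = updatedAt;
        }
    }

    // Hypothetical stand-in for a row of embedding_group.
    static final class EmbeddingGroup {
        final String sectionId;
        final String semanticSearchModelId;
        final Instant updatedAt;

        EmbeddingGroup(String sectionId, String semanticSearchModelId, Instant updatedAt) {
            this.sectionId = sectionId;
            this.semanticSearchModelId = semanticSearchModelId;
            this.updatedAt = updatedAt;
        }
    }

    // Returns ids of sections that have an embedding for the given model
    // which is older than the section itself (section.updatedAt > eg.updatedAt).
    static List<String> staleSectionIds(List<Section> sections,
                                        List<EmbeddingGroup> groups,
                                        String modelId) {
        List<String> result = new ArrayList<>();
        for (Section section : sections) {
            for (EmbeddingGroup eg : groups) {
                boolean sameSection = eg.sectionId.equals(section.id);
                boolean sameModel = eg.semanticSearchModelId.equals(modelId);
                if (sameSection && sameModel && section.updatedAt.isAfter(eg.updatedAt)) {
                    result.add(section.id);
                    break; // one stale embedding is enough to select the section
                }
            }
        }
        return result;
    }
}
```

Like the inner join on `embedding_group` in the SQL, this sketch only returns sections that already have an embedding group; sections with no embedding at all would need a separate condition.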

myrtp commented 1 week ago

This will be implemented by adding SHA-256 hashes to the document and the embedding group.
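
A rough sketch of how such a hash comparison could work, using only the JDK's `MessageDigest`. The class and method names (`SectionHashCheck`, `sha256Hex`, `needsEmbedding`) and the idea of storing the hash as a hex string are assumptions for illustration, not the project's actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SectionHashCheck {

    // Hex-encoded SHA-256 digest of the given text, encoded as UTF-8.
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Hypothetical check: storedHash is the hash saved with the embedding group,
    // or null if the section has never been embedded.
    static boolean needsEmbedding(String sectionText, String storedHash) {
        return storedHash == null || !storedHash.equals(sha256Hex(sectionText));
    }
}
```

Comparing content hashes rather than timestamps means a section whose `updated_at` changed but whose text did not would not be re-embedded.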