I could be doing something wrong, but I've come across what appears to be a bug in the initialisation of the generate_contexts method.
Each time it is called, contexts (and the other accumulators) are re-initialised to []. I believe self.total_chunks should be reset to zero at the same point, because otherwise the "self.total_chunks += num_chunks" executed at the end of the routine keeps adding num_chunks, so the reported total chunk count grows even though I'm simply resampling the same document content.
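For illustration, here is a minimal sketch of where I'd expect that reset to sit; apart from generate_contexts, total_chunks and num_context_per_document, the names below (e.g. _chunk_documents) are purely hypothetical and not the actual implementation:
def generate_contexts(self, num_context_per_document):
    contexts, source_files, context_scores = [], [], []  # existing re-initialisation of the accumulators
    self.total_chunks = 0  # proposed fix: reset the running chunk total here as well
    num_chunks = self._chunk_documents()  # hypothetical placeholder for the real chunking/sampling step
    # ... context construction happens here ...
    self.total_chunks += num_chunks  # existing line; would now only count this call's chunks
    return contexts, source_files, context_scores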
Here is my code:
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import FiltrationConfig, ContextConstructionConfig
# Generate the default set of goldens (ie randomly from the full set of evolution types)
filtration_config = FiltrationConfig(critic_model="gpt-4o-mini")
synthesizer = Synthesizer(model="gpt-4o-mini", filtration_config=filtration_config)
context_construction_config = ContextConstructionConfig(critic_model="gpt-4o-mini")
synthesizer.generate_goldens_from_docs(
    document_paths=['./data/datafile.txt'],
    context_construction_config=context_construction_config  # params to customise quality of contexts constructed from documents
)
[FYI this reported "Utilizing 3 out of 17 chunks."]
# Now add another 10 REASONING questions randomly generated from the available contexts
# Force creation of a REASONING question type
from deepeval.synthesizer.config import EvolutionConfig
from deepeval.synthesizer.types import Evolution
evolution_config = EvolutionConfig(evolutions={Evolution.REASONING: 1})
synthesizer.evolution_config = evolution_config
# Also change to 10 contexts per document
context_construction_config.max_contexts_per_document = 10  # This will create 10 new questions since max_goldens_per_context will also be set to 1
# NB: *** Repeat this contexts extraction step if wanting a different randomly selected context from source doc,
# otherwise don't repeat and it will generate new question against SAME set of context(s). ***
# Generate the required contexts from already loaded "context_generator" content
from itertools import chain
synthesizer.context_generator.total_chunks = 0  # Reset context_generator.total_chunks (bug when generate_contexts called more than once for the same doc)
contexts, source_files, context_scores = (
    synthesizer.context_generator.generate_contexts(
        num_context_per_document=context_construction_config.max_contexts_per_document
    )
)
print(f"Utilizing {len(set(chain.from_iterable(contexts)))} out of {synthesizer.context_generator.total_chunks} chunks.")
Note my line "synthesizer.context_generator.total_chunks = 0".
[FYI this reported "Utilizing 10 out of 17 chunks.", but only because I'd reset synthesizer.context_generator.total_chunks to 0.
If I hadn't reset it, the second call would have added another 17 chunks to the running total and it would have reported "Utilizing 10 out of 34 chunks."]
# Make actual call for REASONING questions
# Now call the generation process and this automatically adds to existing set of synthesizer.synthetic_goldens
synthesizer.generate_goldens_from_contexts(
    max_goldens_per_context=1,  # default is 2
    contexts=contexts,
    source_files=source_files  # Optional (undocumented): this is just for returned data purposes in new goldens (ie gets included when saved)
)
And then, if I had wanted to sample say 4 new random chunks so I could generate 8 new CONCRETIZING questions
(2 questions per new chunk), it would have reported "Utilizing 4 out of 51 chunks." (17 + 17 + 17 accumulated across the three calls) instead of the "Utilizing 4 out of 17 chunks." that I get by including the total_chunks = 0 reset line in the code below:
evolution_config = EvolutionConfig(evolutions={Evolution.CONCRETIZING: 1})
synthesizer.evolution_config = evolution_config
context_construction_config.max_contexts_per_document = 4
synthesizer.context_generator.total_chunks = 0  # Reset context_generator.total_chunks (bug when generate_contexts called more than once for the same doc)
contexts, source_files, context_scores = (
    synthesizer.context_generator.generate_contexts(
        num_context_per_document=context_construction_config.max_contexts_per_document
    )
)
print(f"Utilizing {len(set(chain.from_iterable(contexts)))} out of {synthesizer.context_generator.total_chunks} chunks.")
# Make actual call for CONCRETIZING questions
# Now call the generation process and this automatically adds to existing set of synthesizer.synthetic_goldens
synthesizer.generate_goldens_from_contexts(
    max_goldens_per_context=2,  # default is 2
    contexts=contexts,
    source_files=source_files  # Optional (undocumented): this is just for returned data purposes in new goldens (ie gets included when saved)
)
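Until the reset happens inside generate_contexts itself, a small wrapper around the calls shown above keeps me from forgetting it between resampling passes (the helper name regenerate_contexts is my own, just for illustration):
def regenerate_contexts(synthesizer, num_contexts):
    # Workaround: clear the running chunk total before each resampling pass
    synthesizer.context_generator.total_chunks = 0
    return synthesizer.context_generator.generate_contexts(
        num_context_per_document=num_contexts
    )
# e.g. contexts, source_files, context_scores = regenerate_contexts(synthesizer, 4)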