Closed · luca-martial closed this 1 year ago
@alierenak here are some ideas discussed with @chakravarthik27
fix_proportions = Augmentation(test_results=h.report())

# this is the rule-based logic happening in the background
# example: add_context -> 6% pass rate | min_pass_rate 65% -> FALSE
if pass_rate / min_pass_rate > 1:
    return None
elif pass_rate / min_pass_rate > 0.9:
    return 0.05
elif pass_rate / min_pass_rate > 0.8:
    return 0.1
elif pass_rate / min_pass_rate > 0.7:
    return 0.2
else:
    return 0.3
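The rule above can be wrapped into a small runnable function. This is a sketch of the background logic as described; the name `proportion_increase` is mine, not the library's:

```python
def proportion_increase(pass_rate, min_pass_rate):
    """Map a test's pass-rate ratio to a proportion increase.

    Returns None when the test already meets its minimum (ratio > 1);
    otherwise the proportion grows as the ratio shrinks.
    """
    ratio = pass_rate / min_pass_rate
    if ratio > 1:
        return None
    elif ratio > 0.9:
        return 0.05
    elif ratio > 0.8:
        return 0.1
    elif ratio > 0.7:
        return 0.2
    return 0.3

# add_context example from above: 6% pass rate vs. a 65% minimum
print(proportion_increase(0.06, 0.65))  # -> 0.3
```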
fix_proportions.generate()
test_type   | pass_rate_ratio | proportion_increase
add_context | 0.85            | 0.1
fix_proportions.save(data_path='train.conll', save_path='augmented_train.conll', optimized_inplace=True)
Shall I ignore this line: "-DOCSTART- -X- -X- O\n"?
@alierenak @chakravarthik27 for this sprint's implementation, let's forget about DOCSTART and POS tags since it looks like they are not currently handled by the datahandler module.
For now, POS tags will all default to NN NN and DOCSTART will be ignored entirely.
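A minimal sketch of the handling described above, assuming CoNLL-2003 columns (token POS chunk NER) and reading "NN NN" as the default for the two middle tag columns; the helper name `read_conll_tokens` is hypothetical, not part of the datahandler module:

```python
def read_conll_tokens(lines):
    """Parse CoNLL-2003 lines (token POS chunk NER).

    -DOCSTART- headers are dropped entirely and the two middle tag
    columns are overwritten with the NN default discussed above.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            continue  # skip blank sentence separators and DOCSTART headers
        token, _pos, _chunk, ner = line.split()
        yield token, "NN", "NN", ner

sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
]
print(list(read_conll_tokens(sample)))
# -> [('EU', 'NN', 'NN', 'B-ORG'), ('rejects', 'NN', 'NN', 'O')]
```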
@alierenak @chakravarthik27 the new decided approach is the following:
from nlptest import Harness
# Create your Harness
harness = Harness(task='ner', model='ner', data='test.conll', hub='johnsnowlabs')
# Generate a report of your test cases
tests = harness.generate().run().report()
# Augment your training set based on results from first harness, default inplace is True
tests.augment(input_path='train.conll', output_path='augmented.conll', inplace=True)
# Get a report of your augmentations that looks like the screenshot attached
tests.augmentation_report()
# Train your model based on augmented dataset
fitted_pipe = nlp.load('ner').fit('augmented.conll')
# Use the configs and generated sentences from the first harness .generate() method
# so that we can make a fair comparison with the first harness test results
harness_new = ...
# Generate a report of your new harness, no need to run `.generate()`
tests = harness_new.run().report()
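Since the second harness reuses the sentences generated by the first, the two reports can be compared row by row. A hedged sketch of that comparison, using an illustrative `{test_type: pass_rate}` shape (the real report carries more fields):

```python
# Illustrative report rows; real harness reports hold more columns than this.
report_before = {"add_context": 0.06, "lowercase": 0.70}
report_after = {"add_context": 0.62, "lowercase": 0.74}

def compare_reports(before, after):
    """Per-test pass-rate delta; positive means the augmented model improved."""
    return {test: round(after[test] - before[test], 4) for test in before}

print(compare_reports(report_before, report_after))
# -> {'add_context': 0.56, 'lowercase': 0.04}
```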
Screenshot of .augmentation_report():
Hi @alierenak @luca-martial @ArshaanNazir ,
From my side, I have one idea about saving: when we think about the different dataset formats, the augmented data should also be saved in the same format as the given input, right? So we only need to implement the save functionality in the datahandler classes.
@chakravarthik27 yes augment should always save the output file in the same format as the input file
This should be added for all test types from the legacy nlptest library.
Example: Harness(...)
results -> fix = augmenting training dataset
Augmentation(results).to_conll(path) or .to_csv(path)