JohnSnowLabs / langtest

Deliver safe & effective language models
http://langtest.org/
Apache License 2.0

Add a fixing functionality in the library #92

Closed luca-martial closed 1 year ago

luca-martial commented 1 year ago

This should be added for all test types from the legacy nlptest library.

Example: Harness(...)

results -> fix = augmenting training dataset

Augmentation(results).to_conll(path) or .to_csv(path)
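A minimal sketch of what that interface could look like (class and method names here are hypothetical, not a committed API):

```python
class Augmentation:
    """Hypothetical sketch of the proposed fixing interface."""

    def __init__(self, test_results):
        # test_results: e.g. rows produced from a harness report,
        # each row a "token POS chunk NER" string in this sketch
        self.test_results = test_results

    def to_conll(self, path):
        # Write one "token POS chunk NER" line per row (stub).
        with open(path, "w") as f:
            f.write("\n".join(self.test_results) + "\n")

    def to_csv(self, path):
        # Same rows, comma-separated (stub).
        with open(path, "w") as f:
            f.write("\n".join(r.replace(" ", ",") for r in self.test_results) + "\n")
```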

luca-martial commented 1 year ago

@alierenak here are some ideas discussed with @chakravarthik27

fix_proportions = Augmentation(test_results=h.report())

# this is the rule-based stuff happening in the background, e.g.:
# add_context -> 6% pass rate | min_pass_rate 65% -> FALSE

def proportion_increase(pass_rate, min_pass_rate):
  ratio = pass_rate / min_pass_rate
  if ratio > 1:
    return None  # already passing: nothing to fix
  elif ratio > 0.9:
    return 0.05
  elif ratio > 0.8:
    return 0.1
  elif ratio > 0.7:
    return 0.2
  else:
    return 0.3

fix_proportions.generate()
test_type   | pass_rate_ratio | proportion_increase
add_context | 0.85            | 0.1

fix_proportions.save(data_path='train.conll', save_path='augmented_train.conll', optimized_inplace=True)
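Under those thresholds, the table row above can be reproduced with a small driver (the report row is hypothetical, and the rule is restated here so the example is self-contained):

```python
# Sketch: apply the rule-based thresholds to a (hypothetical) report row.
def proportion_increase(pass_rate, min_pass_rate):
    ratio = pass_rate / min_pass_rate
    if ratio > 1:
        return None          # already passing: no augmentation needed
    elif ratio > 0.9:
        return 0.05
    elif ratio > 0.8:
        return 0.1
    elif ratio > 0.7:
        return 0.2
    return 0.3

# Hypothetical report row: 55% pass rate against a 65% minimum.
report = [{"test_type": "add_context", "pass_rate": 0.55, "min_pass_rate": 0.65}]
for row in report:
    ratio = row["pass_rate"] / row["min_pass_rate"]
    inc = proportion_increase(row["pass_rate"], row["min_pass_rate"])
    print(f'{row["test_type"]} | {ratio:.2f} | {inc}')
# add_context | 0.85 | 0.1
```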
chakravarthik27 commented 1 year ago

shall I ignore this "-DOCSTART- -X- -X- O\n"

luca-martial commented 1 year ago

@alierenak @chakravarthik27 for this sprint's implementation, let's forget about DOCSTART and POS tags since it looks like they are not currently handled by the datahandler module.

For now, the POS and chunk tag columns will both default to NN, and DOCSTART lines will be ignored entirely.
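For reference, a CoNLL 2003 token line carries word, POS, chunk, and NER columns, so with the defaults above an emitted line would look like this (helper name is illustrative only):

```python
# Sketch: emit CoNLL-style token lines with default POS/chunk tags ("NN NN"),
# skipping any -DOCSTART- marker. Function name is hypothetical.
def to_conll_line(token, ner_tag, pos="NN", chunk="NN"):
    return f"{token} {pos} {chunk} {ner_tag}"

tokens = [("-DOCSTART-", "O"), ("John", "B-PER"), ("lives", "O")]
lines = [to_conll_line(t, tag) for t, tag in tokens if t != "-DOCSTART-"]
print("\n".join(lines))
# John NN NN B-PER
# lives NN NN O
```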

luca-martial commented 1 year ago

@alierenak @chakravarthik27 the new decided approach is the following:

from nlptest import Harness

# Create your Harness
harness = Harness(task='ner', model='ner', data='test.conll', hub='johnsnowlabs')

# Generate a report of your test cases
tests = harness.generate().run().report()

# Augment your training set based on results from first harness, default inplace is True
tests.augment(input_path='train.conll', output_path='augmented.conll', inplace=True)

# Get a report of your augmentations that looks like the screenshot attached
tests.augmentation_report()

# Train your model based on augmented dataset
fitted_pipe = nlp.load('ner').fit('augmented.conll')

# Use the configs and generated sentences from the first harness .generate() method
# so that we can make a fair comparison with the first harness test results
harness_new = ...

# Generate a report of your new harness, no need to run `.generate()`
tests = harness_new.run().report()

Screenshot of .augmentation_report():

chakravarthik27 commented 1 year ago

Hi @alierenak @luca-martial @ArshaanNazir ,

From my side, one idea about saving: since datasets come in different formats, the augmented data should also be saved in the same format as the given input, right? If so, we only need to implement the save functionality in the datahandler classes.

luca-martial commented 1 year ago

> Hi @alierenak @luca-martial @ArshaanNazir ,
>
> From my side, one idea about saving: since datasets come in different formats, the augmented data should also be saved in the same format as the given input, right? If so, we only need to implement the save functionality in the datahandler classes.

@chakravarthik27 yes augment should always save the output file in the same format as the input file
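One way to keep the output format tied to the input format is to dispatch on the input file's extension inside the datahandler layer. A sketch under that assumption (function name and formats are illustrative, not the actual datahandler API):

```python
import os

# Sketch: pick the writer from the input file's extension so augmented data
# is always saved in the same format it was loaded from. Names hypothetical.
def save_augmented(samples, input_path, output_path):
    ext = os.path.splitext(input_path)[1].lower()
    if ext == ".conll":
        # one "token POS chunk NER" line per row
        content = "\n".join(samples)
    elif ext == ".csv":
        content = "\n".join(",".join(s.split()) for s in samples)
    else:
        raise ValueError(f"unsupported format: {ext}")
    with open(output_path, "w") as f:
        f.write(content + "\n")
    return ext
```

In the real library this dispatch would live in the datahandler classes themselves, so each format's loader and saver stay paired.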