Datawheel / template-chatbot

Template repository for a chatbot instance
MIT License

28th of June Update #5

Open pippo-sci opened 4 weeks ago

pippo-sci commented 4 weeks ago

Fine-tuning:

nicolasnetz commented 4 weeks ago

Company Names: Clustering using Semi-Supervised Learning

I built a proof of concept of a clustering model that uses semi-supervised learning to group company names based on their name and address (further variables could be included in the future).

To test it, I used the dataset that we have been manually cleaning in the spreadsheet, and trained the model on a subset of the manually validated names.

Overall, the model saw 3910 rows of data such as:

| raw_name_id | raw_name | raw_address |
|---|---|---|
| 58517298 | MELISSA & DOUG, LLC | 10 WESTPORT ROAD WILTON CT 06897 US |
| 34062848 | MELISSA & DOUG, LLC | 141 DANBURY RD WILTON CT 06897-441 US |
| 20411 | MELISSA & DOUG LLC | 141 DANBURY ROAD WILTON CT 068 USA |
| 819001 | MELISSA & DOUG, LLC | 141 DANBURY ROAD WILTON CT 06897 USA |
| 68400195 | MELISSA & DOUG LLC | 10 WESTPORT ROAD WILTON CT 06897 US |

These rows correspond to 276 manually identified companies.

Then, out of those 3910 rows (which produce 15,288,100 pairs), I trained the model in a semi-supervised setting by manually reviewing 130 pairs of rows and marking them as "the same company" (114) or "not the same company" (17).
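As a side note on the pair count: 15,288,100 equals 3910 squared, which counts every ordered combination of rows, including each row paired with itself. If only distinct unordered pairs are compared, the count is roughly half of that. A quick check (the function names are illustrative, not part of the library used):

```python
def ordered_pairs(n: int) -> int:
    """All (i, j) combinations, including i == j and both orderings."""
    return n * n


def distinct_pairs(n: int) -> int:
    """Unordered pairs of two different rows: n choose 2."""
    return n * (n - 1) // 2


rows = 3910
print(ordered_pairs(rows))   # 15288100
print(distinct_pairs(rows))  # 7642095
```

Either way, the 130 reviewed pairs are a very small fraction of the candidate pairs.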

After this, the model ran the clustering and output, for each row, the ID of the cluster it belongs to. It found 331 clusters. The following table shows the model output, with the cluster ID and confidence score.

| Cluster ID | confidence_score | raw_name_id | raw_name | raw_address |
|---|---|---|---|---|
| 80 | 0.859248 | 58517298 | MELISSA & DOUG, LLC | 10 WESTPORT ROAD WILTON CT 06897 US |
| 80 | 0.859248 | 34062848 | MELISSA & DOUG, LLC | 141 DANBURY RD WILTON CT 06897-441 US |
| 80 | 0.859263 | 20411 | MELISSA & DOUG LLC | 141 DANBURY ROAD WILTON CT 068 USA |
| 80 | 0.859264 | 819001 | MELISSA & DOUG, LLC | 141 DANBURY ROAD WILTON CT 06897 USA |
| 80 | 0.859250 | 68400195 | MELISSA & DOUG LLC | 10 WESTPORT ROAD WILTON CT 06897 US |

To assess the result, the precision and recall of the process were calculated against the ground truth that we manually created in the spreadsheet.
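For reference, one common way to score a clustering against a ground truth is pairwise precision and recall: a pair of rows counts as a hit when both the model and the ground truth put the two rows in the same group. The sketch below assumes that definition, which may differ from the exact calculation used here; the row IDs and labels are toy values, not taken from the spreadsheet.

```python
from itertools import combinations


def same_group_pairs(labels: dict) -> set:
    """Return the set of unordered row-id pairs that share a label."""
    groups = {}
    for row_id, label in labels.items():
        groups.setdefault(label, []).append(row_id)
    pairs = set()
    for members in groups.values():
        # sorted() makes (a, b) canonical so the two sets are comparable
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted: dict, truth: dict):
    """Compare predicted cluster IDs with ground-truth company IDs."""
    pred_pairs = same_group_pairs(predicted)
    true_pairs = same_group_pairs(truth)
    hits = len(pred_pairs & true_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Toy example: the model splits company "A" into two clusters.
truth = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
predicted = {1: 80, 2: 80, 3: 81, 4: 82, 5: 82}
precision, recall = pairwise_precision_recall(predicted, truth)
print(precision, recall)  # 1.0 0.5
```

In this toy case, splitting one company across two clusters leaves precision intact but lowers recall, which is the pattern to expect when the model finds more clusters than there are real companies.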

To further test the model and check that it is not overfitting, the same test was applied to data outside the training set. This data consisted of 9000 company names, which the model grouped into 61 clusters, although they had been manually identified as 30 companies. Here the precision and recall were somewhat lower, but still acceptable.

The goal now is to make this scale so that it finds more clusters. This run was done in memory from a CSV file, but the library can also connect to a Postgres database and work with more rows.

Keep in mind that this test only looks at 9000 company names, while the entire dataset has around 10,000,000 company names for consignee names alone.

| true_id | true_name | validated_name | raw_name_id | raw_name | raw_address | count_value |
|---|---|---|---|---|---|---|
| 63 | MELISSA & DOUG, LLC | Melissa & Doug | 58517298 | MELISSA & DOUG, LLC | 10 WESTPORT ROAD WILTON CT 06897 US | 607 |
| 63 | MELISSA & DOUG, LLC | Melissa & Doug | 34062848 | MELISSA & DOUG, LLC | 141 DANBURY RD WILTON CT 06897-441 US | 497 |
| 63 | MELISSA & DOUG, LLC | Melissa & Doug | 20411 | MELISSA & DOUG LLC | 141 DANBURY ROAD WILTON CT 068 USA | 295 |
| 63 | MELISSA & DOUG, LLC | Melissa & Doug | 819001 | MELISSA & DOUG, LLC | 141 DANBURY ROAD WILTON CT 06897 USA | 249 |
| 63 | MELISSA & DOUG, LLC | Melissa & Doug | 68400195 | MELISSA & DOUG LLC | 10 WESTPORT ROAD WILTON CT 06897 US | 170 |

alebjanes commented 4 weeks ago

1.1 Question ID = 1 (6,231 questions of type "How much did Exporter Country export in Year?")

| Model | Question ID | Correct matches | Accuracy (%) |
|---|---|---|---|
| Mixtral | 1 | 333 | 5.3 |
| Llama3 | 1 | 398 | 6.4 |
| all-mpnet-base-v2 | 1 | 4425 | 71.0 |
| multi-qa-MiniLM-L6-cos-v1 | 1 | 4260 | 68.4 |
| multi-qa-mpnet-base-cos-v1 | 1 | 4858 | 78.0 |
| all-MiniLM-L12-v2 | 1 | 4027 | 64.6 |

1.2 Question ID = 2 (20,000 out of 46,872 questions of type "How much did Exporter Country export of HS in 2022?")

| Model | Question ID | Correct matches | Accuracy (%) |
|---|---|---|---|
| Mixtral | 2 | 288 | 1.4 |
| Llama3 | 2 | 516 / 10325 | ~5.0 |
| all-mpnet-base-v2 | 2 | 19734 | 98.7 |
| multi-qa-MiniLM-L6-cos-v1 | 2 | 19639 | 98.2 |
| multi-qa-mpnet-base-cos-v1 | 2 | 19782 | 98.9 |
| all-MiniLM-L12-v2 | 2 | 19700 | 98.5 |

1.3 Question ID = 3 (15,000 out of 34,955 questions of type "How much HS was traded in Year?")

| Model | Question ID | Correct matches | Accuracy (%) |
|---|---|---|---|
| Mixtral | 3 | 732 | 4.9 |
| Llama3 | 3 | 646 / 8052 | 8.0 |
| all-mpnet-base-v2 | 3 | 13499 | 90.0 |
| multi-qa-MiniLM-L6-cos-v1 | 3 | 13862 | 92.4 |
| multi-qa-mpnet-base-cos-v1 | 3 | 14435 | 96.2 |
| all-MiniLM-L12-v2 | 3 | 12860 | 85.7 |
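The accuracy figures above are consistent with correct matches divided by the number of questions evaluated. For the Llama3 rows in 1.2 and 1.3, this appears to use the number after the slash as the denominator (an inference from the figures, not something stated in the results). A small sketch that recomputes the Question ID 1 column, with a hypothetical helper name:

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Share of correct matches, as a percentage rounded to one decimal."""
    return round(100 * correct / total, 1)


# Counts copied from the Question ID = 1 results (6,231 questions).
question_1_total = 6231
question_1_correct = {
    "Mixtral": 333,
    "Llama3": 398,
    "all-mpnet-base-v2": 4425,
    "multi-qa-MiniLM-L6-cos-v1": 4260,
    "multi-qa-mpnet-base-cos-v1": 4858,
    "all-MiniLM-L12-v2": 4027,
}

for model, correct in question_1_correct.items():
    print(f"{model}: {accuracy_pct(correct, question_1_total)}")

# Llama3 on Question ID 2, assuming 10325 is the number of questions evaluated.
print(accuracy_pct(516, 10325))
```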

For next week (on my side):

  1. Add more context to the corpus to answer these new questions
  2. Evaluate RAG

Then for the week after:

  1. Evaluate the multi-layer approach we were working on initially