28th of June Update - Githubissues

pippo-sci commented 4 weeks ago

Fine-tuning:

4 models are working on Ollama (3 TinyLlama versions trained for 1, 10, and 50 epochs)
I was able to train a Llama2 model (1 epoch only)
Llama.cpp deprecated some functionality, which complicated converting safetensors to GGUF format
Running evaluation on a small set locally

nicolasnetz commented 4 weeks ago

Company Names: Clustering Using Semi-Supervised Learning

I built a proof of concept of a clustering model that uses semi-supervised learning to group company names based on their name and address (further variables could be included in the future).

To test it, I used the dataset that we have been manually cleaning in the spreadsheet.

I trained the model on a subset of the manually validated names, selected as follows:

The companies that show up most often
For each company, only the 15 most repeated matches
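
The selection above can be sketched roughly as follows. This is a minimal illustration, not the code that was actually run: the function name `select_training_rows` and the `top_companies` parameter are hypothetical, and it assumes that "most show up" means the largest total `count_value` per `true_id` (both column names come from the validated spreadsheet).

```python
from collections import defaultdict

def select_training_rows(rows, top_companies, top_variants=15):
    """Pick a training subset from manually validated rows.

    rows: iterable of dicts with at least 'true_id' and 'count_value' keys.
    Assumption: a company's frequency is the sum of its rows' count_value.
    """
    by_company = defaultdict(list)
    for row in rows:
        by_company[row["true_id"]].append(row)

    # Rank companies by how often they appear overall, most frequent first.
    ranked = sorted(
        by_company,
        key=lambda cid: sum(r["count_value"] for r in by_company[cid]),
        reverse=True,
    )

    selected = []
    for cid in ranked[:top_companies]:
        # Within a company, keep only the most repeated name/address variants.
        variants = sorted(
            by_company[cid], key=lambda r: r["count_value"], reverse=True
        )
        selected.extend(variants[:top_variants])
    return selected
```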

Overall, the model saw 3910 rows of data such as:

raw_name_id	raw_name	raw_address
58517298	MELISSA & DOUG, LLC	10 WESTPORT ROAD WILTON CT 06897 US
34062848	MELISSA & DOUG, LLC	141 DANBURY RD WILTON CT 06897-441 US
20411	MELISSA & DOUG LLC	141 DANBURY ROAD WILTON CT 068 USA
819001	MELISSA & DOUG, LLC	141 DANBURY ROAD WILTON CT 06897 USA
68400195	MELISSA & DOUG LLC	10 WESTPORT ROAD WILTON CT 06897 US

These rows correspond to 276 manually identified companies.

Then, from those 3910 rows (which produce 15,288,100 pairs), I trained the model in a semi-supervised setting by manually reviewing 130 pairs of rows and marking each as "the same company" (114) or "not the same company" (17).
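
For reference, the pair count quoted above equals 3910 × 3910, that is, every row paired with every row, itself included. The short check below reproduces that arithmetic and, purely for comparison, also computes the number of distinct unordered pairs. Whether the library actually scores all of these pairs is not stated here, so treat the second figure as illustrative only.

```python
n_rows = 3910

# Every row paired with every row (ordered, self-pairs included).
ordered_pairs = n_rows * n_rows

# Unordered pairs of two different rows.
distinct_pairs = n_rows * (n_rows - 1) // 2

print(ordered_pairs)   # 15288100
print(distinct_pairs)  # 7642095
```

Either way, the 130 manually reviewed pairs are a tiny fraction of the pairs the model has to generalize to.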

After this, the model applied the clustering and output a cluster ID for each row. It found 331 clusters. The following table shows the model output, with the cluster ID and confidence score.

Cluster ID	confidence_score	raw_name_id	raw_name	raw_address
80	0.859248	58517298	MELISSA & DOUG, LLC	10 WESTPORT ROAD WILTON CT 06897 US
80	0.859248	34062848	MELISSA & DOUG, LLC	141 DANBURY RD WILTON CT 06897-441 US
80	0.859263	20411	MELISSA & DOUG LLC	141 DANBURY ROAD WILTON CT 068 USA
80	0.859264	819001	MELISSA & DOUG, LLC	141 DANBURY ROAD WILTON CT 06897 USA
80	0.859250	68400195	MELISSA & DOUG LLC	10 WESTPORT ROAD WILTON CT 06897 US

To assess the result, the precision and recall were calculated against the ground truth that we manually created in the spreadsheet.

precision: 0.9169
recall: 0.9452
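
One common way to score a clustering against a ground truth is pairwise precision and recall: a pair of rows counts as a hit when both the model and the ground truth put the two rows in the same group. The sketch below assumes that definition; the update does not say exactly how the figures above were computed, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def same_group_pairs(labels):
    """Return the set of unordered row-id pairs that share a label.

    labels: dict mapping row id -> group id (cluster ID or true company ID).
    """
    groups = defaultdict(list)
    for row_id, group_id in labels.items():
        groups[group_id].append(row_id)
    pairs = set()
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs

def pairwise_precision_recall(predicted, truth):
    """Pairwise precision and recall of predicted clusters vs. ground truth."""
    pred_pairs = same_group_pairs(predicted)
    true_pairs = same_group_pairs(truth)
    hits = len(pred_pairs & true_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(true_pairs) if true_pairs else 0.0
    return precision, recall

# The five sample rows: all in cluster 80, all validated as company 63.
row_ids = [58517298, 34062848, 20411, 819001, 68400195]
predicted = {rid: 80 for rid in row_ids}
truth = {rid: 63 for rid in row_ids}
print(pairwise_precision_recall(predicted, truth))  # (1.0, 1.0)
```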

To further test the model and check that it is not overfitting, the same test was applied to data outside the training set: 9000 company names, which the model grouped into 61 clusters but which were manually identified as 30 companies. Here the precision went up while the recall dropped, which is consistent with the model splitting companies into more clusters than needed. Still not bad.

precision: 0.9999
recall: 0.7227
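
One reading consistent with these numbers (61 clusters for 30 companies) is that some companies were split across several clusters without different companies being mixed together. Assuming pairwise metrics, that pattern keeps precision near 1 while lowering recall. The toy calculation below illustrates the effect; the function names and sizes are hypothetical, not taken from the run.

```python
def pairs_within(size):
    """Number of unordered pairs among `size` rows."""
    return size * (size - 1) // 2

def split_precision_recall(company_size, cluster_sizes):
    """Metrics for one true company split across several pure clusters."""
    if sum(cluster_sizes) != company_size:
        raise ValueError("cluster sizes must add up to the company size")
    true_pairs = pairs_within(company_size)
    predicted_pairs = sum(pairs_within(s) for s in cluster_sizes)
    # Every predicted pair is correct because no cluster mixes companies.
    precision = 1.0 if predicted_pairs else 0.0
    recall = predicted_pairs / true_pairs if true_pairs else 1.0
    return precision, recall

# A company with 4 rows split into two clusters of 2 rows each:
# 2 of the 6 true pairs are recovered.
print(split_precision_recall(4, [2, 2]))
```

Under this reading, merging clusters that belong to the same company would be the main lever for improving recall.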

The goal now is to make this scale so that it finds more clusters. This was done in RAM using a CSV, but the library can also connect to a PostgreSQL database and work with more rows.

Keep in mind that this only looks at 9000 company names, while the entire dataset has around 10,000,000 company names for consignees alone.

true_id	true_name	validated_name	raw_name_id	raw_name	raw_address	count_value
63	MELISSA & DOUG, LLC	Melissa & Doug	58517298	MELISSA & DOUG, LLC	10 WESTPORT ROAD WILTON CT 06897 US	607
63	MELISSA & DOUG, LLC	Melissa & Doug	34062848	MELISSA & DOUG, LLC	141 DANBURY RD WILTON CT 06897-441 US	497
63	MELISSA & DOUG, LLC	Melissa & Doug	20411	MELISSA & DOUG LLC	141 DANBURY ROAD WILTON CT 068 USA	295
63	MELISSA & DOUG, LLC	Melissa & Doug	819001	MELISSA & DOUG, LLC	141 DANBURY ROAD WILTON CT 06897 USA	249
63	MELISSA & DOUG, LLC	Melissa & Doug	68400195	MELISSA & DOUG LLC	10 WESTPORT ROAD WILTON CT 06897 US	170

alebjanes commented 4 weeks ago

Regarding the embedding evaluation, Llama3 was taking a lot of time and its results were not improving, so I decided to stop it early. For the other 4 models (the ones with better results), I ran them again, this time keeping track of some more metrics we’ll use when evaluating other approaches. Updated results for the embeddings:

1.1 Question ID = 1 (6,231 questions of type "How much did Exporter Country export in Year?")

Model	Question ID	Correct matches	Accuracy (%)
Mixtral	1	333	5.3
Llama3	1	398	6.4
all-mpnet-base-v2	1	4425	71.0
multi-qa-MiniLM-L6-cos-v1	1	4260	68.4
multi-qa-mpnet-base-cos-v1	1	4858	78.0
all-MiniLM-L12-v2	1	4027	64.6

1.2 Question ID = 2 (20,000 out of 46,872 questions of type "How much did Exporter Country export of HS in 2022?")

Model	Question ID	Correct matches	Accuracy (%)
Mixtral	2	288	1.4
Llama3	2	516 / 10325	~5.0
all-mpnet-base-v2	2	19734	98.7
multi-qa-MiniLM-L6-cos-v1	2	19639	98.2
multi-qa-mpnet-base-cos-v1	2	19782	98.9
all-MiniLM-L12-v2	2	19700	98.5

1.3 Question ID = 3 (15,000 out of 34,955 questions of type "How much HS was traded in Year?")

Model	Question ID	Correct matches	Accuracy (%)
Mixtral	3	732	4.9
Llama3	3	646 / 8052	8.0
all-mpnet-base-v2	3	13499	90.0
multi-qa-MiniLM-L6-cos-v1	3	13862	92.4
multi-qa-mpnet-base-cos-v1	3	14435	96.2
all-MiniLM-L12-v2	3	12860	85.7
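
The accuracy column in the tables above is the number of correct matches divided by the number of questions evaluated (for example, 4858 / 6231 ≈ 78.0%). A minimal sketch of such an evaluation follows, assuming a match is "correct" when the corpus entry nearest to the question by cosine similarity is the expected one. That top-1 criterion and the function names are assumptions, and the vectors would come from whichever embedding model is being tested.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top1_accuracy(question_vecs, corpus_vecs, expected_ids):
    """Return (correct matches, accuracy in %) for top-1 retrieval.

    expected_ids[i] is the index in corpus_vecs that question i should match.
    """
    correct = 0
    for q_vec, expected in zip(question_vecs, expected_ids):
        best = max(
            range(len(corpus_vecs)),
            key=lambda i: cosine(q_vec, corpus_vecs[i]),
        )
        correct += best == expected
    return correct, 100.0 * correct / len(expected_ids)

# Reproducing one table entry from its counts.
print(round(100 * 4858 / 6231, 1))  # 78.0
```

For partial runs such as Llama3, the denominator is the number of questions processed before the run was stopped, not the full question set.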

Now I’m finishing up a new evaluation set of 100 questions we’ll use across all approaches. This set has simple questions that are present in the corpus, and also some more complex ones (like growth, top exporters, etc.) that the models might be able to answer. We want to keep this evaluation set small in order to get fast results, keep the costs low, and also be able to manually evaluate the models’ answers. The approaches we’ll be evaluating with this set are: RAG, multi-layer, a simple GPT API call (like asking ChatGPT), and fine-tuning.

For next week (on my side):

Add more context to the corpus to answer these new questions
Evaluate RAG

Then for the week after:

Evaluate the multi-layer approach we were working on initially

Datawheel / template-chatbot

28th of June Update #5