Feature Request: Adding dataset deduplication process

Weyaxi commented 1 month ago

⚠️ Please check that this feature request hasn't been suggested before.

[X] I searched previous Ideas in Discussions didn't find any similar feature requests.
[X] I searched previous Issues didn't find any similar feature requests.

🔖 Feature description

A dataset deduplication progress feature could be useful for Axolotl. Especially since many users input their datasets in various formats and configurations, having a deduplication process at the end when all these datasets are merged would be very beneficial for developers fine-tuning models.

✔️ Solution

In my use case, adding a 'dedup_datasets_in_end' (this variable name is only a example) variable and the necessary parameters for the deduplication process would be very beneficial.

❓ Alternatives

There are many algorithms, GitHub repositories, and tools for dataset deduplication. For example, the main algorithm that comes to mind is MinHash. Incorporating such algorithms over time would be very beneficial.

📝 Additional Context

No response

Acknowledgements

[X] My issue title is concise, descriptive, and in title casing.
[X] I have searched the existing issues to make sure this feature has not been requested yet.
[X] I have provided enough information for the maintainers to understand and evaluate this request.

olivermolenschot commented 1 week ago

@Weyaxi Are you talking about exact deduplication or fuzzy deduplication? I think exact deduplication is more revelant.

Weyaxi commented 1 week ago

Hi @olivermolenschot,

I was referring to exact deduplication here, but it might be worth discussing the addition of fuzzy deduplication later on as well :)

The use case for this is as follows:

There are large curated datasets that have been published, but when developers want to use both of them, they have to work on deduplicating the merged datasets. This is because there will often be some duplication (e.g., Big Dataset A contains samples from small dataset x, and Big Dataset B also contains samples from small dataset x for example). But this can give devs some hard time because of the format diffrences etc.

olivermolenschot commented 1 week ago

Can you give examples on what would be format differences @Weyaxi ? I think I can easily provide a de-duplication feature for when the rows are an exact match. however if the format changes, we need to be more precise about what type of format changes are occurring. Covering all possible format changes might be tedious.

Weyaxi commented 1 week ago

The format difference I mentioned is that, for example, I use both ShareGPT and Alpaca-type datasets at the same time when writing my config, but Axolotl merges those datasets into a single format in the end, right?

So, if I wanted to handle deduplication on my own, I would need to follow these steps in a typical scenario:

Convert all datasets to a single format.
Perform deduplication on my own.
Create a new dataset.
Input that dataset into Axolotl.

If Axolotl could handle that with a single line of change on the config file, it would be very beneficial IMO.

That's what I'm talking about.

olivermolenschot commented 1 week ago

I'm working on this feature.

axolotl-ai-cloud / axolotl