Open Weyaxi opened 1 month ago
@Weyaxi Are you talking about exact deduplication or fuzzy deduplication? I think exact deduplication is more revelant.
Hi @olivermolenschot,
I was referring to exact deduplication here, but it might be worth discussing the addition of fuzzy deduplication later on as well :)
The use case for this is as follows:
There are large curated datasets that have been published, but when developers want to use both of them, they have to work on deduplicating the merged datasets. This is because there will often be some duplication (e.g., Big Dataset A contains samples from small dataset x, and Big Dataset B also contains samples from small dataset x for example). But this can give devs some hard time because of the format diffrences etc.
Can you give examples on what would be format differences @Weyaxi ? I think I can easily provide a de-duplication feature for when the rows are an exact match. however if the format changes, we need to be more precise about what type of format changes are occurring. Covering all possible format changes might be tedious.
The format difference I mentioned is that, for example, I use both ShareGPT and Alpaca-type datasets at the same time when writing my config, but Axolotl merges those datasets into a single format in the end, right?
So, if I wanted to handle deduplication on my own, I would need to follow these steps in a typical scenario:
If Axolotl could handle that with a single line of change on the config file, it would be very beneficial IMO.
That's what I'm talking about.
I'm working on this feature.
β οΈ Please check that this feature request hasn't been suggested before.
π Feature description
A dataset deduplication progress feature could be useful for Axolotl. Especially since many users input their datasets in various formats and configurations, having a deduplication process at the end when all these datasets are merged would be very beneficial for developers fine-tuning models.
βοΈ Solution
In my use case, adding a 'dedup_datasets_in_end' (this variable name is only a example) variable and the necessary parameters for the deduplication process would be very beneficial.
β Alternatives
There are many algorithms, GitHub repositories, and tools for dataset deduplication. For example, the main algorithm that comes to mind is MinHash. Incorporating such algorithms over time would be very beneficial.
π Additional Context
No response
Acknowledgements