argilla-io / argilla

Argilla is a collaboration platform for AI engineers and domain experts that require high-quality outputs, full data ownership, and overall efficiency.

https://docs.argilla.io/en/latest/

[ENHANCEMENT] optimise SDK log method and support mapping incoming columns to multiple dataset attributes #5107

Open burtenshaw opened 5 days ago

burtenshaw commented 5 days ago

This PR supports mapping incoming columns/keys to multiple dataset attributes through two changes:

supports tuple values in the mapping parameter of the log method, so that a user can map one incoming column to two attributes by specifying them as a tuple.
refactors the _ingest_records methods so that the mapping is resolved once, before the ingestion loop, instead of inside it.
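
The two points above can be sketched without the SDK. This is a minimal illustration, not Argilla's implementation: `normalise_mapping`, `apply_mapping`, and `map_records` are hypothetical helper names, and records are assumed to be plain dictionaries.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical type: an incoming key maps to one attribute name or a tuple of them.
Mapping = Dict[str, Union[str, Tuple[str, ...]]]


def normalise_mapping(mapping: Mapping) -> Dict[str, Tuple[str, ...]]:
    """Resolve the mapping once so that every value is a tuple of attribute names."""
    return {
        source: (targets,) if isinstance(targets, str) else tuple(targets)
        for source, targets in mapping.items()
    }


def apply_mapping(record: dict, resolved: Dict[str, Tuple[str, ...]]) -> dict:
    """Copy each mapped incoming key to all of its target attributes."""
    # Keys that are not mentioned in the mapping pass through unchanged.
    out = {key: value for key, value in record.items() if key not in resolved}
    for source, targets in resolved.items():
        if source in record:
            for target in targets:
                out[target] = record[source]
    return out


def map_records(records: List[dict], mapping: Mapping) -> List[dict]:
    resolved = normalise_mapping(mapping)  # done once, before the loop
    return [apply_mapping(record, resolved) for record in records]
```

With this sketch, `map_records(rows, {"text": ("prompt", "question")})` would copy each row's `text` value into both `prompt` and `question`; the attribute names here are made up for the example.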

This PR also optimises the log method so that it takes less time and is easier to work with:

uses tqdm to show ingestion progress
raises an exception to report bad records
iterates over the mapping rather than over the data
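
As a rough sketch of how progress reporting and bad-record reporting could fit together (an assumption about the general pattern, not the code in this PR): `BadRecordsError` and `ingest` are hypothetical names, and `convert` stands in for whatever turns an incoming row into a record.

```python
from typing import Callable, Iterable, List, Tuple

try:
    from tqdm import tqdm  # progress bar, if installed
except ImportError:  # fall back to a no-op wrapper so the sketch still runs
    def tqdm(iterable, **kwargs):
        return iterable


class BadRecordsError(ValueError):
    """Hypothetical exception carrying every record that failed to convert."""

    def __init__(self, failures: List[Tuple[int, dict, Exception]]):
        self.failures = failures
        super().__init__(f"{len(failures)} record(s) could not be ingested")


def ingest(records: Iterable[dict], convert: Callable[[dict], dict]) -> List[dict]:
    converted, failures = [], []
    for index, record in enumerate(tqdm(records, desc="Ingesting records")):
        try:
            converted.append(convert(record))
        except Exception as exc:  # keep going so that all bad records are reported
            failures.append((index, record, exc))
    if failures:
        # One exception at the end lists every bad record with its position.
        raise BadRecordsError(failures)
    return converted
```

Collecting failures and raising once at the end means a single run surfaces every bad record, rather than stopping at the first one.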

Improvement (a change that improves existing functionality)

How Has This Been Tested

Tests have been modified, deprecated, and updated to support the changes in the ingestion flow.

Checklist

I added relevant documentation
My code follows the style guidelines of this project
I did a self-review of my code
I made corresponding changes to the documentation
I confirm my changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

nataliaElv commented 5 days ago

I'd rather have the records ingested and pushed in batches, with an easy way to identify those that threw an error, fix them, and try to import them again. Otherwise it can take ages until I see any records in my dataset.
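
One way to picture the batched flow suggested here, as a sketch under assumptions rather than a proposal for the SDK's API: `batched` and `log_in_batches` are hypothetical helpers, `validate` is assumed to raise `ValueError` for a bad record, and `push` is assumed to send one batch to the server.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def batched(records: Sequence[dict], batch_size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def log_in_batches(
    records: Sequence[dict],
    validate: Callable[[dict], None],
    push: Callable[[List[dict]], None],
    batch_size: int = 256,
) -> List[Tuple[dict, Exception]]:
    """Push each batch as soon as it is validated; return the records that failed."""
    failed: List[Tuple[dict, Exception]] = []
    for batch in batched(records, batch_size):
        valid = []
        for record in batch:
            try:
                validate(record)
                valid.append(record)
            except ValueError as exc:
                failed.append((record, exc))
        if valid:
            push(valid)  # good records become visible after every batch
    return failed
```

The returned list pairs each failing record with its error, so the caller can fix those records and pass only them to `log_in_batches` again.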