code-kern-ai / refinery

The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
https://www.kern.ai
Apache License 2.0
1.4k stars, 68 forks

Upload into data storage #67

Open jhoetter opened 2 years ago

jhoetter commented 2 years ago

Is your feature request related to a problem? Please describe.
I have multiple files that I want to combine, e.g. source_a and source_b, or I want to modify data before I load it into a project. Generally, I want to be able to program what I give as input.
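To make the request concrete, here is a minimal sketch of the kind of input programming meant above. It is plain Python and not an existing refinery API: `combine_sources`, `normalize`, and the in-memory stand-ins for source_a and source_b are hypothetical names used only for illustration.

```python
import csv
import io


def combine_sources(*sources, transform=None):
    """Merge rows from several CSV file-like objects into one list of dicts.

    Each source is a (name, file_like) pair. Hypothetical helper, not part of refinery.
    """
    combined = []
    for name, handle in sources:
        for row in csv.DictReader(handle):
            row = dict(row)
            row["source"] = name  # remember which file the row came from
            if transform is not None:
                row = transform(row)  # modify the row before it is loaded
            combined.append(row)
    return combined


def normalize(row):
    """Example modification: trim and lowercase the text column."""
    row["text"] = row["text"].strip().lower()
    return row


# In-memory stand-ins for two uploaded files (assumed to share the same columns).
source_a = io.StringIO("id,text\n1,Hello World\n2,Good morning\n")
source_b = io.StringIO("id,text\n3,  good evening \n")

records = combine_sources(
    ("source_a", source_a),
    ("source_b", source_b),
    transform=normalize,
)
```

The resulting `records` list is what I would then want to hand to a project as its input.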

Describe the solution you'd like
Uploaded files should be kept in some data storage, and their contents should be accessible programmatically. For instance, if I want to label duplicates in my data, I want to be able to loop over the rows and compare their embeddings, so that only the interesting potential-duplicate rows are inserted into my project.
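The duplicate example could look roughly like the sketch below. It assumes each stored row already carries an embedding as a list of floats; the row layout, the `find_potential_duplicates` helper, and the 0.95 threshold are all assumptions for illustration, not refinery functionality. Cosine similarity is used here as one possible way to compare embeddings.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_potential_duplicates(rows, threshold=0.95):
    """Return (i, j, score) for every pair of rows whose embeddings are close.

    Compares all pairs, which is fine for a sketch but quadratic in the row count.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        score = cosine_similarity(a["embedding"], b["embedding"])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Toy rows with made-up embeddings (assumed layout: text plus embedding vector).
rows = [
    {"text": "Reset my password", "embedding": [0.9, 0.1, 0.0]},
    {"text": "How do I reset my password?", "embedding": [0.88, 0.12, 0.01]},
    {"text": "Cancel my subscription", "embedding": [0.0, 0.2, 0.95]},
]

candidates = find_potential_duplicates(rows, threshold=0.95)

# Only rows that take part in at least one candidate pair would be inserted.
to_insert = sorted({i for i, _, _ in candidates} | {j for _, j, _ in candidates})
```

For larger uploads the pairwise loop would need to be replaced by an approximate nearest-neighbour search, but the idea of filtering rows before they reach the project stays the same.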

Describe alternatives you've considered
Implementing that workflow outside of the app and then inserting the data.

Additional context
For example, this would be interesting for building training data for encoders that help detect duplicates in my data.

jhoetter commented 1 year ago

This is currently implemented in the workflow product.