Argilla is a collaboration platform for AI engineers and domain experts who require high-quality outputs, full data ownership, and overall efficiency.
Feel free to reuse as much of these tutorials as possible, but this is also a good opportunity to review and rewrite them.
Some things to keep in mind:
Start by identifying a real-world problem and/or dataset. It shouldn't be a toy example but something someone might actually search for.
Explain the data and the model, and emphasize why they are used. E.g., GLiNER works for zero-shot NER but is costly at inference time, so we can start with it and then move over to SpanMarker as a cost-efficient few-shot technique.
Evaluate and show results! E.g., we've optimized the RAG pipeline so we get better retrieval results, or we've optimized a NER model so it now classifies X.
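"Show results" can be as concrete as span-level precision/recall for the NER case. A minimal, library-free sketch of that metric (the gold/predicted spans below are made-up toy data, not from any real model):

```python
# Span-level NER evaluation: an entity counts as correct only on an
# exact (start, end, label) match.

def span_f1(gold, pred):
    """Return (precision, recall, f1) over sets of (start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # exact matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: the model got one span right and mislabelled the other.
gold = [(0, 6, "PERSON"), (11, 17, "ORG")]
pred = [(0, 6, "PERSON"), (11, 17, "LOC")]
p, r, f = span_f1(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")  # precision=0.50 recall=0.50 f1=0.50
```

Reporting before/after numbers like these for the zero-shot baseline vs. the few-shot model makes the cost/quality trade-off in the tutorial tangible.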
### Tasks
- [ ] Bootstrapping textcat with few-shot SetFit and potentially sentence-transformers for semantic search.
- [ ] Bootstrapping spancat/NER with zero-shot and few-shot GLiNER.
- [ ] Bootstrapping a project with LLMs using spacy-llm or other methods like llama-index, prompt engineering, etc. (feel free to choose what you want).
- [ ] Multi-modal project with sentence-transformers and bulk labelling of images/PDFs, etc.
- [ ] Monitor data, model, and annotator drift with BERTopic and TextDescriptives.
- [ ] RAG: optimize retrievers and rerankers with Haystack and sentence-transformers.
- [ ] RAG: optimize LLMs with Haystack and TRL.
- [ ] Instruction-tuning an LLM: SFT with TRL
- [ ] Preference tuning an LLM: DPO with TRL
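For the first task above, the semantic-search idea can be sketched without any model download: embed a handful of labelled examples plus the unlabelled pool, then propagate each unlabelled item's label from its nearest labelled neighbour. The 2-d vectors below are toy stand-ins for real sentence-transformers embeddings, purely for illustration:

```python
import math

# Few-shot label propagation via nearest-neighbour search: the core idea
# behind bootstrapping textcat annotations with semantic search.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def propagate_labels(labelled, unlabelled):
    """Give each unlabelled vector the label of its most similar labelled one."""
    suggestions = []
    for vec in unlabelled:
        best_vec, best_label = max(labelled, key=lambda item: cosine(vec, item[0]))
        suggestions.append(best_label)
    return suggestions

# Toy embeddings: two labelled seeds, two unlabelled candidates.
labelled = [([1.0, 0.1], "positive"), ([0.1, 1.0], "negative")]
unlabelled = [[0.9, 0.2], [0.0, 0.8]]
print(propagate_labels(labelled, unlabelled))  # ['positive', 'negative']
```

In the real tutorial these suggestions would be pushed to Argilla as pre-annotations for reviewers to accept or correct, rather than trusted blindly.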
A good example: https://haystack.deepset.ai/tutorials/27_first_rag_pipeline