I'm going to be 100% honest: from my understanding of the documentation around NeMo Curator LINK, it seems that Curator is not actually going to be useful for us at all.
From the above link, NeMo Curator can do these things:
Downloading and extracting data - Unneeded, as we already have data extraction (it also requires a dataset from NeMo, so we would have to convert our data multiple times)
Working with DocumentDataset
CPU and GPU modules with Dask - Not using Dask
Document filtering - Documents don't need to be filtered since we are not training a model
Language identification and unicode fixing - All documents are in English
GPU accelerated exact and fuzzy deduplication - Not training a model, so unneeded
GPU semantic deduplication - Not training a model, so unneeded
Synthetic data generation - Not needed
Downstream task decontamination - Not training a model, so unneeded
PII removal - Not including any PII (besides CSUSB public info), so unneeded
@ACraig7 please share your opinion on the information above, because it seems that Curator is pointless for us and our project
Contact @itsnotmik @ACraig7 if help is needed
Implement NeMo Curator into the current RAG model