ManifoldRG / NEKO

In-progress implementation of a GATO-style generalist multimodal model capable of image, text, RL, and robotics tasks
https://discord.gg/brsPnzNd8h
GNU General Public License v3.0

Dataset Reconciliation #56

Open harshsikka opened 11 months ago

harshsikka commented 11 months ago

Context:

@snat-s has made significant progress in reviewing and updating the planned multimodal dataset (combination of many datasets) for the NEKO model.

There are numerous older issues that we want to close the loop on.

Objective: Review the outstanding dataset-related issues and determine whether each is still relevant to the current dataset effort. Have the datasets been reviewed or processed in our new dataset effort? If an issue is no longer relevant, close it. If it is still relevant, update it and add it to the backlog.

Relevant issues: #1 #4 #5 #6 #11 #31 #39 #48 #49 #50 #51 #52 #53

snat-s commented 10 months ago

I think #1, #5, #6, and #11 are out of scope. For the rest, here is what I think of each one:

#31 Already checked the datasets for Flamingo and commented there.

#49 We are replicating the BabyAI dataset now.

#50 I am currently looking into creating the dataloaders for AOKVQA, VQAv2 and Conceptual Captions.

#51 I closed it. I researched the alternatives and gave context on why COYO700M is a good replacement.

#52 Added context on alternatives to M3W. For test runs I think the best idea is to use mmc4-ff-core.

#53 I think we have discussed the alternatives to MassiveText enough. But if we are thinking about creating a mini benchmark, we could use three different datasets: C4, MiniPile and OpenWebText. A discussion on what a mini dataset would look like might be a good idea.

I will talk more about a tiny dataset in issue #58.
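
As a rough illustration of the mini dataset idea above, here is a minimal sketch of mixing documents from several text sources in round-robin order up to a fixed budget. Everything in it is an assumption made for illustration: the function name `interleave_sources`, the `max_docs` budget, and the placeholder strings standing in for C4, MiniPile and OpenWebText documents. It does not load the real datasets and is not the NEKO data pipeline.

```python
from typing import Dict, Iterable, Iterator, Tuple


def interleave_sources(
    sources: Dict[str, Iterable[str]], max_docs: int
) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, document) pairs round-robin across sources.

    Stops once max_docs documents have been emitted or every source is
    exhausted. Hypothetical helper, not part of any library.
    """
    iterators = {name: iter(docs) for name, docs in sources.items()}
    emitted = 0
    while iterators and emitted < max_docs:
        # Iterate over a copy of the keys so exhausted sources can be dropped.
        for name in list(iterators):
            if emitted >= max_docs:
                break
            try:
                doc = next(iterators[name])
            except StopIteration:
                del iterators[name]
                continue
            yield name, doc
            emitted += 1


# Placeholder documents; real data would come from the actual corpora.
toy_sources = {
    "C4": ["c4 doc 0", "c4 doc 1", "c4 doc 2"],
    "MiniPile": ["minipile doc 0", "minipile doc 1"],
    "OpenWebText": ["owt doc 0", "owt doc 1", "owt doc 2"],
}

mini_dataset = list(interleave_sources(toy_sources, max_docs=6))
for source_name, document in mini_dataset:
    print(source_name, "|", document)
```

Round-robin gives each source equal weight until one runs out. Whether a mini dataset should be balanced this way or weighted by source size is one of the open questions for the discussion.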