Open lukewys opened 2 years ago
@rvencu @rom1504 We need more data for the next step. In order of priority, the data we need is:
For audio data with natural-language text descriptions, we also need:
For audio data with other labels, we need to collect new large datasets while also converting the tag labels of our current datasets to text.
The top-priority datasets are large ones whose labels are easy to turn into a text description:
(All of the following datasets have tag labels for the audio.)
The datasets we currently have whose labels need converting to text are:
We should come up with a unified way of converting tags to text. We could follow how CLIP did this (converting classification labels to natural-language text).
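To make the idea concrete, here is a minimal sketch of CLIP-style prompt templating applied to audio tags. The template strings and the function names `join_tags` and `tags_to_text` are assumptions made up for illustration; they are not an agreed format or existing project code.

```python
import random

# Hypothetical templates, loosely modeled on CLIP's "a photo of a {label}" prompts.
# The exact wording is an assumption, not an agreed convention.
TEMPLATES = [
    "The sound of {}.",
    "An audio recording of {}.",
    "This is a sound of {}.",
]


def join_tags(tags):
    """Join tags into one readable phrase, e.g. ['Dog', 'bark'] -> 'dog and bark'."""
    cleaned = [t.replace("_", " ").strip().lower() for t in tags if t.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty tag is required")
    if len(cleaned) == 1:
        return cleaned[0]
    return ", ".join(cleaned[:-1]) + " and " + cleaned[-1]


def tags_to_text(tags, template=None, rng=random):
    """Fill a template with the joined tags; pick a random template if none is given."""
    if template is None:
        template = rng.choice(TEMPLATES)
    return template.format(join_tags(tags))
```

For instance, `tags_to_text(["rain", "thunder_storm", "wind"], template=TEMPLATES[0])` gives `"The sound of rain, thunder storm and wind."`. Sampling a template at random per example is one possible way to add variety to the captions; a fixed template per dataset would work as well.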
For example, for the wesoundeffect datasets, using the file names as captions seems like a bit of a stretch (e.g. Bowling_Re-Rack_Machinery_All-Lanes-In-A-Row.wav).
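A rough sketch of what a naive file-name-to-caption conversion would produce for that file is below. It assumes that underscores separate fields and hyphens join words within a field, which is only a guess from this single file name; `filename_to_caption` is a hypothetical helper.

```python
import os
import re


def filename_to_caption(path):
    """Turn a sound-effect file name into a rough, comma-separated caption."""
    # Drop the directory and the extension.
    stem = os.path.splitext(os.path.basename(path))[0]
    # Assumption: "_" separates fields, "-" joins words inside a field.
    fields = [f.replace("-", " ").strip() for f in stem.split("_")]
    # Collapse repeated whitespace and skip empty fields.
    fields = [re.sub(r"\s+", " ", f) for f in fields if f]
    return ", ".join(fields).lower()
```

On the file above this returns `"bowling, re rack, machinery, all lanes in a row"`, which shows the problem: the result is a keyword list rather than a sentence, and genuinely hyphenated words such as "Re-Rack" get split.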