Closed: zachgk closed this issue 2 years ago
Hi, I'm interested in this issue and I want to fix it. Can you assign it to me? Thanks!
Yeah. I made an extra note in the description that this one is a bit more work than the Penn Treebank you did previously. Right now, none of the built-in datasets use audio, so you will need to add a conversion between audio and NDArrays to implement the dataset. If you look at the references, they have examples of doing this conversion with DJL, so they should be very helpful to you. Let me know if you have any questions or get stuck anywhere.
@zachgk Hello! @AKAGIwyf and I are working on this issue now. We are encountering some problems and need your help.
Since audio datasets usually contain different formats of audio data (wav, flac, mp3, etc.), we have to use ffmpeg to transform them into float arrays. In AIAS, they import the whole `javacv` module directly into the project, but we think `javacv` is too big to pull into `djl.basicdataset`. Can we add `javacv` as a new extension, or would that duplicate the existing `djl.opencv` extension? Thanks!
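For concreteness, the conversion we have in mind looks roughly like this. It is only a minimal sketch, assuming javacv's `FFmpegFrameGrabber` and DJL's `NDManager`; the `AudioUtils` class and its `toNDArray` helper are hypothetical names, not existing API:

```java
import java.io.IOException;
import java.nio.Buffer;
import java.nio.FloatBuffer;
import java.nio.ShortBuffer;
import java.util.ArrayList;
import java.util.List;

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import org.bytedeco.javacv.FFmpegFrameGrabber;
import org.bytedeco.javacv.Frame;

/** Hypothetical helper that decodes an audio file into an NDArray of PCM samples. */
public final class AudioUtils {

    private AudioUtils() {}

    public static NDArray toNDArray(NDManager manager, String path) throws IOException {
        List<Float> samples = new ArrayList<>();
        FFmpegFrameGrabber grabber = new FFmpegFrameGrabber(path);
        try {
            grabber.start();
            Frame frame;
            while ((frame = grabber.grabSamples()) != null) {
                // The buffer type depends on the decoded sample format;
                // 16-bit PCM (ShortBuffer) and float PCM are the common cases.
                Buffer buf = frame.samples[0];
                if (buf instanceof ShortBuffer) {
                    ShortBuffer sb = (ShortBuffer) buf;
                    while (sb.hasRemaining()) {
                        samples.add(sb.get() / 32768f); // normalize to [-1, 1]
                    }
                } else if (buf instanceof FloatBuffer) {
                    FloatBuffer fb = (FloatBuffer) buf;
                    while (fb.hasRemaining()) {
                        samples.add(fb.get());
                    }
                }
            }
        } finally {
            grabber.release();
        }
        float[] data = new float[samples.size()];
        for (int i = 0; i < data.length; i++) {
            data[i] = samples.get(i);
        }
        return manager.create(data);
    }
}
```

Usage would be something like `NDArray audio = AudioUtils.toNDArray(manager, "sample.flac");`.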
It sounds like what you need from `javacv` isn't an extension, but a dependency. For example, there is nothing stopping users from using `javacv` with DJL. They are both Java libraries and users can import both.

On the other hand, a `javacv` extension wouldn't help much. If `djl.basicdataset` depends on `djl.javacv` and `djl.javacv` depends on `javacv`, then `djl.basicdataset` transitively depends on `javacv`. This pulls the same big dependency into a user's project just as a direct dependency would. In the `djl.opencv` case, the extension is really about the automatic integration of DJL with OpenCV through the `ImageFactory` class.

Instead, it might be better not to put your dataset in `basicdataset`. You could create a new `djl.audio` extension to hold the dataset. Then users will only need the `javacv` dependency if they use `djl.audio`, not if they use `djl.basicdataset`.
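To make that concrete, here is a rough sketch of what such a dataset inside a `djl.audio` extension could look like. The class name, fields, and builder are hypothetical, and it reuses the hypothetical `AudioUtils.toNDArray` helper sketched earlier in this thread:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.ndarray.NDManager;
import ai.djl.training.dataset.RandomAccessDataset;
import ai.djl.training.dataset.Record;
import ai.djl.util.Progress;

/** Hypothetical skeleton of a speech recognition dataset in a djl.audio extension. */
public class SpeechRecognitionDataset extends RandomAccessDataset {

    private List<Path> audioFiles = new ArrayList<>();    // one audio clip per sample
    private List<String> transcripts = new ArrayList<>(); // matching text labels

    protected SpeechRecognitionDataset(Builder builder) {
        super(builder);
    }

    /** {@inheritDoc} */
    @Override
    public Record get(NDManager manager, long index) throws IOException {
        // Decode the clip into a float NDArray (see the conversion sketch above)
        NDArray audio = AudioUtils.toNDArray(manager, audioFiles.get((int) index).toString());
        // A real dataset would also encode transcripts.get((int) index) into an NDArray,
        // e.g. through a vocabulary; the label is left empty in this sketch
        return new Record(new NDList(audio), new NDList());
    }

    /** {@inheritDoc} */
    @Override
    protected long availableSize() {
        return audioFiles.size();
    }

    /** {@inheritDoc} */
    @Override
    public void prepare(Progress progress) throws IOException {
        // Download/extract the dataset archive here and fill audioFiles and transcripts
    }

    /** Builder following the usual basicdataset pattern. */
    public static final class Builder extends BaseBuilder<Builder> {

        /** {@inheritDoc} */
        @Override
        protected Builder self() {
            return this;
        }

        public SpeechRecognitionDataset build() {
            return new SpeechRecognitionDataset(this);
        }
    }
}
```

A user would then construct it with something like `new SpeechRecognitionDataset.Builder().setSampling(32, true).build()` and call `prepare()` before training.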
Description
Speech recognition is a task that converts an audio sequence into the text transcript of the words in the audio. It can be used for transcribing online videos and phone calls, for text dictation, and for controlling voice devices like Alexa. This issue is to add a first speech recognition dataset to DJL's basicdataset.
Note that this requires adding additional support for converting between audio and NDArrays. The references contain examples from DJL projects that already implement this kind of conversion.
This is a task that DJL users may be interested in training, and supporting it also helps expand the DJL API into more audio use cases.
References