Closed by huangruizhe 2 years ago
@huangruizhe are you also planning on grabbing newer data with this replication?
Hi, thank you for pointing out the issue. It has been a while since we finished this work, so I need some time to check the data. I will let you know if there is anything wrong with the released data.
@firmai Our main interests are in automatic speech recognition. For the moment, we will probably work with existing datasets. @GeminiLn Thanks in advance!!
Hi, I checked the data and noticed that the data on our side is correct. But the text sequences in the released data are mismatched for some reason. I'll update the text data within one or two days. The audio sequences are correct.
Thank you again for pointing out the problem! @huangruizhe
@huangruizhe Sorry for the delay. I experienced some connection problems last week. The new dataset is available now. Please find the link in the README file. The audio and text should all match now.
Thanks @GeminiLn ! I will definitely check it out. Thanks again for the update and the work!
Hi @GeminiLn, I have checked out the new dataset, but something may have gone wrong. After downloading all zip files and unzipping them (which takes a long time), I get only 53 calls instead of 575 as before.
Could you suggest what may have gone wrong?
Unzipping produced a lot of errors, e.g.
...
file #82663: bad zipfile offset (lseek): 1046069248
file #82664: bad zipfile offset (lseek): 1046151168
file #82665: bad zipfile offset (lseek): 1046323200
file #82666: bad zipfile offset (lseek): 1046364160
file #82667: bad zipfile offset (lseek): 1046429696
file #82668: bad zipfile offset (lseek): 1046462464
file #82669: bad zipfile offset (lseek): 1046544384
file #82670: bad zipfile offset (lseek): 1046683648
file #82671: bad zipfile offset (lseek): 1046740992
file #82672: bad zipfile offset (lseek): 1046773760
file #82673: bad zipfile offset (lseek): 1046798336
file #82674: bad zipfile offset (lseek): 1046953984
file #82675: bad zipfile offset (lseek): 1047076864
file #82676: bad zipfile offset (lseek): 1047150592
file #82677: bad zipfile offset (lseek): 1047265280
file #82678: bad zipfile offset (lseek): 1047347200
file #82679: bad zipfile offset (lseek): 1047412736
file #82680: bad zipfile offset (lseek): 1047494656
I ran this command in the directory shown in the screenshot: unzip -qq ACL19_Release.zip
Just FYI.
I guess it might be an unzip issue with multi-part archives: https://support.firmex.com/hc/en-us/articles/204579673-Downloads-with-multiple-parts-z01-and-z02-files-#1-download-all-the-parts-to-the-same-folder-on-your-computer-0-1
You might have worked on Windows and used WinZip or WinRAR to compress the files. I may need to use the same software to uncompress them, instead of the "unzip" command on Linux. I will look into this.
I happen to have a Windows machine, so I could unzip the dataset with WinZip successfully. It would be great if the dataset could be released in another format that is compatible with Linux (-- just a small suggestion).
Hi @huangruizhe , thank you for the feedback. Actually, the file processing and compression were done on a Linux machine. I'll look into it to see if anything went wrong during compression. Are you able to access the full dataset with WinZip?
Hi, @huangruizhe . I tested it on my Linux machine.
You might need to use: zip -s0 ACL19_Release.zip --out ACL19_Release_All.zip
to merge the files.
Then use: unzip -q ACL19_Release_All.zip
to unzip the dataset.
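Before running unzip directly, it can help to check whether the archive is one piece of a split set, since that is what triggers the "bad zipfile offset" errors above. The following is a minimal sketch, assuming the parts follow the usual split-zip naming (name.z01, name.z02, ... next to name.zip). The helper names find_split_parts and merge_command are hypothetical and only illustrate the idea; merge_command just builds the zip -s0 command quoted above and does not run it.

```python
import re
from pathlib import Path


def find_split_parts(zip_path):
    """Return sibling part files (name.z01, name.z02, ...) of a .zip archive.

    Assumes the conventional split-zip naming; an empty list suggests the
    archive is not split and can be passed to unzip as-is.
    """
    zip_path = Path(zip_path)
    pattern = re.compile(re.escape(zip_path.stem) + r"\.z(\d+)$", re.IGNORECASE)
    parts = []
    for candidate in zip_path.parent.iterdir():
        match = pattern.match(candidate.name)
        if candidate.is_file() and match:
            parts.append((int(match.group(1)), candidate))
    # Sort by part number so z02 comes before z10.
    return [path for _, path in sorted(parts)]


def merge_command(zip_path, merged_name):
    """Build the merge command from this thread as an argument list."""
    return ["zip", "-s0", str(zip_path), "--out", str(merged_name)]
```

If find_split_parts("ACL19_Release.zip") returns a non-empty list, the list from merge_command("ACL19_Release.zip", "ACL19_Release_All.zip") can be handed to subprocess.run before unzipping the merged file.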
Sorry for the inconvenience. I will add instructions to the README file. The new dataset is split into parts because some researchers in mainland China have trouble downloading large files from Google Drive.
When I switched to WinZip, it went okay -- so I assumed that this was how the data was prepared. I finally got 572 earnings calls and 89722 audio files in total. Is that correct? Thanks for the commands for merging and unzipping the files. They will be useful for others who are interested in the dataset.
My guess is that WinZip merges the split files automatically, but the unzip command on Linux does not. And yes, you got the correct dataset. I hope it will be helpful for your research.
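For anyone who wants to verify an extraction against the totals mentioned above (572 calls, 89722 audio files), here is a rough sketch. It assumes that each earnings call is one subdirectory of the extracted folder and that audio files end in .mp3 or .wav; the layout, the suffix set, and the function name count_calls_and_audio are assumptions made for illustration, not something confirmed in this thread.

```python
from pathlib import Path

# Assumption: audio files can be recognised by these suffixes.
AUDIO_SUFFIXES = {".mp3", ".wav"}


def count_calls_and_audio(root):
    """Count call directories under root and the audio files inside them.

    Assumes one subdirectory per earnings call; audio files may sit at any
    depth below the call directory.
    """
    root = Path(root)
    call_dirs = [d for d in root.iterdir() if d.is_dir()]
    n_audio = sum(
        1
        for call_dir in call_dirs
        for f in call_dir.rglob("*")
        if f.is_file() and f.suffix.lower() in AUDIO_SUFFIXES
    )
    return len(call_dirs), n_audio
```

Calling count_calls_and_audio on the extracted folder returns a (calls, audio files) pair. A pair far below the totals above, such as the 53 calls reported earlier in this thread, points to an incomplete extraction.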
Thank you!
Hi, thanks for your efforts in providing the datasets online. It's a lot of work. When we were working with this dataset, we found some issues with the mapping between recording and text.
According to the description:
But in fact, we found this is not the case. Among the 575 earnings conference calls, half have a transcription whose length does not equal the number of audio files. This makes it hard to recover the correct audio-text mapping. I am raising the issue here so that people are aware of it, and I hope we can discuss how to solve it.
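The mismatch described above can be located with a short script. This is only a sketch under stated assumptions: each call is one subdirectory, the transcript is a text file with one sequence per non-empty line, and each sequence should have exactly one audio file. The default file name Text.txt, the .mp3 suffix, and the function name find_mismatched_calls are hypothetical and should be adjusted to the actual release.

```python
from pathlib import Path


def find_mismatched_calls(root, transcript_name="Text.txt", audio_suffix=".mp3"):
    """Return {call name: (n_text_lines, n_audio_files)} for calls whose counts differ.

    Assumes one transcript file per call directory with one sequence per
    non-empty line; a missing transcript counts as zero lines.
    """
    mismatched = {}
    for call_dir in sorted(Path(root).iterdir()):
        if not call_dir.is_dir():
            continue
        transcript = call_dir / transcript_name
        n_text = 0
        if transcript.is_file():
            with transcript.open(encoding="utf-8", errors="replace") as fh:
                n_text = sum(1 for line in fh if line.strip())
        n_audio = sum(
            1
            for f in call_dir.rglob("*")
            if f.is_file() and f.suffix.lower() == audio_suffix
        )
        if n_text != n_audio:
            mismatched[call_dir.name] = (n_text, n_audio)
    return mismatched
```

The returned dictionary lists only the affected calls, which makes it easy to report how many calls are mismatched and by how much.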