Inquiring about the SpokenSTS database

lingjzhu commented 2 years ago

Hi! I really enjoyed reading your Interspeech paper "Semantic sentence similarity: size does not always matter"! The idea is really novel and the results are impressive! May I ask a question about the paper? In the paper, you mention that "All synthetic and natural utterances are made publicly available in .wav format as the SpokenSTS database". It seems that the data is not hosted in this GitHub repo. Are there plans to release the data in the near future? Thanks in advance. Looking forward to hearing back from you!

DannyMerkx commented 2 years ago

Thanks for your interest in our work! I'm sorry the data isn't properly hosted yet. I've been on sick leave for quite a while and haven't gotten around to finishing this. If the paper is getting attention I should fix this as soon as possible, so thanks for reminding me. I'll let you know when and where it is hosted! Regards, Danny

lingjzhu commented 2 years ago

I am sorry to hear about the sick leave. I hope you have a speedy recovery. Your paper is great and I will definitely cite it in my subsequent work! Thank you!

All the best, Jian

DannyMerkx commented 2 years ago

Hello Jian,

The dataset is currently being approved here for upload to our university's data archive. If you are still interested and want the data in advance, I can try to send it through some other sharing service; just let me know.

regards,

Danny Merkx PhD Candidate Centre for Language Studies Radboud University Nijmegen +31 24-3611461

lingjzhu commented 2 years ago

Hi Danny,

I really appreciate that! I am also working on semantic representations so having access to the data will be of great help to my research. I will only use the data for research and cite your paper for sure. My email is lingjzhu@umich.edu. Thank you!

Jian

ankitapasad commented 2 years ago

Hi @DannyMerkx

Thank you for collecting and synthesizing the spoken version of the STS database!

I am looking forward to using it for my research but I haven't been able to download it from the DANS interface. Selecting the directories and then clicking on "download" does not download any files to my system. I am not sure what I am doing wrong.

Can you please help me with the access?

Thanks!

lingjzhu commented 2 years ago

Hi @DannyMerkx, I can confirm what @ankitapasad experienced. I was able to download the text documents after some delay. However, downloading the audio folders does not seem to work. I would really appreciate it if you could help with this. Thank you regardless! Jian

DannyMerkx commented 2 years ago

I am so sorry this has happened. I've raised the issues with this dataset several times with DANS, and apparently it is still not in order. I will contact them again and tell them the data is not properly available.

Regards,

Danny Merkx

lingjzhu commented 2 years ago

Thank you, Danny! After some trial and error, I was able to download the natural speech just now by downloading individual subfolders, because DANS limits the total number of files per request to 400. But for the synthetic speech, the files in each subfolder exceed that limit. I was wondering if uploading compressed .zip files could circumvent this limitation and make downloading easier and faster. Given that you have contacted DANS, maybe they have a better solution :)
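
For anyone hitting the same limit, here is a minimal sketch of the batching workaround described above: split a folder's file list into groups of at most 400 before selecting them. The 400 figure is the one reported in this comment and has not been checked against DANS documentation. The `batch_files` helper and the file names are hypothetical, and nothing here talks to DANS itself.

```python
from typing import Iterator, List, Sequence

# Per-request limit as reported in this thread (an assumption, not a documented value).
MAX_FILES_PER_REQUEST = 400


def batch_files(paths: Sequence[str], limit: int = MAX_FILES_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each holding at most `limit` entries."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(paths), limit):
        yield list(paths[start:start + limit])


# Hypothetical folder of 1,000 synthetic utterances (file names are made up).
example = [f"synthetic/utt_{i:04d}.wav" for i in range(1000)]
batches = list(batch_files(example))
print([len(b) for b in batches])  # [400, 400, 200]
```

Each batch could then be requested separately, which is the same idea as downloading subfolder by subfolder.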

DannyMerkx commented 2 years ago

I see that now too; this is very inconvenient for such a large database. I'll find a quick way to host the data myself, possibly over Dropbox, and contact the university to get this in order. I no longer work there, and so no longer own the data, so I have to pass this information on to the data steward who is now in charge of this database.

DannyMerkx commented 2 years ago

The university notified DANS, and they are picking up both issues. They're hurrying to make the data available and will upload it to a place where it can be downloaded as a whole, but only upon request. In the meantime I will pick up a hard copy next week and host it myself on Dropbox for a while. Sorry for the inconvenience. You'll get the data next week. Thanks again for your interest in my work!

Regards,

Danny

ankitapasad commented 2 years ago

Hi Danny,

I appreciate your responsiveness! Thank you for keeping us in the loop.

Best, Ankita

lingjzhu commented 2 years ago

Hi Danny, Thank you so much. I am really sorry to hear that this may also cause you some inconvenience. Your work is great! I really enjoyed reading your paper and hope to do follow-up research on your work! Best, Jian

DannyMerkx commented 2 years ago

Thanks for your patience. I'm working on a self-hosted version of the database, but DANS has also worked on an alternative. The DANS page has been updated and says you can mail @.*** for access to the database, after which you should receive a download link. I assume this will finally work, and I will post further updates when my own version is up.

Regards,

Danny

lingjzhu commented 2 years ago

Thank you Danny!

ankitapasad commented 2 years ago

Thanks a lot, Danny!

It took them a while to process my request at first and I had to open a duplicate ticket to get their attention. They got back to me within 24 hours on the duplicate ticket, so the initial inaction was likely a one-off incident. I am downloading the dataset now :)

lingjzhu commented 2 years ago

> Thanks a lot, Danny!
>
> It took them a while to process my request at first and I had to open a duplicate ticket to get their attention. They got back to me within 24 hours on the duplicate ticket, so the initial inaction was likely a one-off incident. I am downloading the dataset now :)

I have had exactly the same experience. The only difference is that they still haven't replied to my duplicate ticket. But given your experience, I think the download link will be out soon!

DannyMerkx commented 2 years ago

Thanks for the feedback. It's good to know the DANS alternative works, even though we're dependent on a human response to requests. We're working on an alternative through the Open Science Framework as well. Looking forward to seeing both your results.

Regards,

Danny

juice500ml commented 4 months ago

Dear @DannyMerkx, FYI, downloading from DANS seemed quite slow, so I pushed a mirrored version (only the human recordings) to Hugging Face Datasets: https://huggingface.co/datasets/juice500/spoken_sts The license was CC BY 4.0, so redistribution should be okay, but I wanted to let you know nevertheless.

DannyMerkx commented 4 months ago

Hey @juice500ml, I've left academia (for now) and haven't been keeping tabs on my old projects since. Sorry to hear the DANS distribution is not an easy way to get the data. It is nice to hear the data is still helpful and interesting for people. You are indeed free to use and share it through channels that keep it free and open ^^
