activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Mozilla Public License 2.0

[BUG] Using TIMIT dataset seems to violate dataset licensing agreement #2863

Closed by juice500ml 1 month ago

juice500ml commented 1 month ago

Severity

P1 - Urgent, but non-breaking

Current Behavior

The TIMIT dataset is part of the Linguistic Data Consortium (LDC) catalog, and its license seems to be governed by the LDC Non-member agreement, which explicitly states that User shall have no right to copy, redistribute, transmit. The timit-train and timit-test datasets are possibly in breach of this licensing agreement.

Steps to Reproduce

https://datasets.activeloop.ai/docs/ml/datasets/timit-dataset/

Expected/Desired Behavior

Potentially remove TIMIT.

mikayelh commented 1 month ago

Hey @juice500ml, thanks a lot for bringing this issue up. My reading of the user agreement is a bit different: 'for other purposes' may mean 'non-research' or commercial purposes, which this is not. Besides, the dataset has been available for a while now via Deep Lake and Hugging Face, so I'm not sure it's an issue.

Are you by any chance with LDC? We could potentially show a user agreement before a user accesses the dataset, which is common in such cases.

juice500ml commented 1 month ago

Dear @mikayelh, thanks a lot for the quick response! To my understanding, the following clause is what makes redistributing the data a problem:

"Unless explicitly permitted herein, User shall not otherwise publish, retransmit, disclose, display, copy, reproduce or redistribute the LDC Databases to others outside of User’s Research Group."

In this case, I think an everyday user of your project would be well outside the definition of "User's Research Group", though I might be wrong. Also, I believe the Hugging Face distribution you mentioned is this one, right? https://huggingface.co/datasets/timit_asr In that case, one has to download the data from LDC manually.

My affiliation (CMU) is part of LDC, but I'm not exactly with LDC, so I won't be able to answer those kinds of questions :( Actually, I was also looking for ways to include LDC data within a public project, and that's how I stumbled upon this project.

mikayelh commented 1 month ago

Got it, @juice500ml! I'll reach out to the contact listed on their website, and we will take down the dataset or add the user agreement if they so desire. I'm not sure about this specific dataset, but a large share of the datasets, including this one, were added long ago, and we filtered out the ones that were restrictive, so unless this agreement was introduced later on, it shouldn't be an issue.

In your specific case, though, it seems like you'd be able to use the dataset via Deep Lake without an issue.

Thanks again for letting us know!

juice500ml commented 1 month ago

I see, hope everything works out! Thanks a lot!