esbatmop
/
MNBVC
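MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also various niche subcultures, and even data written in Martian script (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and so on.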
MIT License
3.36k
stars
231
forks
issues
Commercial-use license
#59
wangchunlin
closed
1 month ago
0
test
#58
Persiefxy
closed
1 month ago
0
Errors in the wikipedia and code_metadata data
#57
fitexmage
closed
2 months ago
2
Errors in the crawler_oscar and wikipedia data
#56
fitexmage
closed
4 months ago
2
Error when decompressing the data
#55
fuyao2006
closed
5 months ago
1
fix(README.md): typo
#54
Tiphereth-A
closed
5 months ago
0
Would you consider distributing the data via IPFS?
#53
POFK
opened
6 months ago
3
Can the dataset be further processed and then re-released as open source?
#52
charent
closed
7 months ago
1
How large is the data on huggingface now, and how much disk space is needed at minimum to download it?
#51
zonggit
closed
7 months ago
1
Add BT torrent as a sharing option
#50
jiyun
opened
9 months ago
0
Files downloaded from Baidu Netdisk require a password to decompress
#49
ZeyuTeng96
closed
9 months ago
2
Junk web page detection in the text-cleaning (洗稿) tool
#48
chenhehong
closed
9 months ago
1
huggingface is blocked by the firewall; would you consider also uploading a copy of the data to the modelscope platform?
#47
MrZixi
opened
9 months ago
4
wikipedia JSONDecodeError
#46
TristanMeng
closed
4 months ago
1
Question about data richness
#45
GDUTT1
closed
9 months ago
1
Loading huggingface data from a local download raises an error
#44
amanyara
closed
9 months ago
1
Add torrent and magnet-link sharing, with instructions
#43
jiyun
closed
9 months ago
0
Some problems with the oscar corpus
#42
chinoll
closed
9 months ago
1
Inconsistent encoding in the files uploaded to huggingface
#41
chinoll
closed
7 months ago
1
Has using S3 for storage and downloads been considered for data distribution?
#40
chinoll
opened
10 months ago
1
Problem with the 20230126.zip archive
#39
fuyao2006
closed
11 months ago
2
Is the data on 威力 the same as the data on 抱脸 (Hugging Face)?
#38
Gierry
opened
11 months ago
5
Congratulations on the corpus doubling in size!
#37
Triang-jyed-driung
opened
11 months ago
2
huggingface / Baidu Netdisk
#36
zachluo
closed
12 months ago
2
Version history for the archives?
#35
shenck0
closed
12 months ago
3
Has 10G been cleaned so far?
#33
Sweetclover
closed
1 year ago
2
Data volume
#32
guozhiyao
closed
1 year ago
1
Do you need compute support?
#31
litetoooooom
opened
1 year ago
1
Progress of the huggingface data
#30
guozhiyao
opened
1 year ago
3
Use Aliyun Drive or another cloud drive?
#29
hijkzzz
closed
1 year ago
2
A small suggestion
#28
LlinWing
closed
1 year ago
1
How to support the project
#27
YuqiHUO
closed
1 year ago
1
The Baidu Netdisk link no longer opens
#26
geniuslinchao
closed
1 year ago
4
Update README.md
#25
washing1127
closed
1 year ago
0
Difficulty uploading files to the website
#24
Triang-jyed-driung
opened
1 year ago
6
Data cleaning tools
#23
guozhiyao
closed
1 year ago
1
Data cleaning tools
#22
guozhiyao
opened
1 year ago
1
One person travels fast; many travel far
#21
kingfs
closed
1 year ago
3
Character encodings that do not display correctly
#20
LlinWing
opened
1 year ago
4
Does MNBVC overlap with Common Crawl?
#19
CoinCheung
closed
1 year ago
3
Reporting a data problem I observed
#18
gycg
opened
1 year ago
2
huggingface dataset
#17
lwmlyy
closed
1 year ago
4
Baidu Netdisk extraction code
#16
zhangyue2000512
closed
1 year ago
1
Is a password required to decompress?
#15
echo840
closed
1 year ago
0
Is a password required to decompress?
#14
echo840
closed
1 year ago
1
The jsonl.gz on hugging face cannot be decompressed
#13
guozhiyao
closed
1 year ago
1
How should the json files be processed, given that each json file has a slightly different format?
#12
Mddct
closed
1 year ago
1
How can downloaded content be verified?
#11
PussyCat0700
closed
1 year ago
2
Thanks for sharing! After downloading, I found that every file inside the archive requires a password. What is it?
#10
BohanLi0110
closed
1 year ago
1
Requesting a torrent file link; currently only Baidu Netdisk is available
#9
hejunqing
closed
1 year ago
3