bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset hal_archives_ouvertes #225

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago

cakiki commented 2 years ago

self-assign

cakiki commented 2 years ago

Metadata: https://huggingface.co/datasets/bigscience-catalogue-data/hal_archives_ouvertes/blob/main/hal_archives_ouvertes_metadata.jsonl.gz

Metadata sample:

{'openAccess_bool': True,
 'domainAllCode_s': ['scco.psyc'],
 'en_title_s': ['Self-motion and the perception of stationary objects'],
 'title_s': ['Self-motion and the perception of stationary objects'],
 'abstract_s': ["One of the ways we perceive shape is through seeing motion. Visual motion may be actively generated (for example, in locomotion), or passively observed. In the study of the perception of 3D structure from motion (SfM), the non-moving, passive observer in an environment of moving rigid objects has been used as a substitute for an active observer moving in an environment of stationary objects; the 'rigidity hypothesis' has played a central role in computational and experimental studies of SfM. Here we demonstrate that this substitution is not fully adequate, because active observers perceive 3D structure differently from passive observers, despite experiencing the same visual stimulus: active observers' perception of 3D structure depends on extraretinal self-motion information. Moreover, the visual system, making use of the self-motion information treats objects that are stationary (in an allocentric, earth-fixed reference frame) differently from objects that are merely rigid. These results show that action plays a central role in depth perception, and argue for a revision of the rigidity hypothesis to incorporate the special case of stationary objects."],
 'journalTitle_s': 'Nature',
 'journalIssn_s': '0028-0836',
 'journalEissn_s': '1476-4679',
 'authLastName_s': ['Wexler', 'Panerai', 'Lamouret', 'Droulez'],
 'authFirstName_s': ['Mark', 'Francesco', 'Ivan', 'Jacques'],
 'language_s': 'en',
 'halId_s': 'hal-00000019',
 'uri_s': 'https://hal.archives-ouvertes.fr/hal-00000019',
 'docType_s': 'ART',
 'publicationDate_tdate': '2001-01-01T00:00:00Z',
 'fileMain_s': 'https://hal.archives-ouvertes.fr/hal-00000019/document',
 'files_s': ['https://hal.archives-ouvertes.fr/hal-00000019/file/nature.pdf']}
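For anyone who wants to poke at the metadata locally, here is a minimal sketch of reading it. It assumes the file has been downloaded beforehand and that each line is one JSON object with the same fields as the sample above; the function names and the French/open-access filter are only illustrative, not something already in this repo.

```python
import gzip
import json
from typing import Iterable, Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one metadata record per non-empty line of a gzipped JSON Lines file."""
    # Assumption: one JSON object per line, as the .jsonl.gz extension suggests.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def french_open_access_urls(records: Iterable[dict]) -> Iterator[str]:
    """Yield fileMain_s for records that are open access and tagged 'fr'."""
    # Hypothetical helper: the selection criteria are an assumption for illustration.
    for record in records:
        if (
            record.get("openAccess_bool")
            and record.get("language_s") == "fr"
            and record.get("fileMain_s")
        ):
            yield record["fileMain_s"]
```

Usage would look like `urls = list(french_open_access_urls(iter_records("hal_archives_ouvertes_metadata.jsonl.gz")))`, where the local path is whatever location the file was saved to.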

cakiki commented 2 years ago

language files
en 614,053
fr ✔️ 402,232
undetermined 54,033
es 6,067
it 2,024
pt 1,794
de 1,408
ru 557
eu 213
uk 205
zh 201
ja 130
ar 109
pl 109
el 106
hy 95
cs 93
ro 67
oc 56
ca 55
da 54
mr 39
tr 38
vi 34
ko 34
sq 33
nl 33
bg 28
br 21
fa 21
eo 20
id 16
mg 15
hu 15
sv 10
te 9
hr 8
fi 8
no 8
sr 7
he 7
et 7
qu 7
sk 6
lt 6
hi 5
la 5
ms 4
sw 4
ta 3
kk 3
gl 3
co 2
tl 2
mn 2
az 2
ne 2
so 2
mk 2
iu 2
sl 2
be 2
th 2
fl 1
km 1
gn 1
ie 1
bm 1
is 1
ba 1
se 1
bs 1
fo 1
af 1
tk 1
lv 1
sa 1
zu 1
bo 1
0 1
ur 1
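
A tally like the one above can be reproduced with a few lines of Python. This is only a sketch under assumptions: it takes any iterable of record dicts with a `language_s` field, and it maps a missing field to `undetermined`, which may not be how that row was actually derived. Both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_by_language(records: Iterable[dict]) -> Counter:
    """Count records per language_s value."""
    counts: Counter = Counter()
    for record in records:
        # Assumption: records without language_s are reported as "undetermined".
        counts[str(record.get("language_s", "undetermined"))] += 1
    return counts


def format_table(counts: Counter) -> str:
    """Render the counts as 'language files' rows, largest first."""
    lines = ["language files"]
    for language, n in counts.most_common():
        # Thousands separators match the formatting used in this thread.
        lines.append(f"{language} {n:,}")
    return "\n".join(lines)
```

Calling `print(format_table(count_by_language(records)))` on the parsed metadata would print one row per language code in descending order of count.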