IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Closes #110 | IndoTacos: add dataloader #294

Closed: faridlazuarda closed this 1 year ago

faridlazuarda commented 1 year ago

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

Checkbox

tests.test_nusantara

$ python -m tests.test_nusantara nusacrowd/nusa_datasets/casa/casa.py
INFO:__main__:args: Namespace(data_dir=None, path='nusacrowd/nusa_datasets/casa/casa.py', schema=None, subset_id=None, use_auth_token=None)
INFO:__main__:self.PATH: nusacrowd/nusa_datasets/casa/casa.py
INFO:__main__:self.SUBSET_ID: casa
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module nusacrowd.nusa_datasets.casa.casa
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.ASPECT_BASED_SENTIMENT_ANALYSIS: 'ABSA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'TEXT_MULTI'}
INFO:__main__:schemas_to_check: {'TEXT_MULTI'}
INFO:__main__:Checking load_dataset with config name casa_source
Downloading and preparing dataset casa/casa_source to C:\Users\HP\.cache\huggingface\datasets\casa\casa_source\1.0.0\ead9b7eb80f972c7f6f92addc90a4ce3911a89706b006777dcb7f7361a332dd5...
Dataset casa downloaded and prepared to C:\Users\HP\.cache\huggingface\datasets\casa\casa_source\1.0.0\ead9b7eb80f972c7f6f92addc90a4ce3911a89706b006777dcb7f7361a332dd5. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 77.54it/s]
INFO:__main__:Checking load_dataset with config name casa_nusantara_text_multi
Downloading and preparing dataset casa/casa_nusantara_text_multi to C:\Users\HP\.cache\huggingface\datasets\casa\casa_nusantara_text_multi\1.0.0\ead9b7eb80f972c7f6f92addc90a4ce3911a89706b006777dcb7f7361a332dd5...
Dataset casa downloaded and prepared to C:\Users\HP\.cache\huggingface\datasets\casa\casa_nusantara_text_multi\1.0.0\ead9b7eb80f972c7f6f92addc90a4ce3911a89706b006777dcb7f7361a332dd5. Subsequent calls will reuse this data.       
100%|██████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 83.33it/s] 
WARNING:datasets.builder:Reusing dataset casa (C:\Users\HP\.cache\huggingface\datasets\casa\casa_source\1.0.0\ead9b7eb80f972c7f6f92addc90a4ce3911a89706b006777dcb7f7361a332dd5)
100%|█████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 500.06it/s] 
INFO:__main__:Dataset sample [source]
{'index': 0, 'sentence': 'Saya memakai Honda Jazz GK5 tahun 2014 ( pertama meluncur ) . Mobil nya bagus dan enak sesuai moto nya menyenangkan untuk dikendarai', 'fuel': 'neutral', 'machine': 'neutral', 'others': 'positive', 'part': 'neutral', 'price': 'neutral', 'service': 'neutral'}
WARNING:datasets.builder:Reusing dataset casa (C:\Users\HP\.cache\huggingface\datasets\casa\casa_nusantara_text_multi\1.0.0\ead9b7eb80f972c7f6f92addc90a4ce3911a89706b006777dcb7f7361a332dd5)
100%|█████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 500.14it/s]
INFO:__main__:Dataset sample [nusantara_text_multi]
{'id': '0', 'text': 'Saya memakai Honda Jazz GK5 tahun 2014 ( pertama meluncur ) . Mobil nya bagus dan enak sesuai moto nya menyenangkan untuk dikendarai', 'labels': [1, 1, 0, 1, 1, 1]}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 90 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 810
text: 810
labels: 4860

test
==========
id: 180
text: 180
labels: 1080

validation
==========
id: 90
text: 90
labels: 540

.
----------------------------------------------------------------------
Ran 1 test in 2.544s

OK
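As a quick sanity check on the schema statistics above, the `labels` count in each split should equal the row count times the number of aspects, since the `casa` sample carries one label per aspect (six in total). The sketch below only redoes that arithmetic with the numbers printed in the log; the helper name `expected_label_count` is made up for illustration and is not part of the test suite.

```python
# Aspect names taken from the casa source sample shown in the log above.
ASPECTS = ["fuel", "machine", "others", "part", "price", "service"]

# Per-split row counts (the `id` / `text` lines) reported by the schema statistics.
ROWS = {"train": 810, "test": 180, "validation": 90}


def expected_label_count(n_rows: int, n_aspects: int = len(ASPECTS)) -> int:
    """Hypothetical helper: one label per aspect for every row."""
    return n_rows * n_aspects


label_counts = {split: expected_label_count(n) for split, n in ROWS.items()}
print(label_counts)
# {'train': 4860, 'test': 1080, 'validation': 540}
```

These values match the `labels` lines reported for the train, test, and validation splits.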

Load Dataset

>>> data = load_dataset("nusacrowd/nusa_datasets/indotacos/indotacos.py", name="indotacos_source")
Reusing dataset indo_tacos (C:\Users\HP\.cache\huggingface\datasets\indo_tacos\indotacos_source\1.0.0\813241f6055579fbb3ff721bebd2ac5915c1033a127e16002daddb6d1cd2a4a5)
100%|██████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 27.03it/s]
>>> data["train"].to_pandas()
                                                    text  ...                                     pokok_sengketa
0      RISALAH Putusan Pengadilan Pajak Nomor: Put-08...  ...  bahwa nilai sengketa dalam banding ini adalah ...  
1      RISALAH Putusan Pengadilan Pajak Nomor: Put-10...  ...  bahwa nilai sengketa terbukti dalam sengketa b...  
2      RISALAH Putusan Pengadilan Pajak Nomor: Put-08...  ...  bahwa nilai sengketa dalam banding ini adalah ...  
3      RISALAH Putusan Pengadilan Pajak Nomor: Put-08...  ...  bahwa nilai sengketa dalam banding ini adalah ...  
4      RISALAH Putusan Pengadilan Pajak Nomor: Put-08...  ...  bahwa nilai sengketa dalam sengketa banding in...  
...                                                  ...  ...                                                ...  
12286  RISALAH Putusan Pengadilan Pajak Nomor : Put-5...  ...  bahwa yang menjadi pokok sengketa adalah penga...  
12287  RISALAH Putusan Pengadilan Pajak Nomor : Put-5...  ...  bahwa yang menjadi pokok sengketa adalah penga...  
12288  RISALAH Putusan Pengadilan Pajak Nomor : Put-5...  ...  bahwa yang menjadi pokok sengketa adalah penga...  
12289  RISALAH Putusan Pengadilan Pajak Nomor : Put-5...  ...  bahwa yang menjadi pokok sengketa adalah penga...  
12290  RISALAH Putusan Pengadilan Pajak Nomor : Put-5...  ...  bahwa yang menjadi pokok sengketa adalah penga...  

[12291 rows x 6 columns]

>>> data["train"].to_pandas()
          id                                               text  label
0          0  {'text': 'RISALAH Putusan Pengadilan Pajak Nom...      0
1          1  {'text': 'RISALAH Putusan Pengadilan Pajak Nom...      1
2          2  {'text': 'RISALAH Putusan Pengadilan Pajak Nom...      1
3          3  {'text': 'RISALAH Putusan Pengadilan Pajak Nom...      1
4          4  {'text': 'RISALAH Putusan Pengadilan Pajak Nom...      1
...      ...                                                ...    ...
12286  12286  {'text': 'RISALAH Putusan Pengadilan Pajak Nom...      1
12287  12287  {'text': 'RISALAH Putusan Pengadilan Pajak Nom...      1
12288  12288  {'text': 'RISALAH Putusan Pengadilan Pajak Nom...      1
12289  12289  {'text': 'RISALAH Putusan Pengadilan Pajak Nom...      1
12290  12290  {'text': 'RISALAH Putusan Pengadilan Pajak Nom...      1
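In the second output above, the `text` column appears to hold a dict (`{'text': 'RISALAH ...`) rather than a plain string. If that nesting is unintended, a check along the following lines could catch it before review. This is a minimal sketch on a hand-written example record, not the dataloader's actual code; the helper name `unwrap_text` and the sample values are assumptions made for illustration.

```python
from typing import Any, Dict


def unwrap_text(example: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of `example` with `text` as a plain string.

    If `text` is a dict that has its own `text` key, that inner value is used;
    otherwise the example is returned unchanged (as a shallow copy).
    """
    text = example["text"]
    if isinstance(text, dict):
        text = text["text"]
    return {**example, "text": text}


# Hand-written record mimicking the nested shape seen in the output above.
nested = {"id": 0, "text": {"text": "RISALAH Putusan Pengadilan Pajak"}, "label": 0}
flat = unwrap_text(nested)

# The flattened record now carries a string in `text`; other fields are untouched.
assert isinstance(flat["text"], str)
assert flat["label"] == nested["label"]
```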

muhsatrio commented 1 year ago

/test dataset=indotacos

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3122928256

faridlazuarda commented 1 year ago

Thanks for the review! I have resolved the issues you mentioned. Please take another look, thank you!

SamuelCahyawijaya commented 1 year ago

/test dataset=indotacos

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3163036634