bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling
459 stars 115 forks source link

euadr: Document ID and order dependent on operating system #746

Closed melvelet closed 1 year ago

melvelet commented 1 year ago

Describe the bug

In euadr (and potentially other datasets), the ID, document ID and order are dependent on operating system. Individual files of the original are read to create the BigBIO dataset and the loading order is most likely dependent on the operating system.

Steps to reproduce the bug

# Sample code to reproduce the bug
from bigbio.dataloader import BigBioConfigHelpers
conhelps = BigBioConfigHelpers()
dataset_name = 'euadr_bigbio_kb'
dataset = conhelps.for_config_name(dataset_name).load_dataset()
print(dataset['train'][0])

Expected results

In unix-based system: {'id': '0', 'document_id': '0', 'passages': [{'id': '1', 'type': 'title', 'text': ['Clinicopathological evaluation of immunohistochemical Ki-67 and endothelial nitric oxide synthase expression in intracranial ependymoma.'], ...

Actual results

In Windows 10: {'id': '0', 'document_id': '0', 'passages': [{'id': '1', 'type': 'title', 'text': ['A sequential procedure for monitoring clinical trials against historical controls.'], ...

Environment info

albertvillanova commented 1 year ago

Thanks for reporting, @melvelet.

The reason for this bug is that this script (and potentially others as well) uses os.listdir to iterate over the filenames:

for filename in os.listdir(folder_path)

And the Python documentation states about os.listdir: https://docs.python.org/3/library/os.html#os.listdir

Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order...

That means the order of the files returned by os.listdir is not deterministoc and depends on multiple factors, as the underlying operating system,...

albertvillanova commented 1 year ago

self-assign

albertvillanova commented 1 year ago

I have opened PRs on the Hub to fix this issue:

albertvillanova commented 1 year ago

Now that the PRs above have been merged, this issue can be closed as completed.

galtay commented 1 year ago

thanks for this @albertvillanova !