Closed melvelet closed 1 year ago
Thanks for reporting, @melvelet.
The reason for this bug is that this script (and potentially others as well) uses os.listdir
to iterate over the filenames:
for filename in os.listdir(folder_path)
And the Python documentation states about os.listdir
: https://docs.python.org/3/library/os.html#os.listdir
Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order...
That means the order of the files returned by os.listdir
is not deterministoc and depends on multiple factors, as the underlying operating system,...
Now that the PRs above have been merged, this issue can be closed as completed.
thanks for this @albertvillanova !
Describe the bug
In euadr (and potentially other datasets), the ID, document ID and order are dependent on operating system. Individual files of the original are read to create the BigBIO dataset and the loading order is most likely dependent on the operating system.
Steps to reproduce the bug
Expected results
In unix-based system: {'id': '0', 'document_id': '0', 'passages': [{'id': '1', 'type': 'title', 'text': ['Clinicopathological evaluation of immunohistochemical Ki-67 and endothelial nitric oxide synthase expression in intracranial ependymoma.'], ...
Actual results
In Windows 10: {'id': '0', 'document_id': '0', 'passages': [{'id': '1', 'type': 'title', 'text': ['A sequential procedure for monitoring clinical trials against historical controls.'], ...
Environment info
datasets
version: