Allegra-Cohen / grid

Order of corpus documents needs to be pinned down #64

kwalcock closed 1 year ago

kwalcock commented 1 year ago

In corpus_parser.py, documents are read from the filesystem with os.listdir(). The order in which it returns them is arbitrary: it may depend on the operating system or on non-obvious properties of the files, such as when they were created or when they were added to the directory after having been created elsewhere. This means that two people with the same collection of files (same names and content) can get different results. To make the process repeatable, the documents can be sorted, preferably by an obvious property like their names. That works well since the names have to be unique anyway, so there are no ties to break. Probably just a sort() or sorted() needs to be added.
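A minimal sketch of the proposed fix, assuming the parser only needs the list of file names in a directory. The helper name `list_documents` and the `corpus_dir` parameter are hypothetical and are not taken from corpus_parser.py.

```python
import os


def list_documents(corpus_dir):
    """Return the entries of corpus_dir sorted by name.

    os.listdir() returns entries in arbitrary order, so sorting by name
    gives every user with the same files the same sequence.
    """
    return sorted(os.listdir(corpus_dir))
```

Sorting is done on the plain strings, so the order is by code point (uppercase before lowercase), which is the same on every platform.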

kwalcock commented 1 year ago

Since the name no longer includes the extension, there can be ties (for example, two files that differ only in their extension). However, we can probably just ask people not to do that.
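If relying on users to avoid such ties turns out to be too fragile, one alternative is to detect them explicitly. The sketch below is only an illustration of that idea, not what the project does: the function name `document_names` is hypothetical, and it assumes the document name is the file name with its last extension removed.

```python
import os
from collections import Counter


def document_names(file_names):
    """Strip extensions and return the names sorted, rejecting ties.

    Raises ValueError if two files map to the same extension-less name,
    since their relative order would then be ambiguous.
    """
    stems = [os.path.splitext(name)[0] for name in file_names]
    duplicates = sorted(stem for stem, count in Counter(stems).items() if count > 1)
    if duplicates:
        raise ValueError(f"Ambiguous document names: {duplicates}")
    return sorted(stems)
```

For example, `document_names(["b.txt", "a.pdf"])` returns the names in sorted order, while `document_names(["a.txt", "a.pdf"])` raises an error instead of silently picking an order.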