In corpus_parser.py, documents are read from the filesystem with os.listdir(), which returns entries in arbitrary order. The order may depend on the operating system or on non-obvious properties of the files, such as when they were created or when they were added to the directory after having been created elsewhere. As a result, two people with the same collection of files (same names and content) can get different results. To make the process repeatable, the listing can be sorted, preferably by an obvious property such as the file name. That works well because names within a directory are unique, so there are no ties to break. Adding a sort() or sorted() call is probably all that is needed.
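A minimal sketch of the suggested fix. The helper name list_documents and its corpus_dir parameter are hypothetical, chosen for illustration only; the actual code in corpus_parser.py may be structured differently.

```python
import os


def list_documents(corpus_dir):
    """Return the entries of corpus_dir sorted by name.

    Hypothetical helper: the name and signature are assumptions,
    not taken from corpus_parser.py.
    """
    # os.listdir() gives no ordering guarantee, so sort explicitly.
    return sorted(os.listdir(corpus_dir))
```

Note that sorted() compares strings by code point, so uppercase names sort before lowercase ones. That ordering does not depend on locale or platform, which is what matters for repeatability here.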