DistrictDataLabs / baleen

An automated ingestion service for blogs to construct a corpus for NLP research.
MIT License

Export to directory other than '.' fails #95

Open agodbehere opened 6 years ago

agodbehere commented 6 years ago

Issue

Exporting to a directory other than the current one fails. For example, running bin/baleen export corpora to write into corpora/ produces an error like: [Errno 2] No such file or directory: 'corpora/corpora/cooking/5b2d180b7af8b43e439b59b0.json'

This is a path construction bug: the export directory is joined onto the path twice, which produces the unwanted second corpora/ component.
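
A minimal sketch of how the doubled directory can arise, assuming the per-category directory already contains the export root. The names buggy_path, root, and catdir here are illustrative only and are not Baleen's actual code; posixpath is used instead of os.path so the printed paths look the same on every platform.

```python
import posixpath  # POSIX flavour of os.path, for platform-independent output


def buggy_path(root, category, filename):
    """Join the root twice, mimicking the behaviour reported in this issue."""
    # Assumption: catdir maps each category to a path that already includes root.
    catdir = {category: posixpath.join(root, category)}
    # Joining root again duplicates it whenever root is a relative path.
    return posixpath.join(root, catdir[category], filename)


for root in (".", "corpora", "/tmp/corpora"):
    print(root, "->", buggy_path(root, "cooking", "post.json"))
```

With '.' the extra component is harmless, since ././cooking/post.json resolves to the same location, and with an absolute root the join discards the earlier component. Only a relative directory such as corpora ends up with a path that does not exist, which would be consistent with the failure showing up only for directories other than '.'.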

Resolution

The fix is straightforward. In version v0.3.3-85-g88d5d7c, at line 211, remove the self.root argument from the os.path.join call.

So, for the block that reads:

for post, category in tqdm(self.posts(), total=Post.objects.count(), unit="docs"):
    path = os.path.join(
        self.root, catdir[category], "{}.{}".format(post.id, self.scheme)
    )

the revision should be:

for post, category in tqdm(self.posts(), total=Post.objects.count(), unit="docs"):
    path = os.path.join(
        catdir[category], "{}.{}".format(post.id, self.scheme)
    )

This change results in the desired behavior on export.
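
As a rough way to check the corrected join outside of Baleen, the sketch below builds category directories under a relative root and writes one file into them. The function export_paths, its parameters, and the sample post ID are hypothetical stand-ins, and it assumes, as above, that each catdir entry already includes the root.

```python
import os
import tempfile


def export_paths(root, categories, posts, scheme="json"):
    """Create one directory per category under root and return the file paths."""
    # Assumption: catdir values already include root, so root is not joined again.
    catdir = {}
    for category in categories:
        catdir[category] = os.path.join(root, category)
        os.makedirs(catdir[category], exist_ok=True)
    return [
        os.path.join(catdir[category], "{}.{}".format(post_id, scheme))
        for category, post_id in posts
    ]


original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)  # work inside a scratch directory so "corpora" is relative
    try:
        for path in export_paths("corpora", ["cooking"], [("cooking", "abc123")]):
            with open(path, "w") as f:  # succeeds: the parent directory exists
                f.write("{}")
            print(path)
    finally:
        os.chdir(original_cwd)  # restore the working directory before cleanup
```

Because the root appears only once in each path, the file lands in corpora/cooking/ rather than in a nonexistent corpora/corpora/cooking/.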

bbengfort commented 6 years ago

Thanks @agodbehere for the bug report and the clear solution! You're right, there was a duplication of self.root in catdir[category]; I've implemented the change you suggested.