DistrictDataLabs / baleen

An automated ingestion service for blogs to construct a corpus for NLP research.
MIT License
86 stars 38 forks source link

Issue #89 - add a command line flag to export for sanitize level #93

Closed janetriley closed 11 months ago

janetriley commented 7 years ago
coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.5%) to 71.496% when pulling d33dcd0477bf31098c36da335ad61cfa3b676e83 on janetriley:feature-89_add_sanitize_flag_to_exporter into 88d5d7c57e7864bc50738726d5331b56c2c658df on DistrictDataLabs:develop.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.5%) to 71.496% when pulling 1815cfa1847a2e1114fb6370b1c7692b8dad04c3 on janetriley:feature-89_add_sanitize_flag_to_exporter into 88d5d7c57e7864bc50738726d5331b56c2c658df on DistrictDataLabs:develop.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.8%) to 71.863% when pulling 24b9d147186cd1b87cf12994959b2f34c0058e94 on janetriley:feature-89_add_sanitize_flag_to_exporter into 88d5d7c57e7864bc50738726d5331b56c2c658df on DistrictDataLabs:develop.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.5%) to 71.496% when pulling 892c03ce1fddfc3c1df2730c74424edfa6a45d97 on janetriley:feature-89_add_sanitize_flag_to_exporter into 88d5d7c57e7864bc50738726d5331b56c2c658df on DistrictDataLabs:develop.

janetriley commented 7 years ago

The advantage of creating all category directories at the start is it raises an exception immediately if there's an issue creating the directory, rather than throwing it in the midst of an export. That seems like a worthwhile tradeoff once you have a big corpus.

What do you think, @bbengfort , back it out?

I'm not thrilled with raising the ExportException for FileNotFound in the midst of the loop. It seems odd to single that one exception type out. I can:

What's your preference?

bbengfort commented 7 years ago

I'd say go ahead and create all the directories at the beginning - an empty directory is no big deal, and I think you're right that we don't want to throw an exception in the middle of the export.

Which brings us to the exceptions that are thrown in the middle of the export. A couple of points:

Any exception that is not of type BaleenError should be considered a fatal developer error. I think that's why the FileNotFound exception is being caught and wrapped in ExportError because any other type of IOError should simply cause the program to crash and the developers have to deal with it as a bug (which you expressed in your point about so many ways for I/O to go wrong).

At this point though Baleen should be mature enough to gracefully handle exceptions, we've been running it long enough and haven't observed too much craziness.

For now, I would say that we should catch and reraise any exception as an ExportError, since they are being handled in a specific way and require the errno and other fields.

In the future, we could collect all errors in exporter.errors and show them to the user after export (or at least classify non fatal vs fatal export exceptions)

bbengfort commented 7 years ago

Speaking of specific error handling and fatal vs. non fatal - most of that happens at the logger level.