bminixhofer / nlprule

A fast, low-resource Natural Language Processing and Text Correction library written in Rust.
Apache License 2.0

Be more responsible about network requests #84

Open ssokolow opened 2 years ago

ssokolow commented 2 years ago

I entered an invalid language code to confirm which Python exception I need to handle when the language code selected in my existing Enchant-based infrastructure isn't supported by nlprule, and I got this very surprising error message:

ValueError: HTTP status client error (404 Not Found) for url (https://github.com/bminixhofer/nlprule/releases/download/0.6.4/ef_tokenizer.bin.gz)

Personally, I consider it very irresponsible neither to warn people that a dependency will perform network requests under some circumstances nor to provide an obvious way to handle things offline.

I highly recommend you change this. For my own use, since I tend to incorporate PyO3-based stuff into my PyQt apps anyway, I think I'll probably switch to writing my own nlprule wrapper. That way I can trust that, if no network libraries show up in the Cargo.lock and the author isn't being actively malicious, what I build will work on an airgapped machine or in a networkless sandbox.

(Seriously. Sandboxes like Flatpak are becoming more and more common. Just assuming applications will have network access is not cool.)
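For anyone hitting the same error, here is a minimal sketch of treating that ValueError as "language not available" and falling back gracefully. The helper name load_or_none is hypothetical, and the loader is passed in as a callable so the sketch does not depend on any particular binding; with nlprule it would presumably be something like Tokenizer.load, but treat that as an assumption.

```python
def load_or_none(load, lang_code):
    """Call load(lang_code) and return its result.

    Returns None if the loader raises ValueError, which is the exception
    type shown in the error message above for an unknown language code.
    `load` is any callable supplied by the caller (hypothetical here).
    """
    try:
        return load(lang_code)
    except ValueError as err:
        # Report the problem instead of crashing the host application.
        print(f"grammar checking unavailable for {lang_code!r}: {err}")
        return None
```

Note that this only handles the failure after the fact. It does nothing to prevent the network request itself.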

bminixhofer commented 2 years ago

Good point. I don't quite agree on the magnitude of the issue though. I was taking inspiration from Huggingface's transformers .from_pretrained API which does basically the same thing. I do not see a big issue with network requests to a trusted URL.

You are very welcome to open a PR to check for availability offline. Otherwise I will leave this open and might get around to it at some point, but I am currently not actively developing this library so it will take time.

ssokolow commented 2 years ago

> I do not see a big issue with network requests to a trusted URL.

I don't see it as a security thing for the program so much as a point of frustration when it comes time to build your distributables. Either you find that they're incomplete (e.g. you think you're saving a complete .msi or .exe installer, only to discover that files are missing once you're on deployment and military policy prevents you from just going online to grab them), or something like flatpak-builder or the Debian/Fedora/Gentoo/etc. packaging environment errors out because, for security and reproducibility reasons, they operate as follows:

  1. Download all dependencies without executing third-party code, based on a manifest written in something like YAML or JSON or TOML or whatever precursor APT and RPM use. (i.e., like cargo fetch. Here's an example Flatpak manifest that I wrote for a feature request to package lgogdownloader on Flathub.)
  2. Run the build process inside a sandbox with no network access. (i.e. cargo build --offline in a sandbox)
  3. Install the package in a location that the application won't have write access to when run as an ordinary user... in the case of Flatpak or Snaps, also sandboxed such that the application won't have network access unless the manifest asked for it. (The --share=network under finish-args in the manifest I linked... which then shows up in an imposing "this program will be granted these permissions. Continue installing?"-style prompt similar to how browser extensions work, providing passive pressure to ask for fewer permissions.)

They have a pretty "no exceptions" attitude toward this separation of concerns, since they want to do the build themselves on their own server farm, and, for Flathub, you are the maintainer, so you can't just let someone else figure out how to work around such a road bump.
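One way a library could fit into that fetch-then-build model is to split "get the files" from "use the files", so that only the first step ever needs the network. The following is a rough sketch under stated assumptions, not nlprule's actual API: the function names are hypothetical, the file name pattern is inferred from the URL in the error message above, and the "rules" file kind is assumed by analogy with "tokenizer".

```python
from pathlib import Path

# File kinds per language. "tokenizer" matches the URL in the error message;
# "rules" is an assumption made by analogy.
KINDS = ("tokenizer", "rules")


def model_paths(lang, model_dir):
    """Paths where the files for `lang` are expected inside `model_dir`."""
    return [Path(model_dir) / f"{lang}_{kind}.bin.gz" for kind in KINDS]


def fetch_models(lang, model_dir, download):
    """Phase 1 (network allowed): store every missing file for `lang`.

    `download` is a caller-supplied callable mapping a file name to bytes,
    so this module itself contains no networking code.
    """
    Path(model_dir).mkdir(parents=True, exist_ok=True)
    for path in model_paths(lang, model_dir):
        if not path.exists():
            path.write_bytes(download(path.name))


def locate_models(lang, model_dir):
    """Phase 2 (no network): return the paths, or fail loudly if any is missing."""
    paths = model_paths(lang, model_dir)
    missing = [p.name for p in paths if not p.exists()]
    if missing:
        raise FileNotFoundError(
            f"missing model files for {lang!r}: {', '.join(missing)}; "
            "run the fetch step while online"
        )
    return paths
```

A packager would run fetch_models during the download phase and ship the resulting directory. At runtime the application would only call locate_models, which gives a clear local error instead of an HTTP failure when something is missing.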