dglazkov / polymath

MIT License

Figure out a sustainable way to do importers #100

Open jkomoros opened 1 year ago

jkomoros commented 1 year ago

a718b684d9f857694c1545f574e4599c39c48575 happened because there were some imports necessary for particular converters that weren't included. In general the number of importers has ballooned (which is great), but that means that there are a ton of dependencies that aren't actually used for simple hosting.

Some importers are very general (e.g. SubstackImporter) and some are more specific (e.g. ReactRouterImporter).

We should over time get this under control.

The most obvious thing is to move the importer package to a separate repo with its own requirements. In the future, ideally we'd have some kind of late binding, so that an importer's dependencies aren't needed until you actually use that importer. And then in the very far future we'd likely have something where each importer is a separate package, and there's an import command that knows how to fetch importers on request and find them in some maintained directory of known importers.
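
A minimal sketch of what late binding could look like, assuming importers are looked up by name and loaded with `importlib` only on first use. The `IMPORTER_TARGETS` table and `load_importer` function are hypothetical names, not part of polymath, and standard-library modules stand in for real importer modules so the example runs on its own:

```python
import importlib

# Hypothetical table mapping an importer name to "module:attribute".
# Stdlib targets stand in for real importer modules in this sketch.
IMPORTER_TARGETS = {
    "json": "json:loads",
    "csv": "csv:reader",
}


def load_importer(name):
    """Import the module behind `name` only when it is requested."""
    try:
        target = IMPORTER_TARGETS[name]
    except KeyError:
        raise ValueError(f"unknown importer: {name!r}") from None
    module_name, _, attr = target.partition(":")
    try:
        # The (possibly heavy) dependency is imported here, not at startup.
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"importer {name!r} needs a module that is not installed: {exc.name}"
        ) from exc
    return getattr(module, attr)


parse = load_importer("json")
print(parse('{"title": "hello"}'))
```

With this shape, simple hosting never touches the table, so a missing dependency only surfaces when someone asks for the importer that needs it.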

We should also probably do something to make it easier to change the importer signature. Maybe have an abstract base class and use e.g. @override.
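
Roughly what the base-class idea might look like. `BaseImporter`, `get_chunks`, and `ListImporter` are made-up names for illustration; the real importer signature may differ. `typing.override` only exists in Python 3.12+, so the sketch falls back to a no-op marker on older versions:

```python
from abc import ABC, abstractmethod

try:
    from typing import override  # available in Python 3.12+
except ImportError:
    def override(func):
        """Fallback no-op marker for older Python versions."""
        return func


class BaseImporter(ABC):
    """Hypothetical base class; not polymath's actual importer interface."""

    @abstractmethod
    def get_chunks(self, source):
        """Yield dicts describing pieces of content found in `source`."""


class ListImporter(BaseImporter):
    """Toy importer that treats `source` as an iterable of strings."""

    @override
    def get_chunks(self, source):
        for text in source:
            yield {"text": text}


print(list(ListImporter().get_chunks(["first", "second"])))
```

Note that `@override` is checked by static type checkers rather than at runtime: if the base signature is renamed, a type checker flags subclasses that no longer override anything, while the `ABC` refuses to instantiate subclasses that miss the abstract method.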

dglazkov commented 1 year ago

My sense is that we need to have a two-stage importer pipeline:

  1. Stage one is open to everyone and is effectively the wild west: might even be a separate repo. The output of this stage is some JSON format (proto-library?)
  2. Stage two is inside of the polymath package and is generic, focused on the right cleaning, chunking, etc.
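
The two stages above could be sketched like this. The JSON layout (`records` with `url` and `text`), the function names, and the word-count chunking are all assumptions for illustration, not the actual polymath library format:

```python
import json


def stage_one_export(pages):
    """Stage one: any importer, any dependencies; the output is plain JSON."""
    records = [{"url": url, "text": text} for url, text in pages]
    return json.dumps({"records": records})


def stage_two_ingest(payload, max_words=5):
    """Stage two: generic cleaning and chunking, no importer-specific code."""
    chunks = []
    for record in json.loads(payload)["records"]:
        # split() with no arguments also collapses stray whitespace.
        words = record["text"].split()
        for start in range(0, len(words), max_words):
            piece = " ".join(words[start:start + max_words])
            chunks.append({"url": record["url"], "text": piece})
    return chunks


payload = stage_one_export([("https://example.com/post", "one two  three four five six")])
print(stage_two_ingest(payload, max_words=4))
```

The only contract between the stages is the JSON payload, so stage one can live in another repo with whatever dependencies it needs.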

dalmaer commented 1 year ago

There are a couple of separate concerns here IMO:

1) make it easy to build an importer using shared infra. I find that I do the same thing again and again:

2) let an importer register itself. Invert the concern so that loading an importer registers it. Then you don't need an importer's dependencies unless you are actually using that importer.
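
One way self-registration might look, as a sketch only. The `IMPORTERS` registry, the `register_importer` decorator, and `ExampleImporter` are hypothetical names; in practice each importer would sit in its own module next to its heavy imports, so those imports run only when that module is loaded:

```python
# Hypothetical registry filled in as importer modules are loaded.
IMPORTERS = {}


def register_importer(name):
    """Class decorator that records an importer under `name`."""
    def decorator(cls):
        if name in IMPORTERS:
            raise ValueError(f"importer {name!r} is already registered")
        IMPORTERS[name] = cls
        return cls
    return decorator


# In a real layout this class would live in its own module, and any
# third-party imports it needs would sit at the top of that module.
@register_importer("example")
class ExampleImporter:
    def get_chunks(self, source):
        return [{"text": line} for line in source.splitlines() if line.strip()]


importer = IMPORTERS["example"]()
print(importer.get_chunks("first line\n\nsecond line"))
```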