Open jkomoros opened 1 year ago
My sense is that we need to have a two-stage importer pipeline:
polymath
package and is generic, focused on the right cleaning, chunking, etc.
There are a couple of separate concerns here IMO:
1) make it easy to build an importer using shared infra. I find that I do the same thing again and again:
doesn't work on many HTML inputs etc.
2) let an importer register itself. Invert the concern so by loading an importer it registers etc. Then you don't need dependencies unless you are actually using a particular importer.
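A minimal sketch of what self-registration could look like, assuming a module-level registry and a decorator. The names `IMPORTERS`, `register_importer`, and `get_chunks` are hypothetical and are not taken from the existing polymath code; the importer body is a placeholder.

```python
from typing import Callable, Dict

# Hypothetical registry: importer name -> importer class.
IMPORTERS: Dict[str, type] = {}


def register_importer(name: str) -> Callable[[type], type]:
    """Return a class decorator that records the class under `name`."""
    def decorator(cls: type) -> type:
        if name in IMPORTERS:
            raise ValueError(f"Importer {name!r} is already registered")
        IMPORTERS[name] = cls
        return cls
    return decorator


# Loading the module that defines this class is what registers it, so its
# third-party dependencies are only needed if that module is imported.
@register_importer("substack")
class SubstackImporter:
    def get_chunks(self, source: str):
        # Placeholder body; a real importer would parse `source` here.
        yield {"text": source}
```

With this shape, the core package only needs to know about `IMPORTERS`; it never has to import a specific importer module itself.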
a718b684d9f857694c1545f574e4599c39c48575 happened because there were some imports necessary for particular converters that weren't included. In general the number of importers has ballooned (which is great), but that means that there are a ton of dependencies that aren't actually used for simple hosting.
Some importers are very general (e.g. SubstackImporter) and some are more specific (e.g. ReactRouterImporter).
We should over time get this under control.
The most obvious thing is to move the importer package to a separate repo with different requirements. In the future, ideally we'd have some kind of late binding so you don't need the dependencies for an importer until you actually use it. And then in the very far future we'd likely have something where each importer is a separate package and there's an import command that knows how to fetch importers when requested and find them in some maintained directory of known importers.
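One way the late-binding step might be sketched is with `importlib.import_module`, which defers the import until an importer is requested by name. The `LAZY_IMPORTERS` table and `load_importer` function are hypothetical, and standard-library classes stand in for real importers so the sketch runs without extra dependencies.

```python
import importlib

# Hypothetical table mapping importer names to "module:ClassName".
# Stdlib classes are stand-ins for real importer classes.
LAZY_IMPORTERS = {
    "json": "json:JSONDecoder",
    "html": "html.parser:HTMLParser",
}


def load_importer(name: str) -> type:
    """Import the module for `name` on first use and return its class."""
    try:
        target = LAZY_IMPORTERS[name]
    except KeyError:
        raise ValueError(f"Unknown importer: {name!r}") from None
    module_name, _, class_name = target.partition(":")
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Only surfaces when this particular importer is requested.
        raise ImportError(
            f"Importer {name!r} requires dependencies that are not installed"
        ) from exc
    return getattr(module, class_name)
```

A missing dependency then only fails for the person who asks for that importer, not for someone who is just hosting.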
We should also probably do something to make it easier to change the signature. Maybe have an abstract base class and use e.g.
@override
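A rough sketch of the abstract base class idea, assuming a single `get_chunks` method; `BaseImporter`, `PlainTextImporter`, and the method signature are hypothetical. It uses `typing.override` (Python 3.12+), which is enforced by static type checkers rather than at runtime, and falls back to a no-op decorator on older versions.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator

try:
    from typing import override  # available in Python 3.12+
except ImportError:
    def override(func):
        # No-op fallback so the sketch runs on older Python versions.
        return func


class BaseImporter(ABC):
    """Hypothetical base class that pins down the importer signature."""

    @abstractmethod
    def get_chunks(self, source: str) -> Iterator[Dict[str, str]]:
        """Yield one dict per chunk of `source`."""


class PlainTextImporter(BaseImporter):
    @override
    def get_chunks(self, source: str) -> Iterator[Dict[str, str]]:
        # Split on blank lines and drop empty paragraphs.
        paragraphs = [p.strip() for p in source.split("\n\n") if p.strip()]
        for index, paragraph in enumerate(paragraphs):
            yield {"id": str(index), "text": paragraph}
```

If the base signature changes, a type checker would flag every subclass method marked `@override` that no longer matches, and `ABC` refuses to instantiate a subclass that omits `get_chunks` entirely.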