Hi @honzajavorek, thanks for your input. We are going to add a new option for providing additional keyword arguments for `export_data`.
Thanks for considering this!
Just my $.02 - there is the issue of `export_data` being the kind of 80:20 helper that decides the output format based on the destination filename. Adding JSON-specific kwargs to this doesn't feel right, and we discussed some other options - everybody, feel free to speak your mind :slightly_smiling_face:
1. Add format-specific methods to `BasicCrawler`, e.g. `export_data_json` and `export_data_csv`.
2. Keep `export_data`, but have it accept something like `str | JsonExportOptions | CsvExportOptions`... the path could either be part of both option object types, or there could be two parameters - path and options. The decision tree in the implementation may be larger, but manageable and testable.
3. Do nothing - `json.dump(crawler.get_data(), dest_file, indent=2)` should probably do it.

I would go for 2, maybe combined with 1, but mentioning the format-specific methods in the `export_data` comment would do the job too - if you want to configure something specific to JSON or CSV, it makes sense to use a method specific for that format.
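
A rough sketch of what option 2 might look like - `JsonExportOptions` / `CsvExportOptions` are just the names floated above, not an existing Crawlee API, and the fields and the free-function form are assumptions for illustration:

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


# Hypothetical option objects for the proposed export_data() signature;
# the field names simply mirror json.dump() / csv.writer() options.
@dataclass
class JsonExportOptions:
    path: str | Path
    ensure_ascii: bool = True
    indent: int | None = None
    sort_keys: bool = False


@dataclass
class CsvExportOptions:
    path: str | Path
    delimiter: str = ","
    quotechar: str = '"'


async def export_data(dest: str | JsonExportOptions | CsvExportOptions) -> None:
    # Branch on the argument type instead of guessing the format from the file extension.
    if isinstance(dest, JsonExportOptions):
        ...  # JSON export honouring ensure_ascii / indent / sort_keys
    elif isinstance(dest, CsvExportOptions):
        ...  # CSV export honouring delimiter / quotechar
    else:
        ...  # current extension-based behaviour as a fallback
```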
My hunch would be that the current `.export_data()` method is unnecessarily "magical". It implicitly decides the format based on the extension, which isn't clear from the outside, and will work 90% of the time, but surely there are edge cases where the extension won't match or won't, for some reason, be either `.json` or `.csv`. Also, my issue shows that it perhaps does too many things: once we need to pass JSON export options, it doesn't feel right in the interface, as there's also CSV - and CSV has a ton of compatibility options itself! Hence I tend to think that this method should be two explicit methods, which have the potential to take care of the specifics of the individual formats. Also, adding or deprecating a format in the future is, IMHO, done more cleanly with separate functions.
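
For comparison, a sketch of the two explicit methods being argued for - nothing here is an existing Crawlee API; the keyword arguments just mirror `json.dump()` and the `csv` module's dialect options:

```python
from __future__ import annotations

from pathlib import Path


class BasicCrawler:
    # Only the two hypothetical export methods are sketched here.
    async def export_data_json(
        self,
        path: str | Path,
        *,
        ensure_ascii: bool = True,
        indent: int | None = None,
        sort_keys: bool = False,
    ) -> None:
        """Export the dataset as JSON, exposing json.dump() options directly."""
        ...

    async def export_data_csv(
        self,
        path: str | Path,
        *,
        delimiter: str = ",",
        quotechar: str = '"',
    ) -> None:
        """Export the dataset as CSV, exposing csv compatibility options directly."""
        ...
```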
Is this something I can work on?
> Is this something I can work on?
Absolutely, open a PR when you're ready.
There is useful configuration to `json.dump()` which I'd like to pass through `await crawler.export_data("export.json")`, but I see no way to do that:

- `ensure_ascii` - as someone living in a country using extended Latin, setting this to `False` prevents Python from encoding half of the characters as a weird mess
- `indent` - allows me to read the output as a mere human
- `sort_keys` - may be useful for git scraping, not sure

The only workaround I can think of right now is something convoluted like:
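
Something along these lines - just a sketch, assuming `crawler.get_data()` returns a page object whose `.items` attribute holds the scraped records (the attribute name may differ between Crawlee versions):

```python
import json
from pathlib import Path


async def export_json_workaround(crawler, path: str) -> None:
    # Bypass export_data() and serialize the dataset by hand, only so that
    # json.dump() options (ensure_ascii, indent, sort_keys) can be set explicitly.
    data = await crawler.get_data()  # assumed to expose the scraped items as `.items`
    with Path(path).open("w", encoding="utf-8") as dest_file:
        json.dump(data.items, dest_file, ensure_ascii=False, indent=2, sort_keys=True)
```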