apify / crawlee-python

Crawlee: a web scraping and browser automation library for Python that helps you build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP, in both headful and headless modes, with proxy rotation.
https://crawlee.dev/python/
Apache License 2.0

Implement/document a way to pass extra configuration to json.dump() #526

Closed: honzajavorek closed this issue 1 week ago

honzajavorek commented 1 month ago

There are useful options to json.dump(), such as ensure_ascii and indent, which I'd like to pass through await crawler.export_data("export.json"), but I see no way to do that.

The only workaround I can think of right now is something convoluted like:

import json
from pathlib import Path

path = Path("export.json")
await crawler.export_data(path)
# Round-trip the exported file just to re-serialize it with the desired options.
path.write_text(json.dumps(json.loads(path.read_text()), ensure_ascii=False, indent=2))
vdusek commented 1 month ago

Hi @honzajavorek, thanks for your input. We are going to add a new option for providing additional keyword arguments for export_data.

honzajavorek commented 1 month ago

Thanks for considering this!

janbuchar commented 1 month ago

Just my $.02 - there is the issue of export_data being the kind of 80:20 helper that decides the output format based on the destination filename. Adding JSON-specific kwargs to this doesn't feel right, and we discussed some other options - everybody, feel free to speak your mind :slightly_smiling_face:

  1. just add kwargs for json, csv, and whatever other format we may add in the future
  2. make a separate export helper for each format on the BasicCrawler, e.g. export_data_json and export_data_csv
  3. keep just export_data, but have it accept something like str | JsonExportOptions | CsvExportOptions... the path could either be part of both option object types, or there could be two parameters - path and options. The decision tree in the implementation may be larger, but manageable and testable
  4. keep everything the way it is and publish an example of how to do this now - something like json.dump(crawler.get_data(), dest_file, indent=2) should probably do it (see the sketch below)
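
For illustration, a minimal sketch of option 4, assuming the crawler has already finished its run and that get_data() returns a dataset page whose items attribute holds the scraped records; the helper name export_json is made up for this example and is not part of the API:

import json

# Sketch only: manual JSON export instead of export_data(), per option 4 above.
# Assumes `crawler` has finished a run and that `crawler.get_data()` returns a
# page object whose `items` attribute is the list of scraped records.
async def export_json(crawler, path: str) -> None:
    data = await crawler.get_data()
    with open(path, "w", encoding="utf-8") as dest_file:
        # Any json.dump() option (indent, ensure_ascii, default, ...) goes here.
        json.dump(data.items, dest_file, ensure_ascii=False, indent=2)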
B4nan commented 1 month ago

I would go for 2, maybe combined with 1. Mentioning the format-specific methods in the export_data comment would do the job too: if you want to configure something specific to JSON or CSV, it makes sense to use a method dedicated to that format.

honzajavorek commented 1 month ago

My hunch is that the current .export_data() method is unnecessarily "magical". It implicitly decides the format based on the file extension, which isn't clear from the outside. That will work 90% of the time, but there are surely edge cases where the extension won't match the desired format, or won't be .json or .csv at all. This issue also suggests the method does too many things: once we need to pass JSON export options, the interface no longer feels right, because it also handles CSV, and CSV has a ton of compatibility options of its own. Hence I tend to think this method should be split into two explicit methods, each able to take care of the specifics of its own format. Adding or deprecating a format in the future is also, IMHO, done more cleanly with separate functions.
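
To make the proposal concrete, here is a hypothetical sketch of what the split interface could look like; the method names export_data_json and export_data_csv come from the discussion above, but none of this is part of the current crawlee API:

from __future__ import annotations

from pathlib import Path

# Hypothetical interface sketch for option 2: one explicit export method per format.
class BasicCrawlerSketch:
    async def export_data_json(self, path: str | Path, **json_kwargs) -> None:
        """Export the default dataset as JSON; json_kwargs are passed to json.dump()."""
        ...

    async def export_data_csv(self, path: str | Path, **csv_kwargs) -> None:
        """Export the default dataset as CSV; csv_kwargs are passed to csv.writer()."""
        ...

Under such a split, the original request would become something like await crawler.export_data_json("export.json", ensure_ascii=False, indent=2).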

0xSolanaceae commented 1 month ago

Is this something I can work on?

janbuchar commented 1 month ago

> Is this something I can work on?

Absolutely, open a PR when you're ready.