flairNLP / fundus

A very simple news crawler with a funny name
MIT License

[Question]: Best way to save crawled articles #529

Closed stefan-it closed 3 months ago

stefan-it commented 3 months ago

Question

Hi,

many thanks for releasing this great crawler! In particular, the number of supported German publishers is amazing. I am planning to collect some articles for LM pretraining.

I opened this issue because I couldn't find an example in the docs: what is the recommended way to export articles to, e.g., a JSONL file? I could imagine adding a `to_json` function to the `Article` object and then writing the result to a file :thinking:

But it would be great if the documentation could also cover exporting articles :)

Many thanks in advance!
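
To make the JSONL idea above concrete, here is a minimal sketch using only the standard library. `ArticleRecord` and `write_jsonl` are hypothetical names introduced for illustration; they are not part of Fundus, and which attributes of a crawled article get copied into the record is left open.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Iterable, Optional


@dataclass
class ArticleRecord:
    """Hypothetical stand-in for a crawled article; not the Fundus Article class."""

    url: str
    title: Optional[str]
    plaintext: Optional[str]


def write_jsonl(records: Iterable[ArticleRecord], path: Path) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps umlauts and other non-ASCII text readable.
            handle.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")
            count += 1
    return count
```

Because the function accepts any iterable, records can be written as they are produced by a crawl loop instead of being collected in memory first.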

stefan-it commented 3 months ago

Pinging @MaxDall for help :)

stefan-it commented 3 months ago

For now I came up with the following solution:

image

alanakbik commented 3 months ago

Thanks @stefan-it for pointing this out!

I think it would be good for Fundus to offer support for serializing articles. We'd need some helper methods to serialize/deserialize articles. JSON seems like a good fit since it is human-readable. @addie9800 what do you think?
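
As a rough illustration of what such serialize/deserialize helpers might look like, here is a sketch of a JSON round trip. `SerializableArticle` and its fields are assumptions made for this example, not the Fundus data model; the date is stored as an ISO 8601 string because `json` cannot encode `datetime` objects directly.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class SerializableArticle:
    """Hypothetical article container used only to illustrate the round trip."""

    title: str
    authors: List[str]
    publishing_date: Optional[datetime]

    def to_json(self) -> str:
        data = asdict(self)
        # Convert the datetime to an ISO 8601 string so json.dumps can handle it.
        if self.publishing_date is not None:
            data["publishing_date"] = self.publishing_date.isoformat()
        return json.dumps(data, ensure_ascii=False)

    @classmethod
    def from_json(cls, text: str) -> "SerializableArticle":
        data = json.loads(text)
        raw_date = data.get("publishing_date")
        # Restore the datetime, or keep None when no date was stored.
        data["publishing_date"] = datetime.fromisoformat(raw_date) if raw_date else None
        return cls(**data)
```

Calling `SerializableArticle.from_json(article.to_json())` should then yield an object equal to the original, which is the property a test for such helpers would check.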

addie9800 commented 3 months ago

I definitely agree, especially since we already use JSON to represent parsed articles in our tests. @MaxDall has also already started working on an implementation.