commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

produce WET files? #55

Open chris-ha458 opened 1 year ago

chris-ha458 commented 1 year ago

I'm not sure if this is the right place to ask (feel free to direct me elsewhere), but would it be possible to also produce WET files from this library?

Many downstream libraries that build on CC data, such as oscar-project/ungoliant, consume WET files, so it would be useful if WET files were available alongside the WARC files.

wumpus commented 1 year ago

This is a good idea, but as you can see from the other issues opened by @sebastian-nagel, we're short on engineering resources for news crawl work.

chris-ha458 commented 1 year ago

Thanks for the clarification. I will leave this issue open for future reference. Hopefully, if it becomes relevant again, the discussion can continue here. Otherwise, if it is decided that this is not worth attempting (even setting aside the shortage of engineering resources), feel free to close it.

eukaryoting commented 1 year ago

@wumpus as a partial solution, is there an up-to-date way for us to generate WET files ourselves from the news WARC files?

I'm trying to run the WET extractor, as per Sebastian's 2017 comments (https://groups.google.com/g/common-crawl/c/hsb90GHq6to), but running into some issues with building ia-hadoop-tools.

[edit: I've now found another issue related to this -- https://github.com/commoncrawl/ia-hadoop-tools/issues/4]

wumpus commented 1 year ago

In theory all of the code needed to make WETs is public from us, but unfortunately we have limited Sebastian time, and I am not so good at Java! If you come up with some better instructions, I'm happy to check them in somewhere. That's a great example of something that's in the mailing list archive that ought to be promoted to be directly visible and updated for modern versions.

eukaryoting commented 1 year ago

@wumpus Thanks for getting back to me. I got the WET extractor running in the end; the only problem was that ia-hadoop-tools doesn't build with recent Maven versions. I posted what worked back to https://groups.google.com/g/common-crawl/c/hsb90GHq6to/m/V5W-gUBbAgAJ

tfmorris commented 9 months ago

@eukaryoting if you could put that in the form of a pull request, I'd be happy to review it and @wumpus could get it committed.
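
For anyone reading this thread who wants a feel for what a WET file contains, here is a minimal Python sketch, using only the standard library, of turning one HTML payload into a WET-style `conversion` record. It is not the ia-hadoop-tools extractor discussed above (that tool is Java and handles far more cases). The helper names `html_to_text` and `wet_record` are hypothetical, the text extraction is deliberately naive, and real WET files carry further headers (for example `WARC-Refers-To` and `WARC-Block-Digest`) that are omitted here.

```python
import uuid
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Naive HTML-to-text conversion (hypothetical helper, one chunk per line)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def wet_record(target_uri, warc_date, text):
    """Serialize extracted text as a WARC 'conversion' record (hypothetical helper).

    Only a subset of the headers found in real WET files is written.
    """
    payload = text.encode("utf-8")
    headers = [
        "WARC/1.0",
        "WARC-Type: conversion",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {warc_date}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: text/plain",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A WARC record block is followed by two CRLF pairs.
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    sample = (
        "<html><head><title>Example</title><style>p{}</style></head>"
        "<body><p>Hello &amp; welcome</p><script>var x=1;</script></body></html>"
    )
    record = wet_record(
        "https://example.com/", "2024-01-01T00:00:00Z", html_to_text(sample)
    )
    print(record.decode("utf-8"))
```

To process actual news-crawl WARC files, you would additionally need a WARC reader to iterate over the `response` records and pass each HTML payload, target URI, and date through functions like these; that part is left out of the sketch.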