ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

What does the ID do? #191

Closed TheTechRobo closed 2 years ago

TheTechRobo commented 2 years ago

What does the ID for a crawl do?

acrois commented 2 years ago

The ID is used for resuming a crawl, see here: #57

Related: #185

I saw you responded here: #58 so I'll assume that you are aware it is not 100% complete.

The benefit of supplying the ID before the program is run is that you do not have to discover the ID afterwards, which simplifies orchestration. Hopefully support will be finalized soon. I expect to need this functionality for my use case. I will look into what limitations the software has and what it would take to add it. If it's feasible I'll consider working on that!

I do not see any other use for it than identifying the currently active crawl job. Once the crawl is complete you may be able to use that job ID again (this may depend on database state). Unless someone knows for sure, the only way to find out how it behaves is to try it.

I wouldn't use the same job ID over again, personally. Seems error-prone. It's not a large step to generate a random ID externally and then check whether a crawl with that ID is running. I'd only reuse an ID for a restart.

Here's the logic that creates the ID default string: https://github.com/ArchiveTeam/grab-site/blob/132064a24eeedbad2881128f932fca8b0c56ac64/libgrabsite/main.py#L207-L208
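For generating the ID externally, something like the following might work. This is only a sketch under assumptions: I am assuming the default ID is 16 random bytes rendered as 32 hex characters, and that grab-site takes the ID through an `--id` option. Check the linked lines and `grab-site --help` before relying on either. The helper names `new_crawl_id` and `build_command` are made up for illustration.

```python
import secrets


def new_crawl_id() -> str:
    """Return a random ID as 32 lowercase hex characters (16 random bytes).

    Assumption: this matches the shape of grab-site's default ID.
    """
    return secrets.token_hex(16)


def build_command(url: str, crawl_id: str) -> list[str]:
    """Build an argument list that passes the ID to grab-site up front.

    Assumption: `--id` is the option grab-site uses for this.
    """
    return ["grab-site", f"--id={crawl_id}", url]


if __name__ == "__main__":
    crawl_id = new_crawl_id()
    # Record the ID before starting the crawl so no discovery is needed later.
    print(crawl_id)
    print(" ".join(build_command("https://example.com/", crawl_id)))
```

Because the orchestrator picks the ID itself, it can store it alongside the job before the crawl starts and use a fresh one for every new crawl.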

Hope it helps.

ivan commented 2 years ago

Yeah, as far as I remember, it's just an arbitrary ID used for tracking grab-site tasks between the server and dashboard. The first 8 hex-formatted bytes of it are also suffixed to the folder and WARC filenames.

TheTechRobo commented 2 years ago

Thanks!