TheTechRobo closed this 2 years ago
The ID is used for resuming a crawl, see here: #57
Related: #185
I saw you responded here: #58 so I'll assume that you are aware it is not 100% complete.
The benefit of supplying the ID before the program is run is that you do not have to discover that ID afterwards, which simplifies orchestration. Hopefully support will be finalized soon. I expect to need this functionality for my use case. I will look more into what limitations the software has and what it would take to add. If it's feasible I'll consider working on that!
I do not see any use for it other than identifying the currently active crawl job. Once the crawl is complete you may be able to use that job ID again (this may depend on database state). Unless someone knows for sure, the only way to find out how it behaves is to try it.
I wouldn't use the same job ID over again, personally. Seems error-prone. It's not a large step to just generate the ID externally as a random sequence, then check whether that sequence is running. I'd only reuse an ID for a restart.
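To illustrate generating the ID externally, here is a minimal Python sketch. It assumes a 128-bit random hex string is an acceptable ID. The names `new_job_id`, `is_running`, and `active_ids` are hypothetical and not part of grab-site, and how you obtain the set of running IDs (for example from the dashboard) is left to your orchestration.

```python
import secrets


def new_job_id() -> str:
    """Return a random 128-bit ID as 32 lowercase hex characters."""
    return secrets.token_hex(16)


def is_running(job_id: str, active_ids: set) -> bool:
    """Report whether job_id is among the IDs known to be running.

    active_ids is a hypothetical set that your orchestration fills in.
    """
    return job_id in active_ids


job_id = new_job_id()
# Hypothetical: pretend the crawl was started and reported as active.
active_ids = {job_id}
print(job_id, is_running(job_id, active_ids))
```

Because the ID is known before the crawl starts, the orchestrator can record it up front and look for it later without a separate discovery step.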
Here's the logic that creates the ID default string: https://github.com/ArchiveTeam/grab-site/blob/132064a24eeedbad2881128f932fca8b0c56ac64/libgrabsite/main.py#L207-L208
Hope it helps.
Yeah, as far as I remember, it's just an arbitrary ID used for tracking grab-site tasks between the server and dashboard. The first 8 hex-formatted bytes of it are also suffixed to the folder and WARC filenames.
Thanks!
What does the ID for a crawl do?