b5 closed this 7 years ago
cc @zsck
I've been reading a little bit about crawling so-called deep web resources. Here are a few pointers to work I've found so far:
2008 paper about Google's Deep-Web Crawl, about searching for content hidden behind HTML forms.
2012 paper about optimizing access to databases accessible via search interfaces
2013 paper about discovering HTML query forms
2007 paper about adaptive approaches to finding hidden-web entry points
2008 paper about Siphon++
2010 paper about using reinforcement learning to learn how to crawl resources
There seems to be a fair amount of research work on this, going back a long way. A Stanford group had a paper in 2000 about crawling hidden web resources, but it may go back even farther.
Finally, searching around, I came across this Quora answer, which shocked me by being quite good and very detailed, with a lot of pointers to a lot more work in this area than what I've listed above. ("Shocked", because I am almost always disappointed by answers on Quora.)
(P.S. I confess that I have zero experience writing crawlers, so I can only say this looks relevant to the problem but can't tell if it's old hat or junk. Hopefully it's useful.)
@mhucka: Those resources look interesting and different from what I have looked at previously. I need to investigate them.
@b5, @mhucka, and @jeffreyliu: I'm wondering whether these resources and this issue have been superseded by other repos. If so, could we pull out what is necessary and close?
Hum. I forgot I even wrote that note above :-).
I think a good approach would be for me to start a section in https://github.com/archivers-space/research and put those links (+ other similar research) there. I was planning on doing that for some other research items anyway. It probably won't happen immediately but soon-ish.
@mhucka -- I agree!
closing this due to age
As mentioned earlier today, one area that needs a little attention these days is the miru/recipes integration work for submitting code to archive uncrawlable content. @jeffreyliu kicked off what I think is a nice way forward today.
So, how about this as a rough outline on developing a set of recipes & understanding how this integration will work:
I'm hoping this can clear a path forward. This area of work is incredibly complex, so I think we should acknowledge that it's going to take a fair amount of time to get right, but it'll be super sweet & totally fantastic & gumdrops & rainbows when we get it right.