Recipes Merging work & Miru Integration

b5 commented 7 years ago

As mentioned earlier today, one area that needs a little attention these days is the miru/recipes integration work for submitting code to archive uncrawlable content. @jeffreyliu kicked off what I think is a nice way forward today.

So, how about this as a rough outline on developing a set of recipes & understanding how this integration will work:

Use the code & analysis that the Boston DataRescue team has been doing on classifying different types of uncrawlable content as a starting point.
Fold in some light analysis of the scripts submitted in zip archives via the current archivers.space, looking at language & approach preferences.
Pick a few example urls from each class of uncrawlable content.
Write sweet scripts that crawl the examples.
Examine these scripts for the maximum output they can generate for reporting purposes & look for overlaps.
Start plugging this stuff into miru, wait a week, and see what breaks.
Document. Learn. Improve. Wash, rinse, repeat.
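
To make the "look for overlaps" step concrete, here is a minimal sketch of how the reporting output of a few recipe scripts could be compared. Everything in it is an assumption for illustration: the content class names, the report field names, and the idea that each script emits a flat dict are hypothetical and not taken from miru or archivers.space.

```python
from collections import Counter

# Hypothetical report output, one dict per recipe script, keyed by an
# illustrative class of uncrawlable content. None of these names come
# from miru or archivers.space; they are placeholders for the sketch.
SAMPLE_REPORTS = {
    "form_search": {
        "url": "https://example.org/search",
        "files": ["results.csv"],
        "bytes": 1024,
        "queries": ["a", "b"],
    },
    "ftp_listing": {
        "url": "ftp://example.org/data/",
        "files": ["a.zip", "b.zip"],
        "bytes": 4096,
    },
    "js_viewer": {
        "url": "https://example.org/viewer",
        "files": ["map.json"],
        "bytes": 2048,
        "screenshots": ["viewer.png"],
    },
}


def field_overlap(reports):
    """Count how many recipes emit each report field."""
    counts = Counter()
    for report in reports.values():
        counts.update(report.keys())
    return counts


def common_fields(reports):
    """Return the fields that every recipe reports, sorted by name."""
    counts = field_overlap(reports)
    return sorted(name for name, n in counts.items() if n == len(reports))


if __name__ == "__main__":
    print("Fields shared by all recipes:", common_fields(SAMPLE_REPORTS))
    print("Per-field recipe counts:", dict(field_overlap(SAMPLE_REPORTS)))
```

The shared fields would be candidates for a common reporting format, while fields that only one recipe emits hint at where class-specific extensions might be needed.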

I'm hoping this can clear a path forward. This area of work is incredibly complex, so I think we should acknowledge that it's going to take a fair amount of time to get right, but it'll be super sweet & totally fantastic & gumdrops & rainbows when we do.

b5 commented 7 years ago

cc @zsck

mhucka commented 7 years ago

I've been reading a little bit about crawling so-called deep web resources. Here are a few pointers to work I've found so far:

2008 paper about Google's Deep-Web Crawl, about searching for content hidden behind HTML forms.
2012 paper about optimizing access to databases accessible via search interfaces.
2013 paper about discovering HTML query forms.
2007 paper about adaptive approaches to finding hidden-web entry points.
2008 paper about Siphon++.
2010 paper about using reinforcement learning to learn how to crawl resources.

There seems to be a fair amount of research work on this, going back a long way. A Stanford group had a paper in 2000 about crawling hidden web resources, but the work may go back even farther.

Finally, searching around, I came across this Quora answer, which shocked me by being quite good and very detailed, with a lot of pointers to a lot more work in this area than what I've listed above. ("Shocked", because I am almost always disappointed by answers on Quora.)

(P.S. I confess that I have zero experience writing crawlers, so I can only say this looks relevant to the problem but can't tell if it's old hat or junk. Hopefully it's useful.)

jonganc commented 7 years ago

@mhucka: Those resources look interesting and different from what I have looked at previously. I need to investigate them.

dcwalk commented 7 years ago

@b5, @mhucka, and @jeffreyliu: I'm wondering whether the resources and this issue have been superseded by other repos. If so, could we pull out what is necessary and close?

mhucka commented 7 years ago

Hum. I forgot I even wrote that note above :-).

I think a good approach would be for me to start a section in https://github.com/archivers-space/research and put those links (+ other similar research) there. I was planning on doing that for some other research items anyway. It probably won't happen immediately but soon-ish.

dcwalk commented 7 years ago

@mhucka -- I agree!

b5 commented 7 years ago

closing this due to age

datatogether / roadmap

Recipes Merging work & Miru Integration #4