what resources should be deleted from the metadata/search index?

AusDTO / disco_layer

Code, outputs and Information relevant to the discovery layer.


what resources should be deleted from the metadata/search index? #76

Open monkeypants opened 9 years ago

monkeypants commented 9 years ago

Following on from #66, we also need to delete resources from the index when the crawler discovers that they are missing or have been removed.

@nokout, how does a deleted resource appear in the crawler's DB when it should be deleted from the search index?

See for example the raw SQL queries in disco_service/crawler/tasks.py. What similar query would list "things that should be deleted from metadata_resource", assuming we want to delete from that table when we don't want the resource in the search index anymore?

nokout commented 9 years ago

Just to be painful, there are probably a couple of things we need to look at. A null hash means there is no longer a document there. httpCode gives the status of a document (e.g. we might get a 404 or 500 page, but should ignore that content). We probably also need to look at fetchStatus: if that is error then we should not trust any of the other fields, even if they say 200. Helpful?

monkeypants commented 9 years ago

So delete from metadata if metadata.url == webDocument.url AND:

(metadata.hash is not null AND webDocument.hash is null) OR
(webDocument.httpCode != (2xx or 3xx)) OR
(fetchStatus == <WHAT?>)

right?
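To make sure we mean the same thing, here is a minimal Python sketch of that condition. It assumes the two rows have already been matched on url. The function name and the `bad_fetch_statuses` parameter are hypothetical, and the set of fetchStatus values is passed in because it is still undecided at this point. Treating a missing httpCode as "bad" is also my assumption.

```python
from typing import Optional, Set


def should_delete(
    metadata_hash: Optional[str],
    doc_hash: Optional[str],
    http_code: Optional[int],
    fetch_status: str,
    bad_fetch_statuses: Set[str],
) -> bool:
    """Return True if an indexed resource should be dropped from the index."""
    # The index has a hash on record but the crawler no longer sees a document.
    hash_vanished = metadata_hash is not None and doc_hash is None
    # Anything outside 2xx/3xx is unusable; a missing code is treated the
    # same way (assumption, not confirmed in this thread).
    bad_http = http_code is None or not (200 <= http_code < 400)
    # Which statuses belong in this set is the open question above.
    bad_fetch = fetch_status in bad_fetch_statuses
    return hash_vanished or bad_http or bad_fetch
```

For example, `should_delete("abc", None, 200, "ok", {"error"})` is true because the hash has vanished, even though the HTTP code looks healthy ("ok" is just a placeholder status here).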

monkeypants commented 9 years ago

I think to finish that and close the ticket, I just need to know: which fetchStatus values should cause a resource to be deleted from the index?

nokout commented 9 years ago

I think "failed", "notfound" and "redirected" should be canned. I'm not sure about timeout; it's a line ball, since the timeout is only 20 sec at the moment.

My gut feeling is delete the timeouts too. We will still know about them and can look at more finessed approaches later.
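Putting that together with the earlier condition, the "things to delete" query could look roughly like the sketch below. It uses Python's sqlite3 with an in-memory database purely so the example is self-contained. The table and column names follow this thread, but the real schema and database may differ, and the "fetched" status in the sample data is a made-up placeholder for a healthy fetch.

```python
import sqlite3

# Statuses named in this comment; "timeout" is included per the gut feeling.
DELETE_STATUSES = ("failed", "notfound", "redirected", "timeout")

# Table and column names follow the thread; the real schema may differ.
QUERY = """
SELECT m.url
FROM metadata_resource AS m
JOIN WebDocument AS w ON w.url = m.url
WHERE (m.hash IS NOT NULL AND w.hash IS NULL)
   OR w.httpCode IS NULL
   OR w.httpCode NOT BETWEEN 200 AND 399
   OR w.fetchStatus IN ({placeholders})
ORDER BY m.url
"""


def urls_to_delete(conn: sqlite3.Connection) -> list:
    """List URLs in metadata_resource that should leave the search index."""
    placeholders = ", ".join("?" for _ in DELETE_STATUSES)
    rows = conn.execute(QUERY.format(placeholders=placeholders), DELETE_STATUSES)
    return [url for (url,) in rows]


def demo() -> list:
    """Run the query against a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metadata_resource (url TEXT, hash TEXT)")
    conn.execute(
        "CREATE TABLE WebDocument "
        "(url TEXT, hash TEXT, httpCode INTEGER, fetchStatus TEXT)"
    )
    conn.executemany(
        "INSERT INTO metadata_resource VALUES (?, ?)",
        [("a", "h1"), ("b", "h2"), ("c", "h3"), ("d", "h4")],
    )
    conn.executemany(
        "INSERT INTO WebDocument VALUES (?, ?, ?, ?)",
        [
            ("a", "h1", 200, "fetched"),  # healthy, stays in the index
            ("b", None, 200, "fetched"),  # hash vanished
            ("c", "h3", 404, "fetched"),  # bad HTTP code
            ("d", "h4", 200, "timeout"),  # bad fetch status
        ],
    )
    return urls_to_delete(conn)
```

Here `demo()` lists "b", "c" and "d" for deletion and leaves "a" alone. Dropping "timeout" from `DELETE_STATUSES` is the one-line change if we decide timeouts should stay in the index.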

I'd really like to create a score that increases each time we have an issue with a URL; that score would then drive the refetch window and the actions to take in the index. But I think that's something we need to spend more time on in the next iteration.

monkeypants commented 9 years ago

If we were deleting from metadata_resource when we discover a timeout in WebDocument, what if we also incremented a "delete_count" field in WebDocument? It wouldn't be perfect, but you could combine it with age to get an indication of flakiness.
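One possible way to combine the two, as a rough sketch only: deletions per day since the URL was first seen. The `first_seen` timestamp and the formula itself are assumptions for illustration, not something WebDocument is known to provide.

```python
from datetime import datetime


def flakiness(delete_count: int, first_seen: datetime, now: datetime) -> float:
    """Deletions per day of known age; higher means a flakier URL."""
    # Floor the age at one day so a brand-new URL cannot divide by zero
    # or look wildly flaky after a single timeout.
    age_days = max((now - first_seen).total_seconds() / 86400.0, 1.0)
    return delete_count / age_days
```

A URL deleted 3 times over 10 days scores 0.3, while one deleted 3 times over 300 days scores 0.01, so the same delete_count reads very differently once age is taken into account.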

nokout commented 9 years ago

Sorry, I thought I had already answered this one. I think that's the right approach, but it is probably one for the next iteration.

nokout commented 9 years ago

@auxesis @monkeypants @mmck-dto can someone please close or reassign this?