Closed by jetnet 2 years ago
Do you have an example? Deletion requests are sent to your Committer, so maybe you can write a committer that does exactly what you want?
We create image thumbnails and store them on the file system. When an image gets deleted from the source, I'd like to delete the corresponding thumbnail file from the file system.
We use ES-committer.
I was thinking of ExternalTransformer, which could be called in the deletion workflow as well: the document's reference should be enough to do what is needed.
You are right, the best way is to write a simple custom committer that deletes the external files. But if you have time, could you please share your thoughts on this feature: adding a processing step for deleted docs on the connector or importer side? Thanks!
How are you creating the thumbnails? By "creating", do you mean a screen capture of the page? Is there a way for you to reference the image file you created? If so, in your Committer, when you get a document deletion, you can delete the associated image.
If you are instead saving an existing image file referenced from within the document, then I think your concern is to delete just the image when that image no longer exists at the source (but the document itself still exists).
If so, one approach could be that, in addition to storing the image locally, you have it processed normally (crawled as a standalone document). Your own committer could recognize images and not send them to ES. This fools the crawl store into thinking the image was crawled properly, so it keeps a reference to it and checks whether it changed on subsequent crawls. If the source image then gets deleted, you will get a committer deletion request and you can delete the corresponding file on disk.
I am not sure what you do with these images, but you can also point to them (storing their URL) instead of keeping local copies.
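The routing idea above (swallow images on add, delete the local file on a deletion request) could be sketched roughly as follows. This is only an illustration of the decision logic: `ImageAwareRouter`, its three callbacks, and the extension-based `isImage` check are hypothetical names made up for this sketch, not part of the Norconex Committer API. In a real committer the same decisions would sit inside its add and remove handling.

```java
import java.util.Locale;
import java.util.function.Consumer;

/** Hypothetical stand-in for the routing logic of a custom committer. */
final class ImageAwareRouter {
    private final Consumer<String> forwardUpsert;   // assumed: sends the doc to ES
    private final Consumer<String> forwardDelete;   // assumed: deletes the doc in ES
    private final Consumer<String> deleteLocalFile; // assumed: removes the file on disk

    ImageAwareRouter(Consumer<String> forwardUpsert,
                     Consumer<String> forwardDelete,
                     Consumer<String> deleteLocalFile) {
        this.forwardUpsert = forwardUpsert;
        this.forwardDelete = forwardDelete;
        this.deleteLocalFile = deleteLocalFile;
    }

    /** Assumption: images can be recognized by file extension alone. */
    static boolean isImage(String reference) {
        String path = reference.toLowerCase(Locale.ROOT);
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query); // ignore the query string
        }
        return path.endsWith(".png") || path.endsWith(".jpg")
                || path.endsWith(".jpeg") || path.endsWith(".gif");
    }

    /** Add/update request: images are not forwarded, everything else is. */
    void onUpsert(String reference) {
        if (!isImage(reference)) {
            forwardUpsert.accept(reference);
        }
    }

    /** Deletion request: images trigger a local file delete instead. */
    void onDelete(String reference) {
        if (isImage(reference)) {
            deleteLocalFile.accept(reference);
        } else {
            forwardDelete.accept(reference);
        }
    }
}
```

Checking the content type from the document metadata would likely be more reliable than the extension check used here, if that metadata is available to the committer.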
We use ExternalTransformer with a shell script to create thumbnails for downloaded images. The thumbnails are shown on the results page, as the source images could be too large to load and display on the preview page.
Elasticsearch gets only metadata for the images.
When a source image gets deleted, we should delete its thumbnail as well. So, possible solutions now:
URLStatusCrawlerEventListener and delete the corresponding thumbnails with a script
With version 3.0.0, this can be accomplished using an event listener. You can implement the IEventListener interface.
A quick example (untested):
import com.norconex.collector.core.crawler.CrawlerEvent;
import com.norconex.commons.lang.event.Event;
import com.norconex.commons.lang.event.IEventListener;

// Listens to crawler events and reacts when a document is no longer found.
public class DeleteImageOnNotFoundListener implements IEventListener<Event> {
    @Override
    public void accept(Event event) {
        // React only to documents rejected because they were not found.
        if (event.is(CrawlerEvent.REJECTED_NOTFOUND)) {
            String url = ((CrawlerEvent) event).getCrawlDocInfo().getReference();
            //TODO delete the image associated with this URL.
        }
    }
}
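One possible way to fill in the TODO is sketched below using only the JDK. It assumes, purely for illustration, that each thumbnail is stored under a base directory with a file name derived from the SHA-256 hash of the document URL plus a `.jpg` extension. The `ThumbnailStore` class and that naming scheme are hypothetical; the script that creates the thumbnails would have to use the same mapping for this to work.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: maps a document reference (URL) to a thumbnail file. */
final class ThumbnailStore {
    private final Path baseDir;

    ThumbnailStore(Path baseDir) {
        this.baseDir = baseDir;
    }

    /** Assumed naming scheme: hex-encoded SHA-256 of the URL plus ".jpg". */
    Path pathFor(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return baseDir.resolve(hex + ".jpg");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Deletes the thumbnail for the URL; returns false if none existed. */
    boolean deleteFor(String url) {
        try {
            return Files.deleteIfExists(pathFor(url));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Inside the listener, the TODO could then become something like `store.deleteFor(url)`, where `store` is a `ThumbnailStore` created with the thumbnail directory.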
It would be great to be able to do some extra work during the deletion process.
Thanks a lot!