mdoering closed 4 years ago
https://www.gbif.org/occurrence/1137400685 is an example.
There are 4 million items in this collection, so a HEAD request is still too much. We could add some regular expressions to the parser, and populate it with patterns we know to be OK.
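As a rough illustration of the regex idea, here is a minimal sketch in Python (the real parser is Java, so this only shows the shape of the approach). The two patterns, the `example.org` hosts, and the names `KNOWN_MEDIA_PATTERNS` and `mime_type_from_known_patterns` are all made up for the example; they are not the patterns from the actual patch.

```python
import re
from typing import Optional

# Hypothetical patterns: (compiled regex, MIME type to assume on a match).
# These hosts and paths are placeholders, not real publisher URLs.
KNOWN_MEDIA_PATTERNS = [
    (re.compile(r"^https?://images\.example\.org/iiif/[^/]+/full/"), "image/jpeg"),
    (re.compile(r"^https?://sounds\.example\.org/recording/\d+$"), "audio/mpeg"),
]


def mime_type_from_known_patterns(url: str) -> Optional[str]:
    """Return the assumed MIME type if the URL matches a known pattern, else None."""
    for pattern, mime_type in KNOWN_MEDIA_PATTERNS:
        if pattern.search(url):
            return mime_type
    return None
```

A caller would presumably try this lookup only after suffix-based detection has failed, so that URLs ending in `.jpg` and the like keep their current behaviour.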
I like the regex patch. But to deal with new URLs as they come in, wouldn't it be useful to try some HEAD requests for the patterns found in the data? We don't need to scan all 4 million items in the collection above. Doing 10 or even 100 HEAD requests against the same base URI could be an option, so that the parser can be trained. Ideally we would then also persist the trained knowledge ...
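A rough sketch of what that sampling could look like, in Python for brevity. Everything here is an assumption for illustration: grouping by host as the "base URI", the `sample_size` and `min_agreement` parameters, and the function name `learn_host_mime_types`. The `fetch_content_type` callable is passed in so that the HEAD request itself can be swapped out or mocked.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, Optional
from urllib.parse import urlsplit


def learn_host_mime_types(
    urls: Iterable[str],
    fetch_content_type: Callable[[str], Optional[str]],
    sample_size: int = 10,
    min_agreement: float = 0.9,
    seed: int = 0,
) -> Dict[str, str]:
    """Sample a few URLs per host and record the dominant content type.

    A host is only "learned" when at least min_agreement of the sampled
    URLs report the same content type.
    """
    # Group URLs by host, used here as a stand-in for the "base URI".
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc].append(url)

    rng = random.Random(seed)
    learned = {}
    for host, host_urls in by_host.items():
        sample = rng.sample(host_urls, min(sample_size, len(host_urls)))
        counts = Counter(t for t in map(fetch_content_type, sample) if t)
        if not counts:
            continue  # every sampled request failed; learn nothing
        mime_type, hits = counts.most_common(1)[0]
        if hits / len(sample) >= min_agreement:
            learned[host] = mime_type
    return learned
```

The returned dictionary is plain data, so persisting the trained knowledge could be as simple as serialising it to JSON. How that state would be shared across parallel workers is left open here.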
We process this in parallel, so it's difficult to capture that state.
One option for this could be machine tags @MattBlissett (e.g. gbif.org assumeImageURLs=true)
I didn't want to overengineer anything.
It's a very small number of datasets that are affected -- only those where the publisher is
.jpg
or similar in the URL.
The 7 regexes match all image/sound files currently not parsing correctly, where there are more than 25 such URLs on the same server.
(A default value for http://purl.org/dc/terms/format doesn't work; it would need some work in pipelines.)
The MIME type of URIs is detected using Apache Tika: https://github.com/gbif/parsers/blob/master/src/main/java/org/gbif/common/parsers/MediaParser.java#L93
Knowing the mime type is essential for interpreting media objects and will determine if a URL is taken as an image, audio or a link to a webpage:
https://github.com/gbif/occurrence/blob/master/occurrence-processor/src/main/java/org/gbif/occurrence/processor/interpreting/MultiMediaInterpreter.java#L106
Opaque URLs without file suffixes are detected as HTML and therefore treated as links. We need to improve this and look at the actual content type returned by an HTTP request. A small HEAD request should be sufficient. Because calling every URI requires a lot more resources, the parser should probably make this feature configurable so it can be turned on or off.
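To make the proposal concrete, here is a minimal Python sketch of a suffix-first lookup with an optional HEAD fallback. It is not the MediaParser code: Python's `mimetypes` module stands in for Tika's suffix detection, and the names `head_content_type`, `detect_mime_type` and the `use_head_request` switch are hypothetical.

```python
import mimetypes
import urllib.request
from typing import Callable, Optional


def head_content_type(url: str, timeout: float = 5.0) -> Optional[str]:
    """Issue a HEAD request and return the bare Content-Type, or None on failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            raw = response.headers.get("Content-Type")
    except (OSError, ValueError):
        # Network errors, HTTP errors and malformed URLs all count as "unknown".
        return None
    if not raw:
        return None
    # Drop parameters such as "; charset=utf-8".
    return raw.split(";", 1)[0].strip().lower()


def detect_mime_type(
    url: str,
    use_head_request: bool = False,
    fetch: Callable[[str], Optional[str]] = head_content_type,
) -> Optional[str]:
    """Guess from the file suffix first; only ask the server when allowed to."""
    guessed, _encoding = mimetypes.guess_type(url)
    if guessed is not None or not use_head_request:
        return guessed
    return fetch(url)
```

With `use_head_request` left at its default of `False` no network call is made, which matches the idea of a feature that can be switched off for large crawls.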