Detect media type from opaque URLs

gbif / parsers

Various GBIF parsers for dates, countries, language, taxon ranks, etc

Apache License 2.0

Detect media type from opaque URLs #11

Closed. mdoering closed this issue 4 years ago

mdoering commented 6 years ago

The mime type of URIs is detected using Apache Tika: https://github.com/gbif/parsers/blob/master/src/main/java/org/gbif/common/parsers/MediaParser.java#L93

Knowing the mime type is essential for interpreting media objects and will determine if a URL is taken as an image, audio or a link to a webpage:

https://github.com/gbif/occurrence/blob/master/occurrence-processor/src/main/java/org/gbif/occurrence/processor/interpreting/MultiMediaInterpreter.java#L106

Opaque URLs without file suffixes are detected as HTML and are therefore treated as links. We need to improve this and look at the actual content type returned by an HTTP request. A small HEAD request should be sufficient. As calling all URIs requires a lot more resources, the parser should probably be configurable so that this feature can be turned on or off.
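A minimal sketch of what such a switchable HEAD lookup could look like, assuming Java 11+ and the JDK's `java.net.http` client. The class name `HeadContentTypeProbe`, the `enabled` flag and the 5 second timeouts are hypothetical and are not part of the existing MediaParser:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper, not part of the existing MediaParser. */
final class HeadContentTypeProbe {
  private final boolean enabled; // assumption: callers can switch the network lookup off
  private final HttpClient client;

  HeadContentTypeProbe(boolean enabled) {
    this.enabled = enabled;
    this.client = HttpClient.newBuilder()
        .followRedirects(HttpClient.Redirect.NORMAL)
        .connectTimeout(Duration.ofSeconds(5))
        .build();
  }

  /** Returns the bare mime type from the Content-Type header, or empty if disabled or on any failure. */
  Optional<String> probe(URI uri) {
    if (!enabled) {
      return Optional.empty();
    }
    try {
      // HEAD asks for the headers only, so no media bytes are transferred.
      HttpRequest request = HttpRequest.newBuilder(uri)
          .method("HEAD", HttpRequest.BodyPublishers.noBody())
          .timeout(Duration.ofSeconds(5))
          .build();
      HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
      if (response.statusCode() / 100 != 2) {
        return Optional.empty();
      }
      return response.headers().firstValue("Content-Type").map(HeadContentTypeProbe::stripParameters);
    } catch (IOException | IllegalArgumentException e) {
      // Unreachable server or unsupported URI scheme: fall back to whatever the caller already knows.
      return Optional.empty();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Optional.empty();
    }
  }

  /** Drops parameters such as "; charset=..." and lower-cases the type. */
  static String stripParameters(String contentType) {
    int semicolon = contentType.indexOf(';');
    String bare = semicolon < 0 ? contentType : contentType.substring(0, semicolon);
    return bare.trim().toLowerCase(Locale.ROOT);
  }
}
```

With `enabled` set to `false` the probe never touches the network, so the suffix-based detection would remain the only source of the mime type.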

MattBlissett commented 6 years ago

https://www.gbif.org/occurrence/1137400685 is an example.

There are 4 million items in this collection, so a HEAD request is still too much. We could add some regular expressions to the parser, and populate it with patterns we know to be OK.
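The regular-expression idea could be sketched roughly as follows. The two patterns and the class name `KnownUrlPatterns` are invented for illustration only; the patterns actually used are not listed in this thread:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical lookup of mime types for URL patterns that are known to serve media. */
final class KnownUrlPatterns {
  // Insertion order is kept so that more specific patterns can be listed first.
  private static final Map<Pattern, String> PATTERNS = new LinkedHashMap<>();

  static {
    // Made-up example patterns; real ones would come from URLs observed in the data.
    PATTERNS.put(Pattern.compile("^https?://images\\.example\\.org/iiif/.+$"), "image/jpeg");
    PATTERNS.put(Pattern.compile("^https?://media\\.example\\.net/sound\\?id=\\d+$"), "audio/mpeg");
  }

  /** Returns the mime type of the first pattern matching the whole URL, or empty if none matches. */
  static Optional<String> mimeTypeFor(String url) {
    for (Map.Entry<Pattern, String> entry : PATTERNS.entrySet()) {
      if (entry.getKey().matcher(url).matches()) {
        return Optional.of(entry.getValue());
      }
    }
    return Optional.empty();
  }
}
```

A lookup like this costs no network traffic, which is the point when a single collection has millions of items.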

mdoering commented 4 years ago

I like the regex patch. But to deal with new URLs as they appear, would it not be useful to try some HEAD requests for the same patterns found in the data? We don't need to scan all 4 million items in the collection above. But doing 10 or even 100 HEAD requests for the same base URI could be an option so that the parser can be trained. Ideally we would then also persist the trained knowledge ...
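A rough sketch of the sampling idea, under several assumptions: the host name stands in for the "base URI", the probe is passed in as a function (for example a HEAD lookup), and the learned state lives only in memory of one JVM. The class `SampledHostClassifier` and its parameters are hypothetical:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical: probes a few URLs per host, then reuses the most common answer. */
final class SampledHostClassifier {
  private static final class Stats {
    int probes;
    final Map<String, Integer> counts = new HashMap<>();
  }

  private final int maxProbesPerHost;
  private final Function<URI, Optional<String>> probe; // e.g. a HEAD request; injected by the caller
  private final Map<String, Stats> perHost = new HashMap<>();

  SampledHostClassifier(int maxProbesPerHost, Function<URI, Optional<String>> probe) {
    this.maxProbesPerHost = maxProbesPerHost;
    this.probe = probe;
  }

  /** Synchronized for simplicity; this holds the lock while probing, which a real implementation would avoid. */
  synchronized Optional<String> mimeTypeFor(URI uri) {
    String host = uri.getHost();
    if (host == null) {
      return Optional.empty();
    }
    Stats stats = perHost.computeIfAbsent(host, h -> new Stats());
    if (stats.probes < maxProbesPerHost) {
      // Still within the sampling budget: ask the probe and remember the answer.
      stats.probes++;
      Optional<String> probed = probe.apply(uri);
      probed.ifPresent(type -> stats.counts.merge(type, 1, Integer::sum));
      return probed;
    }
    // Budget spent: answer with the type seen most often for this host, without any request.
    return stats.counts.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey);
  }
}
```

Persisting the learned types and sharing them between parallel workers is left out here; the sketch only shows how the number of requests per host could be capped.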

timrobertson100 commented 4 years ago

We process this in parallel, so it's difficult to capture that state.

One option for this could be machine tags @MattBlissett (e.g. gbif.org assumeImageURLs=true)

MattBlissett commented 4 years ago

I didn't want to overengineer anything.

It's a very small number of datasets that are affected -- only those where the publisher is

- using associatedMultimedia rather than any of the three multimedia extensions
- not using plain image URLs, which would have normal extensions
- instead using some sort of image handling software which doesn't have .jpg or similar in the URL

The 7 regexes match all image/sound files currently not parsing correctly, where there are more than 25 such URLs on the same server.
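One way a "more than 25 URLs on the same server" cut-off might be applied is to count the unparsed URLs per host and keep only the hosts above the threshold. This is an illustrative sketch with a hypothetical `HostFrequency` helper, not a description of how the 7 regexes were actually derived:

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical helper: finds hosts that serve many URLs which currently fail to parse as media. */
final class HostFrequency {

  /** Returns host -> count for every host with strictly more than {@code threshold} URLs. */
  static Map<String, Long> hostsAbove(List<String> urls, long threshold) {
    Map<String, Long> counts = urls.stream()
        .map(HostFrequency::hostOf)
        .filter(Objects::nonNull)
        .collect(Collectors.groupingBy(host -> host, TreeMap::new, Collectors.counting()));
    // Drop hosts at or below the threshold; only frequent ones are worth a dedicated pattern.
    counts.values().removeIf(count -> count <= threshold);
    return counts;
  }

  /** Returns the host of a URL, or null if the string is not a valid URI or has no host. */
  private static String hostOf(String url) {
    try {
      return URI.create(url).getHost();
    } catch (IllegalArgumentException e) {
      return null;
    }
  }
}
```

Calling `HostFrequency.hostsAbove(urls, 25)` on the list of unparsed URLs would give the candidate servers for which a pattern could then be written by hand.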

(A default value for http://purl.org/dc/terms/format doesn't work; it would need some work in pipelines.)