gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Multimedia extension having a filename as an identifier results in malformed URLs #713

Open sadeghim opened 2 years ago

sadeghim commented 2 years ago

ALA has a use case for ingesting a DwCa whose zip file contains embedded images and whose multimedia extension uses file names as identifiers, like:

identifier ...
654355.jpg ...
654356.jpg ...

The process is to load the images into our image-service before ingesting the DwCa via the pipeline. I was expecting that if I left the identifiers as they are, they would be matched with the loaded images and the occurrences would get the UUIDs of the loaded images. But after investigating the issue, I found that la-pipelines prepends http:// to the image names used as identifiers.
The verbatim.avro looks right:

[http://rs.gbif.org/terms/1.0/Multimedia -> 
  [[http://purl.org/dc/terms/format -> image/jpeg, 
     http://purl.org/dc/terms/license -> https://creativecommons.org/licenses/by/4.0/, 
     http://purl.org/dc/terms/rightsHolder -> Museums Victoria, 
     http://purl.org/dc/terms/source -> Museums Victoria, 
     http://purl.org/dc/terms/creator -> Hewish, Marilyn, 
     http://purl.org/dc/terms/identifier -> 654355.jpg,
     http://purl.org/dc/terms/type -> StillImage, 
     http://purl.org/dc/terms/references -> https://collections.museumsvictoria.com.au/specimens/2468879, 
     http://purl.org/dc/terms/title -> Sedenia rupalis, 
     http://purl.org/dc/terms/publisher -> Museums Victoria, 
     http://purl.org/dc/terms/description -> Sedenia rupalis, Crambid moth. Grampians National Park, Victoria.],
...

But the avro files in images-load/new-images/* aren't:

type format identifier
StillImage image/jpeg http://4237.jpg

sadeghim commented 2 years ago

@djtfmartin could you please have a look at this one? thanks

sadeghim commented 2 years ago

@djtfmartin any chance you could look at this one? thanks.

djtfmartin commented 2 years ago

The http:// is added to the identifier field by this line of code:

https://github.com/gbif/pipelines/blob/dev/sdks/core/src/main/java/org/gbif/pipelines/core/interpreters/extension/MultimediaInterpreter.java#L132

  private static String parseAndSetIdentifier(Multimedia m, String v) {
    URI uri = UrlParser.parse(v);
    Optional<URI> uriOpt = Optional.ofNullable(uri);
    if (uriOpt.isPresent()) {
      Optional<String> opt = uriOpt.map(URI::toString);
      if (opt.isPresent()) {
        opt.ifPresent(m::setIdentifier);
      } else {
        return OccurrenceIssue.MULTIMEDIA_URI_INVALID.name();
      }
    } else {
      return OccurrenceIssue.MULTIMEDIA_URI_INVALID.name();
    }
    return "";
  }

It is tempting to change this so that the parsing is kept as a check for invalid URIs while the original identifier value is passed through unchanged. However, I'm not clear on the downstream consequences for GBIF.

cc @muttcg any thoughts?
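
As a rough illustration of the change being considered above, here is a minimal sketch. It is not the pipelines code: `Media` and `lenientParse` are hypothetical stand-ins for `Multimedia` and `UrlParser.parse`, and the `http://` prefixing in `lenientParse` only imitates the behaviour reported in this issue. The string `"MULTIMEDIA_URI_INVALID"` stands in for `OccurrenceIssue.MULTIMEDIA_URI_INVALID.name()`.

```java
import java.net.URI;
import java.net.URISyntaxException;

class IdentifierPassThrough {

  // Hypothetical stand-in for the pipeline's Multimedia record.
  static class Media {
    String identifier;
  }

  // Hypothetical stand-in for UrlParser.parse: returns null for unparsable
  // input and prepends "http://" when the value has no scheme.
  static URI lenientParse(String value) {
    if (value == null || value.trim().isEmpty()) {
      return null;
    }
    String v = value.trim();
    try {
      URI uri = new URI(v);
      return uri.isAbsolute() ? uri : new URI("http://" + v);
    } catch (URISyntaxException e) {
      return null;
    }
  }

  // Parsing is used only as a validity check; the original value is stored.
  static String parseAndSetIdentifier(Media m, String v) {
    if (lenientParse(v) == null) {
      return "MULTIMEDIA_URI_INVALID";
    }
    m.identifier = v.trim();
    return "";
  }
}
```

With this variant, an identifier such as 654355.jpg would be stored as 654355.jpg rather than as http://654355.jpg. Whether GBIF's downstream consumers can cope with non-URL identifiers remains the open question.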

MattBlissett commented 2 years ago

Would it be possible to test for the existence of the image file? I suspect not, at least not in every case.

There are plenty of bad values in the associatedMedia field, like https://www.gbif.org/occurrence/1322184705, 1819203515, 3421630418, 876326522, 2465040970, 1999046284, 1500395476 ...

sadeghim commented 2 years ago

Hi guys, any progress on this?

timrobertson100 commented 2 years ago

If the infrastructure is going to be serving the image, I'd suggest the better approach is to put in a real HTTP URI. E.g. if the dataset has ./images/myImage.jpg in the CSV, and that file is bundled in the zip, it would be better for it to be served at https://ala.org.au/images/dataset123/myImage.jpg or so, and for the accessible URL to be put into the Occurrence/Media metadata objects.

Having locally referenced files in the occurrence records that are not accessible seems pointless for consumers in a public API (i.e. users of .../occurrence/123).

MattBlissett commented 2 years ago

I think that's the point: DWCAs with image.jpg should be interpreted to a suitable URL, but this must happen only if image.jpg actually exists in the DWCA, because there are thousands of values like 1234234 etc. in that field.
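
One way to picture that existence check is the sketch below. It assumes the archive is available as a zip file at interpretation time and that field values are compared against zip entry names after stripping a leading `./`. The class and method names are hypothetical, not part of the pipelines code base.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class ArchiveFileCheck {

  // Collects the names of all regular files inside a zipped archive.
  static Set<String> entryNames(Path zip) throws IOException {
    try (ZipFile zf = new ZipFile(zip.toFile())) {
      return zf.stream()
          .filter(e -> !e.isDirectory())
          .map(ZipEntry::getName)
          .collect(Collectors.toSet());
    }
  }

  // True only when the field value names a file bundled in the archive,
  // so values like "1234234" are left alone.
  static boolean isBundledFile(String value, Set<String> entries) {
    if (value == null) {
      return false;
    }
    String name = value.trim();
    if (name.startsWith("./")) {
      name = name.substring(2);
    }
    return !name.isEmpty() && entries.contains(name);
  }
}
```

Reading the entry names once per archive and reusing the set for every record keeps the check cheap, although it only works where the original zip is still reachable during interpretation.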

MattBlissett commented 2 years ago

The suitable URL for GBIF will be https://source-archive.gbif.org/<datasetKey>/, for example https://source-archive.gbif.org/7ddf754f-d193-4cc9-b351-99906754a03b/logo.png or https://source-archive.gbif.org/044f96bc-3bf2-4a38-9f7c-8808ab48dbf1/image/StonyRise-TaxonImage-9_Plutellidae_Plutella_xylostella.jpg

So if we have a DWCA:

meta.xml
occurrence.csv
images/moth1.jpg
images/moth2.jpg

then during interpretation, an associatedMedia value of images/moth1.jpg, or the equivalent identifier in an image extension, can be interpreted to https://source-archive.gbif.org/<datasetKey>/images/moth1.jpg.
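
A minimal sketch of that mapping follows, assuming the relative path has already been confirmed to exist in the archive. The class and method names are hypothetical. The only thing taken from this thread is the https://source-archive.gbif.org/<datasetKey>/ layout.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.UUID;

class SourceArchiveUrl {

  static final String HOST = "source-archive.gbif.org";

  // Builds https://source-archive.gbif.org/<datasetKey>/<relativePath>.
  // The multi-argument URI constructor percent-encodes characters that are
  // illegal in a path, such as spaces.
  static String toSourceArchiveUrl(UUID datasetKey, String relativePath) {
    String rel = relativePath.trim();
    if (rel.startsWith("./")) {
      rel = rel.substring(2);
    }
    try {
      return new URI("https", HOST, "/" + datasetKey + "/" + rel, null).toString();
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("Cannot build URL for " + relativePath, e);
    }
  }
}
```

For the archive listed above, `toSourceArchiveUrl(datasetKey, "images/moth1.jpg")` would yield the URL in the previous sentence, with the dataset's UUID in place of `<datasetKey>`.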