gbif / gbif-common

Utility classes
Apache License 2.0

GNU sort using the fieldsEnclosedBy character #27

Closed marcos-lg closed 1 year ago

marcos-lg commented 2 years ago

We found a case of a dataset where the GNU sort is not working as expected because it's not taking the fieldsEnclosedBy character into account.

I'm explaining the issue first for context:

This is a sampling event dataset and it has 2 extensions: occurrence and measurementOrFact (it's in DEV only https://registry.gbif-dev.org/dataset/10fd6b56-99fd-49e1-863e-09480dfb67c9).

Most of the IDs of the dataset are like these:

"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418097:event"
"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418098:event"
...

and there are 2 that are:

2087
CPJGI0057476

The occurrences are linked only to the events with IDs like:

"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418097:event"
"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418098:event"
...

And the records in the measurementOrFact extension are only linked to the events:

2087
CPJGI0057476

When we read the archive and it is sorted with GNU sort, the "urn:catalog:... IDs always come first and the other 2 come last. I think that's because the sort takes the quotes into account. Then, when we parse the measurementOrFact extension in Java (the occurrence extension is parsed correctly), the records with the "urn:catalog:... IDs come first and find no match in the extension, which makes the extension iterator run to the end. By the time the records with the other IDs arrive, the iterator has no more values. So the measurementOrFact extension is always empty for every record when reading the archive.
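To make the ordering problem concrete, here is a minimal sketch, assuming the external sort compares raw bytes (as GNU sort does in the C locale; the locale actually used is an assumption here). The class and method names are hypothetical and are not part of gbif-common or dwca-io.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class QuoteSortDemo {
    // Strips one pair of enclosing double quotes, if present (simplified: no escape handling).
    static String unquote(String s) {
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    // Sorts the raw values, quotes included. For ASCII input, String.compareTo
    // gives the same order as a byte-wise comparison.
    static List<String> sortRaw(List<String> ids) {
        List<String> copy = new ArrayList<>(ids);
        Collections.sort(copy);
        return copy;
    }

    // Sorts the values after removing the enclosing quotes, as a parser would see them.
    static List<String> sortUnquoted(List<String> ids) {
        List<String> copy = new ArrayList<>();
        for (String id : ids) {
            copy.add(unquote(id));
        }
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList(
            "\"urn:catalog:NSW Dept of Planning, Industry and Environment:BioNet Atlas of NSW Wildlife:SPJGI4418097:event\"",
            "2087",
            "CPJGI0057476");
        // Raw order: the quoted urn id comes first because '"' (0x22) sorts before '2' and 'C'.
        System.out.println(sortRaw(ids));
        // Unquoted order: 2087, CPJGI0057476, then the urn id ('u' sorts after digits and capitals).
        System.out.println(sortUnquoted(ids));
    }
}
```

The two orders disagree, which is the mismatch between the sorted file and the IDs that the parser compares.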

In other words, all the extension records fall into this else-if branch, because the extension IDs (2087 and CPJGI0057476) are compared with the urn:catalog:... IDs first:

https://github.com/gbif/dwca-io/blob/7cd05e21ebbc0dece62c9e73be41e2e898959073/src/main/java/org/gbif/dwc/StarRecordIterator.java#L126

} else if (id.compareTo(extId) > 0) {
    // this extension id is smaller than the core id and should have been picked up by a core record already
    // seems to have no matching core record, so lets skip it
    it.next();
    extensionRecordsSkipped.put(rowType, extensionRecordsSkipped.get(rowType) + 1);
}
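The effect of that branch can be reproduced with a simplified sorted-merge join. This is only a sketch under stated assumptions: `MergeJoinSketch` and `join` are hypothetical names, the lists stand in for the core and extension iterators, and the real StarRecordIterator does more than this.

```java
import java.util.Arrays;
import java.util.List;

class MergeJoinSketch {
    // Walks the core ids in file order and consumes extension ids with a single forward cursor.
    // Returns {matched, skipped}.
    static int[] join(List<String> coreIds, List<String> extIds) {
        int matched = 0;
        int skipped = 0;
        int pos = 0; // cursor into the extension; it never moves backwards
        for (String id : coreIds) {
            while (pos < extIds.size()) {
                int cmp = id.compareTo(extIds.get(pos));
                if (cmp == 0) {
                    matched++; // extension row belongs to this core record
                    pos++;
                } else if (cmp > 0) {
                    skipped++; // extension id is smaller than the core id: skipped for good
                    pos++;
                } else {
                    break; // extension id is larger: keep it for a later core record
                }
            }
        }
        return new int[] {matched, skipped};
    }

    public static void main(String[] args) {
        List<String> ext = Arrays.asList("2087", "CPJGI0057476");
        // Core order produced by sorting with the quotes, then parsed (quotes removed).
        List<String> coreSortedWithQuotes = Arrays.asList("urn:catalog:x:event", "2087", "CPJGI0057476");
        // Core order produced by sorting the unquoted ids.
        List<String> coreSortedUnquoted = Arrays.asList("2087", "CPJGI0057476", "urn:catalog:x:event");
        System.out.println(Arrays.toString(join(coreSortedWithQuotes, ext)));
        System.out.println(Arrays.toString(join(coreSortedUnquoted, ext)));
    }
}
```

With the first core order, both extension ids are consumed while the urn id is being processed, so nothing is left for 2087 and CPJGI0057476. With the second order, both extension rows are matched.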

I tested it with the Java sort and it works as expected, since it compares the urn IDs as Strings that don't contain the quotes.

So, before considering other options, I was wondering whether it's possible to ignore the quotes (or whatever character is defined in fieldsEnclosedBy) when doing the GNU sort?

MattBlissett commented 2 years ago

I don't think there's any reasonable alternative but to avoid GNU sort when there is a fieldsEnclosedBy set.

If coreId fields may or may not be quoted within the same file, then GNU sort's result will be the correct sort (" is ASCII character 0x22), but the Java sort is configured to ignore the quotes using a LineComparator.
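As an illustration of the two options discussed here, the following sketch shows a comparator that ignores the enclosure character on the first column, plus a guard that rules out the external sort when an enclosure character is configured. It is not the LineComparator from gbif-common; every name in it is hypothetical, the id is assumed to be the first column, and escaped quotes inside a field are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class QuoteAwareIdSort {
    // Extracts the first column. If it starts with the enclosure character, reads up to the
    // closing one, so a delimiter inside the quotes stays part of the id.
    static String firstField(String line, char delimiter, Character enclosedBy) {
        if (enclosedBy != null && !line.isEmpty() && line.charAt(0) == enclosedBy.charValue()) {
            int close = line.indexOf(enclosedBy.charValue(), 1);
            if (close > 0) {
                return line.substring(1, close);
            }
        }
        int end = line.indexOf(delimiter);
        return end < 0 ? line : line.substring(0, end);
    }

    // Orders whole lines by their unquoted first column.
    static Comparator<String> byFirstField(char delimiter, Character enclosedBy) {
        return Comparator.comparing((String line) -> firstField(line, delimiter, enclosedBy));
    }

    // Hypothetical guard: only use a byte-wise external sort when no enclosure character is set.
    static boolean canUseExternalSort(Character enclosedBy) {
        return enclosedBy == null;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>(Arrays.asList(
            "\"urn:catalog:NSW Dept of Planning, Industry and Environment:x:event\",a",
            "2087,b",
            "CPJGI0057476,c"));
        lines.sort(byFirstField(',', '"'));
        lines.forEach(System.out::println);
    }
}
```

Note that the ids in this dataset contain a comma, so in a comma-delimited file the key cannot be found by splitting on the first delimiter alone; the enclosure character has to be honoured when extracting it.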