gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

Order of columns in generated DwC-A changes between versions #767

Closed kbraak closed 9 years ago

kbraak commented 9 years ago
What steps will reproduce the problem?
1. Create a resource from a DwC-A (attached: vascan-originall + meta-original.xml)
2. After all the mapping has been done, publish the resource.
3. The generated meta.xml file (meta-1.xml) is different* from the original one, which
is acceptable behaviour, as the mapping might be different from the original file.
4. Update the source file for the resource by uploading the same original DwC-A or
one with updated records (with the same meta.xml and structure, though).
5. Don't change the mapping
6. Publish the resource.
7. The generated meta.xml file (meta-2.xml) is different* from the original one AND
from the meta-1.xml from the previous version.

*Different = The order of the columns has changed. The order of the extensions has
changed (see attachments). Consequently, the data files have changed too (columns switched
around).

This is annoying for users who want to reuse the data in a spreadsheet. Even if only
the records have been updated (no new mapping), the new version will look totally different
from the previous one.

One reason why this behaviour might occur is that the IPT is probably not aware
that no new mapping was made, but it strikes me as odd that the order of the
columns (and extensions) is simply random. To avoid big changes in the order of
the columns, I would propose to always order the columns alphabetically (and not just
at random).

What version of the provider software are you using? The version should be
displayed in the footer of any page.
IPT version 2.0.2-r3299

Original issue reported on code.google.com by peter.desmet.cubc on 2011-08-25 20:13:32


kbraak commented 9 years ago
Thanks for the thorough report of this issue. We will take this under further discussion.

Original issue reported on code.google.com by kyle.braak on 2011-08-26 09:59:32

kbraak commented 9 years ago
Additionally, all meta.xml files change when the server restarts.

Original issue reported on code.google.com by htobon on 2011-08-26 12:39:46

kbraak commented 9 years ago
Just to be clear: ordering the columns/terms alphabetically =

0 = always the id
1-N = alphabetical

Since terms with default values will be included in the data files (issue 605), they
should be included in the alphabetical sorting of the columns/terms.

Original issue reported on code.google.com by peter.desmet.cubc on 2011-08-26 13:25:00
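
The ordering proposed above (id at index 0, all other terms alphabetical) can be illustrated with a minimal Java sketch. This is not IPT code: the class `ColumnOrderSketch`, the method `orderColumns`, and the sample term names are assumptions made only for illustration, and terms that carry default values are assumed to be passed in like any other term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the IPT code base.
class ColumnOrderSketch {

    // Returns the id column first, followed by the remaining terms in alphabetical order.
    static List<String> orderColumns(String idColumn, List<String> terms) {
        List<String> sorted = new ArrayList<>(terms);
        sorted.remove(idColumn);      // the id is placed explicitly at index 0
        Collections.sort(sorted);     // natural (case-sensitive) String ordering

        List<String> ordered = new ArrayList<>();
        ordered.add(idColumn);
        ordered.addAll(sorted);
        return ordered;
    }

    public static void main(String[] args) {
        // Sample terms are made up for this example.
        List<String> terms = List.of("scientificName", "kingdom", "taxonRank", "family");
        System.out.println(orderColumns("id", terms));
    }
}
```

Because the result depends only on the set of mapped terms, republishing with an unchanged mapping would give the same column order every time.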

kbraak commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by kyle.braak on 2011-11-04 10:06:39

kbraak commented 9 years ago
Any update on this issue?

Original issue reported on code.google.com by peter.desmet.cubc on 2012-02-14 19:34:40

kbraak commented 9 years ago
I strongly support Peter's quest to have this bug fixed!

I have to automate the loading of DwC-A files (coming from the IPT) into a database for
frequent website updates, and this is a real pain.

thanks!

Original issue reported on code.google.com by niconoe on 2012-04-05 14:25:44

kbraak commented 9 years ago
Thanks Nicolas! Priority=high or critical

Original issue reported on code.google.com by peter.desmet.cubc on 2012-04-05 15:49:53

kbraak commented 9 years ago
NLBIF would like to underline we also suffer from this changing column order problem
as our DwC-A records are harvested automatically for visualisation in Silver Biology
where this is causing trouble.

Original issue reported on code.google.com by c.h.j.hof@uva.nl on 2012-05-09 15:51:57

kbraak commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by kyle.braak on 2012-05-09 16:06:09

kbraak commented 9 years ago
We'll take a look at this and try to accommodate consistent ordering, but anyone relying
on consistent ordering should take note that they are building a *very* fragile system.
I would recommend you reconsider supporting proper DwC-A rather than taking shortcuts.
The DwC-A reader project exists to make your life very easy if you use Java.

Original issue reported on code.google.com by timrobertson100 on 2012-05-09 16:09:07
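
As an illustration of what not relying on column order can look like, the sketch below looks up each term's column index in meta.xml instead of assuming a fixed position. It uses only the JDK's XML parser and is not the DwC-A reader project's API; the class `MetaXmlSketch`, the method `coreFieldIndexes`, and the sample meta.xml content are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the IPT or the DwC-A reader project.
class MetaXmlSketch {

    // Maps each term URI declared in the <core> element to its column index.
    static Map<String, Integer> coreFieldIndexes(String metaXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(metaXml.getBytes(StandardCharsets.UTF_8)));
            Element core = (Element) doc.getElementsByTagName("core").item(0);
            NodeList fields = core.getElementsByTagName("field");
            Map<String, Integer> indexes = new HashMap<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                // Fields that only carry a default value have no index attribute; skip them.
                if (field.hasAttribute("index")) {
                    indexes.put(field.getAttribute("term"),
                                Integer.parseInt(field.getAttribute("index")));
                }
            }
            return indexes;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse meta.xml", e);
        }
    }

    public static void main(String[] args) {
        // Sample descriptor made up for this example.
        String metaXml =
            "<archive xmlns=\"http://rs.tdwg.org/dwc/text/\">"
          + "<core rowType=\"http://rs.tdwg.org/dwc/terms/Taxon\">"
          + "<files><location>taxon.txt</location></files>"
          + "<id index=\"0\"/>"
          + "<field index=\"2\" term=\"http://rs.tdwg.org/dwc/terms/kingdom\"/>"
          + "<field index=\"1\" term=\"http://rs.tdwg.org/dwc/terms/scientificName\"/>"
          + "</core></archive>";
        System.out.println(coreFieldIndexes(metaXml));
    }
}
```

A loader written this way keeps working when columns move between versions, because it asks meta.xml where each term lives before reading the data file.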

kbraak commented 9 years ago
It goes without saying, but please remember that code patches are also welcome.

Original issue reported on code.google.com by timrobertson100 on 2012-05-09 16:12:45

kbraak commented 9 years ago
I agree with Tim that no one should rely on the ordering.
That being said, I think if the class PropertyMapping implements Comparable and ExtensionMapping.fields
is a TreeSet<PropertyMapping>() instead of a HashSet<PropertyMapping>(), it would work.
This is just an idea; I didn't have the time to validate/test it.

Original issue reported on code.google.com by christiangendreau on 2012-05-09 18:24:13
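
The idea above can be sketched as follows. The `PropertyMapping` class here is a simplified stand-in that models only a term name; it is not the IPT's actual class, and `MappingOrderSketch` and `sortedFields` are hypothetical names. The point is only that a `TreeSet` iterates in `compareTo` order, whereas a `HashSet` makes no ordering guarantee.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not the IPT's actual classes.
class MappingOrderSketch {

    // Simplified stand-in for PropertyMapping: only the term name is modelled.
    static final class PropertyMapping implements Comparable<PropertyMapping> {
        final String term;

        PropertyMapping(String term) {
            this.term = term;
        }

        @Override
        public int compareTo(PropertyMapping other) {
            return term.compareTo(other.term);   // alphabetical by term name
        }

        // equals/hashCode are kept consistent with compareTo so both set types behave the same.
        @Override
        public boolean equals(Object o) {
            return o instanceof PropertyMapping && term.equals(((PropertyMapping) o).term);
        }

        @Override
        public int hashCode() {
            return term.hashCode();
        }

        @Override
        public String toString() {
            return term;
        }
    }

    // Copies the fields into a TreeSet, which always iterates in compareTo order.
    static Set<PropertyMapping> sortedFields(Set<PropertyMapping> fields) {
        return new TreeSet<>(fields);
    }

    public static void main(String[] args) {
        Set<PropertyMapping> fields = new HashSet<>();   // iteration order not guaranteed
        fields.add(new PropertyMapping("scientificName"));
        fields.add(new PropertyMapping("kingdom"));
        fields.add(new PropertyMapping("family"));
        System.out.println(sortedFields(fields));
    }
}
```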

kbraak commented 9 years ago
Yes. I remember that I made this fix in a local installation, but since no decision
had been taken at that time, I never committed the change. But the solution proposed by
Christian is correct. Cheers.

Original issue reported on code.google.com by htobon on 2012-05-09 18:50:49

kbraak commented 9 years ago
Tim's right when he says this approach is really fragile, but as it seems several people
would like to have it, let's circumvent the problem if an easy fix exists. 

Having a very simple format such as DwC-A actually encourages people to hack it, which
has its good and bad sides. DwC-A is nice, but not everybody has the will/freedom to
use the Java ecosystem (same for code contributions). Implementing a Python or Ruby
DwC-A reader class has been on my nice-to-do list for years, but I'm afraid it won't
happen soon (unless I find some spare time or a goldmine in my garden).

Original issue reported on code.google.com by niconoe on 2012-05-10 12:05:55

kbraak commented 9 years ago
At SIB Colombia we are interested in having this issue fixed. Some researchers want
to download and merge different datasets from the IPT instead of downloading from the data portal,
since the IPT contains all fields. A consistent column order will make merging DwC-As easier.

We agree that Christian's suggestion and Hector's implementation should be applied.

Original issue reported on code.google.com by daniel.amariles88 on 2013-01-16 15:11:59

kbraak commented 9 years ago
I hope to include this in the next release (2.0.5). If anybody plans to work on a patch,
please update the issue saying that you are working on it. Thanks

Original issue reported on code.google.com by kyle.braak on 2013-03-19 09:36:20

kbraak commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by kyle.braak on 2013-03-19 16:25:55

kbraak commented 9 years ago
Changes committed in r4292. Review of changes currently in progress.

Original issue reported on code.google.com by kyle.braak on 2013-04-29 14:21:08

kbraak commented 9 years ago
This issue was updated by revision r4320.

Original issue reported on code.google.com by kyle.braak on 2013-05-07 17:01:50

kbraak commented 9 years ago
This issue was updated by revision r4326.

Original issue reported on code.google.com by kyle.braak on 2013-05-08 16:11:27

kbraak commented 9 years ago
Verified working. Fix will be included in 2.0.5 release.

Original issue reported on code.google.com by kyle.braak on 2013-05-15 12:43:01

kbraak commented 9 years ago
This issue was updated by revision r4381.

Original issue reported on code.google.com by kyle.braak on 2013-05-15 14:46:43

kbraak commented 9 years ago
This issue was updated by revision r4385.

Original issue reported on code.google.com by kyle.braak on 2013-05-16 07:15:16