Harvesters: Naming convention - Githubissues

NextGeoss / ckanext-nextgeossharvest

Harvesters for NextGEOSS

GNU Affero General Public License v3.0

5 stars 1 forks source link

Harvesters: Naming convention #179

Open JGulic opened 4 years ago

JGulic commented 4 years ago

We need to agree on a naming convention for all the harvesters.

This issues is coming from:

different harvesters we have the same name for the metadata field, saved in the database differently Example: Senitnel harvesters have InstrumentName EBAS has Instrument name

in the same harvester we have fields following different naming. For example:

{
"key": "Data definition",
"value": "EBAS_1.1"
},
{
"key": "Originator",
"value": "Bartolome, Maria Lopez, , , , , , , , , "
},
{
"key": "PointOfContact",
"value": "Maria Lopez Bartolome, María Lopez Bartolome, Maria Lopez Bartolome"
},
{
"key": "Instrument name",
"value": "ES0009R_TIS-001"
},
{
"key": "topicCategory",
"value": "climatologyMeteorologyAtmosphere"
},
{
"key": "Component",
"value": "sulphur_dioxide"
},
{
"key": "Station altitude",
"value": "1360.0 m"
},

So here we have topicCategory then Station altitudeand somewhere in the metadata we'll have timerange_start

All of this affects the searching capabilities of the Datahub and Opensearch.

Can we agree on a naming convention for the fields so we are sure that we are going to give the proper results to the user?

joaandrade commented 4 years ago

Hi @JGulic Regarding the Sentinel harvester, we used the names suggested by TDUE during the meetings we had with them. Other harvesters (mainly external projects), what we did was to agree with the Data Providers (e.g. NILU for EBAS Harvester) that we should keep the original field names. Tha's why we have "Instrument name" and "Session Altitude".

My suggestion is, even keeping the original names, for some fields we already have in other harvesters, we should use the same we already have. This means to update EBAS to use InstrumentName (similar to sentinel).

Also we can follow the rule of, even keeping the original metadata field names, try to avoid for example spaces. This means that "Station altitude" would be "StationAltitude".

João

georgiana-b commented 4 years ago

@joaandrade I really think that from a user perspective it's much easier to use our interface and API if we adopt a consistent naming convention rather than have our users guess the formatting of individual fields for every provider.

Why don't we just pick one format (for example camelCase or snake_case and save the original names but formatted to respect the standard.

For example: Instrument name would become instrument_ name topicCategory would become topic_category PointOfContact would become point_of_contact and so on.

Once we have that it will be much easier for people who use our CKAN API or OpenSearch API to know which variables they can filter by. It will also be easier for us to display them in the UI because if we know what format we can expect we can format all the keys in a generic way.

There is a Python library that allows you to convert all sorts of formats to snake case: https://pypi.org/project/stringcase/

georgiana-b commented 4 years ago

To illustrate what I mean, CKAN API allows us to make a faceted search i.e. to filter by the value of certain variables in extras. For example to filter by OrbitDirection we can use this:

https://catalogue.nextgeoss.eu/api/3/action/package_search?fq=OrbitDirection:DESCENDING

Now imagine, if that variable is named Orbit Direction in another harvester, those results will not appear because they wouldn't match the query even though the data meets the criteria.

joaandrade commented 4 years ago

I agree with you suggestion. My concerning is the sentinel metadata. Pilots implemented the query mechanisms using the names implemented before: https://confluence.elecnor-deimos.com/display/NEXTGEOSS/OpenSearch+queryables+per+Pilot+requirements

georgiana-b commented 4 years ago

@joaandrade From what I can see in that document they actually expect to have it snake case (orbit_direction)?

JGulic commented 4 years ago

@joaandrade @georgiana-b the query parameters available for the users for OpenSearch can be different in comparison to the metadata name that we have for that filed in the database. Like @georgiana-b pointed in her comment above, OrbitDirection is orbit_direction for the query in OpenSearch. So the names of the metadata fields in the database can be different to the ones we serve for the client in the OpenSearch. The OpenSearch extension takes care to search in the corespondent metadata filed in the database, We are doing the same thing on the Datahub UI, the table with additional information for the dataset.

georgiana-b commented 4 years ago

@joaandrade Jovanka makes a good point that the variable names in OpenSearch are those required by the standards. You can see here the variables available for Sentinel: https://catalogue.nextgeoss.eu/opensearch/description.xml?osdd=SENTINEL1_L1_SLC

So, in OpenSearch the format is orbit_direction and we take care in our code to follow the format for variables. So for the users of our OpenSearch API it wouldn't make any difference. It would be an improvement only for the users of our CKAN API and it would make our code internally much cleaner.

davidcordeiro-deimos commented 4 years ago

@JGulic @georgiana-b Just to summarize the conclusions of this discussion. Does this issue translate into any actions towards our current goal and objectives? If so, can you link this issue into other open issues and /or create new ones?

georgiana-b commented 4 years ago

@davidcordeiro-deimos After discussing about this with @JGulic and analysing it further, we think that adopting a unified formatting standard for all variables would make a significant improvement both to our interface and to our API. You can see that having different formats looks messy in the UI too, affecting negatively the user experience:

So we propose adopting a snake case format, as I described above, for all extras. The name will remain exactly the same as it is at the source but the format will be changed to comply with the snake_case standard.

This doesn't need to be a change specific to each harvester. I think if you add a method in the base harvester and write some unit tests for it to make sure it formats properly all the common cases, it would be perfectly fine to just call that method in all the harvesters afterwards. https://pypi.org/project/stringcase/ library does a really good job of converting different formats to snake case:

The fields will still be displayed in a human friendly way in the U. If we know we can expect the fields to always be in snake case we can create a generic function on frontend to prettify them.

Our API would also be so much more coherent and user friendly with this change.

In the OpenSearch API variable names would remain exactly as they are now (because they need to follow the OS standard) but the logic we use in the background to achieve that would be much cleaner.

Let me know what you think about this and whether I can assist you in any way.

joaandrade commented 4 years ago

Implemented on the pull request #204