ES Download Formats - Githubissues

burkmarr commented 4 years ago

Provide options for different download formats for Elasticsearch. This issue has been split off from https://github.com/BiologicalRecordsCentre/iRecord/issues/800.

burkmarr commented 4 years ago

A 'backward-compatibility' format is added to rest.php that emulates, as closely as possible, the CSV download format supplied by easy_download_2. A download control can be configured to use this format (as opposed to the default 'standard' format) if this option is set in the in the iForm user interface: @columnsTemplate=easy-download for the download control.

I have added functionality to the ES download control to allow the iForm created to specify a JSON array of values instead of a single value to the columnsTemplate attribute of a download control, e.g. @columnsTemplate=["default","easy-download"] to allow a user to select the download format for a download button.

There are three current pull requests associated with this:

Warehouse: https://github.com/Indicia-Team/warehouse/pull/351
Client_helpers: https://github.com/Indicia-Team/client_helpers/pull/106
Media: https://github.com/Indicia-Team/media/pull/33

(@johnvanbreda - the Warehouse pull-request has failed a unit test. I'll take a look at that tomorrow, but if it's obvious to you why, do let me know.)

@kitenetter @johnvanbreda I have been unable to exactly replicate the non-ES download format exactly. We may need to tinker with the content of the Index if it is a serious problem. Accommodating other download formats, e.g. MapMate & NBN. Given the overhead involved in changing the ES index, we should perhaps evalutate the shortfall of them together?

@kitenetter - The problems in replicating the easy_download_2 format are documented inline - see comments on the fields in the easyDownloadEsCsvTemplate array here: https://github.com/Indicia-Team/warehouse/pull/351/files. I think that the best way for you to progress this is the look at some real world download data. So I think that we should proceed and get the code published so that we can provide a download page where you can download real data and compare the outputs.

johnvanbreda commented 4 years ago

@burkmarr the failed pull request is nothing to worry about - I recently added PHP 7.4 testing to Travis and it fails at the moment, due to a different build environment rather than any code I think. Other versions worked OK - https://travis-ci.org/github/Indicia-Team/warehouse.

Regarding addition of fields, I'd support adding the date type plus info about the originally entered sref to the fields in the index.

burkmarr commented 4 years ago

@johnvanbreda - thanks. I've just been looking at the pull request so some of our comments have crossed! Regarding updating the index - I know that it takes a long time to change. Do we need to take a holistic view after exploring the requirements for MapMate and NBN first before making changes?

BirenRathod commented 4 years ago

@johnvanbreda Thanks for start looking into PHP 7.4. I'm thinking yesterday to ask you about this. I know this takes while to make changes of indicia's code to comply with PHP 7.4 but let me know when ever you finished, so I will start rolling out 7.4 on all our BRC servers.

johnvanbreda commented 4 years ago

@BirenRathod probably best to raise an issue about PHP 7.4 support so that David and Martin can prioritise it then.

BirenRathod commented 4 years ago

@johnvanbreda, I will do. thanks.

johnvanbreda commented 4 years ago

@burkmarr

Do we need to take a holistic view after exploring the requirements for MapMate and NBN first before making changes?

Yes, I think that makes sense.

burkmarr commented 4 years ago

@kitenetter @johnvanbreda. Working on the 'backward compatibility format' reminded me that I'd previously written a MapMate exporter for Gilbert 21 so I revisited that and looked into it - this is what I've found.

The MapMate export format seems to be the same now as it was when I wrote the G21 exporter - https://www.mapmate.co.uk/guide/page19.htm#Data%20Import%20from%20Tab-Delimited%20Text%20files.

There are 13 mandatory columns that must be present with the correct names in the correct order: Taxon, Site, Gridref, VC, Recorder, Determiner, Date, Quantity, Method, Sex, Stage, Status, Comment

Here is a potential mapping of these fields to ES index fields:

"mapmate" => [
    ['caption' => 'Taxon', 'field' => 'taxon.accepted_name'],
    ['caption' => 'Site', 'field' => 'location.verbatim_locality']
    ['caption' => 'Gridref', 'field' => 'location.output_sref'],
    ['caption' => 'VC', 'field' => '#higher_geography:Vice County:code#'],
    ['caption' => 'Recorder', 'field' => 'event.recorded_by'],
    ['caption' => 'Determiner', 'field' => 'identification.identified_by'],
    ['caption' => 'Date', 'field' => '#event_date#'],
    ['caption' => 'Quantity', 'field' => 'occurrence.organism_quantity'],
    ['caption' => 'Method', 'field' => 'event.sampling_protocol'],
    ['caption' => 'Sex', 'field' => 'occurrence.sex'],
    ['caption' => 'Stage', 'field' => 'occurrence.life_stage'],
    ['caption' => 'Status', 'field' => ''],
    ['caption' => 'Comment', 'field' => 'occurrence.occurrence_remarks'],
]

The MapMate docs says this of the 'status' field (https://www.mapmate.co.uk/guide/page10.htm):

"[...] the status of the individual specimen being recorded rather than the rarity, or conservation status, of the Taxon [...] e.g. a "rare", or "endangered", plant may be "native" at one location but "introduced" at another".

As far as I know, most iRecord records will not have an easy mapping for this, but you may know better. If we leave the contents of any field empty, the Indicia code needs to handle the empty mapping.

Permitted MapMate dates for import must be in one of these two formats: dd/mm/yyyy (single) or dd/mm/yyyy-dd/mm/yyyy (range). That's close to the format produced by our current #event_date# formatter except the range separator for that is ' to ' so we would need to made that configurable with a parameter.

The file must be tab separated - not comma separated - so we would have to accommodate that.

johnvanbreda commented 4 years ago

That all sounds quite achievable, thanks @burkmarr. We don't have an obvious replacement for the Status value so would have to allow empty fields.

The event date might benefit from a format parameter plus a separator parameter?

burkmarr commented 4 years ago

Thanks @johnvanbreda. I'll wait for Martin to take a look and if he's happy, I'll create a branch and implement.

johnvanbreda commented 4 years ago

@burkmarr just to let you know the develop branch now has code to include the input spatial reference and system which were previously missing from the ES index.

burkmarr commented 4 years ago

Thanks @johnvanbreda .

kitenetter commented 4 years ago

See also discussion about providing links to photos under #257

kitenetter commented 4 years ago

Awaiting input from @kitenetter (will prioritise this as soon as the W drive is available again).

kitenetter commented 4 years ago

Here are some suggestions re the MapMate-type format.

iRecord_MapMate format notes.docx

From the comments above we had an issue around allowing a blank field for "Record status", but note that this attribute can be set to "Not recorded" so a blank field is not a requirement.

We now have some plans for the backwards-compatible format, and the simplified (MapMate) format. The other thing for me to review is whether we want to add an enhanced format, which I'll add as soon as possible, in case it has any implications for additions to the ES indexing.

burkmarr commented 4 years ago

Thanks for the notes @kitenetter - I will take a look at this and see what I can come up with.

burkmarr commented 4 years ago

@kitenetter - this is subject to a pull request (https://github.com/Indicia-Team/warehouse/pull/355/files) which I've aked @johnvanbreda to review. The same pull request deals with the backward-compatibility format from issue #800. I suggest we move subsequent discussion about the backward-compatibility format to this issue (#886).

burkmarr commented 4 years ago

@kitenetter - John merged into develop branch yesterday and I've pulled this to the dev warehouse and updated this page so that you can download in backward-compatible and mapmate formats: http://test.brc.ac.uk/irecord_dev/es-download-dev. This will go into the live Warehouse when the develop branch is next merged into master and released to the main Warehouse. At that point I can look at providing a similar general ES download page on live iRecord.

kitenetter commented 3 years ago

Feedback and suggestions from testing the dev site download page is now here: https://docs.google.com/document/d/1Zyh2B6F_PskirbzT49zVFldWGA0QN_e_dp8d9pf-D8U/edit?usp=sharing

@burkmarr if you have time to progress this that would be brilliant. The MapMate download format is needed for the moth recording network, who have their annual conference on 30 Jan, and it would be great if we could make this available by then.

But there are still some issues to resolve, see above link.

burkmarr commented 3 years ago

Commented on document. Will start work on the items that are clear-cut.

kitenetter commented 3 years ago

Thanks @burkmarr - I think I have responded to your queries but let me know if I've missed anything

kitenetter commented 3 years ago

@burkmarr I carried out another test export/import process.

15571 moth records were exported from iRecord
I did no editing other than deleting the non-MapMate columns
15357 records imported successfully

Most of the rejections were probably taxonomic issues (MapMate doesn't recognise UKSI genera, and some subspecies and forms are mismatches - nothing we can do about that), plus a couple of stage terms being rejected (which again we've agreed we can't resolve on our side).

There are still some problems with fields being too long, and very frustratingly it appears that although MapMate SAYS it accepts up to 64 characters, it actually only accepts 62 characters for those fields that are limited.

So the remaining actions are:

For each attribute that is currently being truncated to 64 characters we need to truncate to 62
The "Habitat" attribute is not needed and should be dropped to further simplify the output

kitenetter commented 3 years ago

Sorry @burkmarr there is one more request, which is to add an additional column for record source (i.e. iRecord app, iRecord Moths, National Moth Night etc.). This can be the same attribute as is used for the standard download (it's just to enable users to filter sets of records out if they need to).

burkmarr commented 3 years ago

@kitenetter - I've done those three things now.

Also with help from @johnvanbreda I was able to access the record ID for the comment field too. At the point where it is appended to the comment, I can't access the ref that has the warehouse prefix, e.g. brc1|15055739 - I can only get 15055739. To be consistent with the RecordKey field, I've hard-coded the addition of the 'brc1|' prefix for now. That will need addressing if we add another warehouse to the ES index.

I've updated dev warehouse so you can test again if you like. I will also update the merge request on GitHub.

kitenetter commented 3 years ago

That all sounds great @burkmarr except that my understanding is that we intend to continue using "iBRC" (without a pipe) as the prefix for warehouse 1 (for backwards compatability) rather then "brc1|" - see #938

Does that mean an adjustment to the hard coding?

burkmarr commented 3 years ago

Hi Martin. I got that from the development ES index which still has the old prefix. I will change.

burkmarr commented 3 years ago

At John's suggestion, the prefix is now set as a configuration option.

burkmarr commented 3 years ago

@kitenetter - download page is implemented on live: https://www.brc.ac.uk/irecord/record-download

I've done one or two tests and all seems to be well. It's ready to detect filters that start with the wording 'LERC' and indicate a link to the page you specified.

kitenetter commented 3 years ago

@burkmarr it appears that verifiers have access to "All records" in the "Records to access" filter - this shouldn't be the case, they should only have "My records" plus their verification role/s and any other specific roles they have been allocated.

I can't think of anyone who ought to have access to an "All records" download (and if they do it ought to be via a specific filter that we assign to them).

I have set the page to unpublished for now, just to avoid any risk.

kitenetter commented 3 years ago

Apologies, but I've just thought of something else we should add to the columns that we output for the Simple format. This is the vernacular name. It is not needed for MapMate use, but will be wanted by people who prefer the simple format for other uses.

I suggest we add it between the columns for "RecordKey" and "Source".

Other non-essential nice-to-have requests:

Move the "Verified by" column so that it follows the "Query" column

kitenetter commented 3 years ago

Issues and requests for the Standard download format - only the first is urgent:

The "Standard download format" is still not populating the External key attribute - is this attribute missing from the ES version of the data, or is it a problem with the download report?
Column "Species" should be renamed "Taxon"
We have two columns called "Projection": I think the first of these is the original projection and the other is the output projection, and if so can they be renamed accordingly
Add column "Rank" (as included in the Simple format), to follow after "Source" and before "Taxon"

kitenetter commented 3 years ago

@burkmarr more apologies from me - I've managed to break the text that pops up to link to the LERC terms. As far as I know all I did was to go into the User Interface and add target="_blank" to the html link for the LERC terms, in order to force it opening into a new tab. Having done that and saved the page, the terms link is no longer appearing when I select an LERC download.

burkmarr commented 3 years ago

@kitenetter - this was because the whole HTML string is enclosed in double quote - so any quotes inside it, e.g. around the target attribute value, have to be single quotes. It's working now.

@burkmarr more apologies from me - I've managed to break the text that pops up to link to the LERC terms. As far as I know all I did was to go into the User Interface and add target="_blank" to the html link for the LERC terms, in order to force it opening into a new tab. Having done that and saved the page, the terms link is no longer appearing when I select an LERC download.

burkmarr commented 3 years ago

@burkmarr it appears that verifiers have access to "All records" in the "Records to access" filter - this shouldn't be the case, they should only have "My records" plus their verification role/s and any other specific roles they have been allocated.

I can't think of anyone who ought to have access to an "All records" download (and if they do it ought to be via a specific filter that we assign to them).

Access to that feature is controled on the form (see below). To deny access to all users, I've removed any permission setting as shown in the image.

burkmarr commented 3 years ago

Apologies, but I've just thought of something else we should add to the columns that we output for the Simple format. This is the vernacular name. It is not needed for MapMate use, but will be wanted by people who prefer the simple format for other uses.

Added, for now, using the 'addColumns' configuration of the download control. This means it will be added to both formats.

I suggest we add it between the columns for "RecordKey" and "Source".

Other non-essential nice-to-have requests:

Move the "Verified by" column so that it follows the "Query" column

Adding in as part of the 'simple' format will require and update to the rest.api and all the usual leadtime required to get that on the live site unless it's urgent enough to require a 'hotfix' which we would need to talk to John about.

kitenetter commented 3 years ago

Thanks @burkmarr - that means that the Standard download currently has the vernacular name duplicated in two columns, but that is preferable to it not being included in the Simple download.

That leavs us with one priority issue (how to populate the external key in the Standard download) and a few cosmetic issues to do with renaming and re-ordering columns, which can be done at a later stage.

burkmarr commented 3 years ago

That leavs us with one priority issue (how to populate the external key in the Standard download)

I've had a look at this and it's due to a typo in the format definition in the REST API (occurrence_external_key instead of occurrence.external_key). So that will need an update to the API. Is this urgent enough to require a hotfix? A workaround would be to add 'occurrence.external_key' to the 'addColumns' definition on the form. But this would add to all formats - adding a second column of the same name to the standard format.

kitenetter commented 3 years ago

I would prefer not to make the new downloads public until this is fixed, because I do regard the ability to make use of external keys to be an important feature of iRecord, and I would be worried if people were downloading versions of the records that did not include them. It's probably only a minority of our users that actually make use of the external keys, but for those who do it is important to maintain the integrity of this data.

So I think this is urgent, and I think the hotfix is preferable if it is not too enormous a piece of work.

I'm reluctant to add another column to the Simple download, but if the hotfix is out of reach before the middle of next week then we will have to use the workaround.

burkmarr commented 3 years ago

Correction on my part - the field is occurrence.source_system_key but it still requires fixing in rest.api. Very simple changes for me to make to the code to apply this and the other changes discussed, but I'm not sure what's involved in doing a hotfix. @johnvanbreda - can you comment on that?

johnvanbreda commented 3 years ago

@burkmarr I'm currently applying hotfixes relating to the activity summary work so I'll roll this into one of those.

burkmarr commented 3 years ago

Thanks @johnvanbreda - I've done the changes but I should test on develop warehouse first. I have a short meeting now but will be able to do that in about 30 mins.

johnvanbreda commented 3 years ago

The change to the REST API is tiny, and replaces something obviously incorrect, so I don't think there is any risk to deploy that? Will test a download locally first.

burkmarr commented 3 years ago

Okay - I've pushed to branch download-220121.

johnvanbreda commented 3 years ago

The hotfix has now been deployed @kitenetter @burkmarr

burkmarr commented 3 years ago

Many thanks @johnvanbreda. @kitenetter - I've removed the addColumns field from the form config.

kitenetter commented 3 years ago

Thanks @burkmarr @johnvanbreda that all looks good and I'll make it available to people on the live site this afternoon.

kitenetter commented 3 years ago

Thanks everyone, I really hope that this goes down well with our users, I think it is a big step forwards.

Closing.

DavidRoy commented 3 years ago

Thanks for everyone's hared work on this. A real team effort.

BiologicalRecordsCentre / iRecord

ES Download Formats #886