Closed burkmarr closed 3 years ago
A 'backward-compatibility' format is added to rest.php that emulates, as closely as possible, the CSV download format supplied by easy_download_2. A download control can be configured to use this format (as opposed to the default 'standard' format) if this option is set in the in the iForm user interface: @columnsTemplate=easy-download
for the download control.
I have added functionality to the ES download control to allow the iForm created to specify a JSON array of values instead of a single value to the columnsTemplate attribute of a download control, e.g. @columnsTemplate=["default","easy-download"]
to allow a user to select the download format for a download button.
There are three current pull requests associated with this:
(@johnvanbreda - the Warehouse pull-request has failed a unit test. I'll take a look at that tomorrow, but if it's obvious to you why, do let me know.)
@kitenetter @johnvanbreda I have been unable to exactly replicate the non-ES download format exactly. We may need to tinker with the content of the Index if it is a serious problem. Accommodating other download formats, e.g. MapMate & NBN. Given the overhead involved in changing the ES index, we should perhaps evalutate the shortfall of them together?
@kitenetter - The problems in replicating the easy_download_2 format are documented inline - see comments on the fields in the easyDownloadEsCsvTemplate array here: https://github.com/Indicia-Team/warehouse/pull/351/files. I think that the best way for you to progress this is the look at some real world download data. So I think that we should proceed and get the code published so that we can provide a download page where you can download real data and compare the outputs.
@burkmarr the failed pull request is nothing to worry about - I recently added PHP 7.4 testing to Travis and it fails at the moment, due to a different build environment rather than any code I think. Other versions worked OK - https://travis-ci.org/github/Indicia-Team/warehouse.
Regarding addition of fields, I'd support adding the date type plus info about the originally entered sref to the fields in the index.
@johnvanbreda - thanks. I've just been looking at the pull request so some of our comments have crossed! Regarding updating the index - I know that it takes a long time to change. Do we need to take a holistic view after exploring the requirements for MapMate and NBN first before making changes?
@johnvanbreda Thanks for start looking into PHP 7.4. I'm thinking yesterday to ask you about this. I know this takes while to make changes of indicia's code to comply with PHP 7.4 but let me know when ever you finished, so I will start rolling out 7.4 on all our BRC servers.
@BirenRathod probably best to raise an issue about PHP 7.4 support so that David and Martin can prioritise it then.
@johnvanbreda, I will do. thanks.
@burkmarr
Do we need to take a holistic view after exploring the requirements for MapMate and NBN first before making changes?
Yes, I think that makes sense.
@kitenetter @johnvanbreda. Working on the 'backward compatibility format' reminded me that I'd previously written a MapMate exporter for Gilbert 21 so I revisited that and looked into it - this is what I've found.
The MapMate export format seems to be the same now as it was when I wrote the G21 exporter - https://www.mapmate.co.uk/guide/page19.htm#Data%20Import%20from%20Tab-Delimited%20Text%20files.
There are 13 mandatory columns that must be present with the correct names in the correct order: Taxon, Site, Gridref, VC, Recorder, Determiner, Date, Quantity, Method, Sex, Stage, Status, Comment
Here is a potential mapping of these fields to ES index fields:
"mapmate" => [
['caption' => 'Taxon', 'field' => 'taxon.accepted_name'],
['caption' => 'Site', 'field' => 'location.verbatim_locality']
['caption' => 'Gridref', 'field' => 'location.output_sref'],
['caption' => 'VC', 'field' => '#higher_geography:Vice County:code#'],
['caption' => 'Recorder', 'field' => 'event.recorded_by'],
['caption' => 'Determiner', 'field' => 'identification.identified_by'],
['caption' => 'Date', 'field' => '#event_date#'],
['caption' => 'Quantity', 'field' => 'occurrence.organism_quantity'],
['caption' => 'Method', 'field' => 'event.sampling_protocol'],
['caption' => 'Sex', 'field' => 'occurrence.sex'],
['caption' => 'Stage', 'field' => 'occurrence.life_stage'],
['caption' => 'Status', 'field' => ''],
['caption' => 'Comment', 'field' => 'occurrence.occurrence_remarks'],
]
The MapMate docs says this of the 'status' field (https://www.mapmate.co.uk/guide/page10.htm):
"[...] the status of the individual specimen being recorded rather than the rarity, or conservation status, of the Taxon [...] e.g. a "rare", or "endangered", plant may be "native" at one location but "introduced" at another".
As far as I know, most iRecord records will not have an easy mapping for this, but you may know better. If we leave the contents of any field empty, the Indicia code needs to handle the empty mapping.
Permitted MapMate dates for import must be in one of these two formats: dd/mm/yyyy (single) or dd/mm/yyyy-dd/mm/yyyy (range). That's close to the format produced by our current #event_date# formatter except the range separator for that is ' to ' so we would need to made that configurable with a parameter.
The file must be tab separated - not comma separated - so we would have to accommodate that.
That all sounds quite achievable, thanks @burkmarr. We don't have an obvious replacement for the Status value so would have to allow empty fields.
The event date might benefit from a format parameter plus a separator parameter?
Thanks @johnvanbreda. I'll wait for Martin to take a look and if he's happy, I'll create a branch and implement.
@burkmarr just to let you know the develop branch now has code to include the input spatial reference and system which were previously missing from the ES index.
Thanks @johnvanbreda .
See also discussion about providing links to photos under #257
Awaiting input from @kitenetter (will prioritise this as soon as the W drive is available again).
Here are some suggestions re the MapMate-type format.
iRecord_MapMate format notes.docx
From the comments above we had an issue around allowing a blank field for "Record status", but note that this attribute can be set to "Not recorded" so a blank field is not a requirement.
We now have some plans for the backwards-compatible format, and the simplified (MapMate) format. The other thing for me to review is whether we want to add an enhanced format, which I'll add as soon as possible, in case it has any implications for additions to the ES indexing.
Thanks for the notes @kitenetter - I will take a look at this and see what I can come up with.
@kitenetter - this is subject to a pull request (https://github.com/Indicia-Team/warehouse/pull/355/files) which I've aked @johnvanbreda to review. The same pull request deals with the backward-compatibility format from issue #800. I suggest we move subsequent discussion about the backward-compatibility format to this issue (#886).
@kitenetter - John merged into develop branch yesterday and I've pulled this to the dev warehouse and updated this page so that you can download in backward-compatible and mapmate formats: http://test.brc.ac.uk/irecord_dev/es-download-dev. This will go into the live Warehouse when the develop branch is next merged into master and released to the main Warehouse. At that point I can look at providing a similar general ES download page on live iRecord.
Feedback and suggestions from testing the dev site download page is now here: https://docs.google.com/document/d/1Zyh2B6F_PskirbzT49zVFldWGA0QN_e_dp8d9pf-D8U/edit?usp=sharing
@burkmarr if you have time to progress this that would be brilliant. The MapMate download format is needed for the moth recording network, who have their annual conference on 30 Jan, and it would be great if we could make this available by then.
But there are still some issues to resolve, see above link.
Commented on document. Will start work on the items that are clear-cut.
Thanks @burkmarr - I think I have responded to your queries but let me know if I've missed anything
@burkmarr I carried out another test export/import process.
Most of the rejections were probably taxonomic issues (MapMate doesn't recognise UKSI genera, and some subspecies and forms are mismatches - nothing we can do about that), plus a couple of stage terms being rejected (which again we've agreed we can't resolve on our side).
There are still some problems with fields being too long, and very frustratingly it appears that although MapMate SAYS it accepts up to 64 characters, it actually only accepts 62 characters for those fields that are limited.
So the remaining actions are:
Sorry @burkmarr there is one more request, which is to add an additional column for record source (i.e. iRecord app, iRecord Moths, National Moth Night etc.). This can be the same attribute as is used for the standard download (it's just to enable users to filter sets of records out if they need to).
@kitenetter - I've done those three things now.
Also with help from @johnvanbreda I was able to access the record ID for the comment field too. At the point where it is appended to the comment, I can't access the ref that has the warehouse prefix, e.g. brc1|15055739 - I can only get 15055739. To be consistent with the RecordKey field, I've hard-coded the addition of the 'brc1|' prefix for now. That will need addressing if we add another warehouse to the ES index.
I've updated dev warehouse so you can test again if you like. I will also update the merge request on GitHub.
That all sounds great @burkmarr except that my understanding is that we intend to continue using "iBRC" (without a pipe) as the prefix for warehouse 1 (for backwards compatability) rather then "brc1|" - see #938
Does that mean an adjustment to the hard coding?
Hi Martin. I got that from the development ES index which still has the old prefix. I will change.
At John's suggestion, the prefix is now set as a configuration option.
@kitenetter - download page is implemented on live: https://www.brc.ac.uk/irecord/record-download
I've done one or two tests and all seems to be well. It's ready to detect filters that start with the wording 'LERC' and indicate a link to the page you specified.
@burkmarr it appears that verifiers have access to "All records" in the "Records to access" filter - this shouldn't be the case, they should only have "My records" plus their verification role/s and any other specific roles they have been allocated.
I can't think of anyone who ought to have access to an "All records" download (and if they do it ought to be via a specific filter that we assign to them).
I have set the page to unpublished for now, just to avoid any risk.
Apologies, but I've just thought of something else we should add to the columns that we output for the Simple format. This is the vernacular name. It is not needed for MapMate use, but will be wanted by people who prefer the simple format for other uses.
I suggest we add it between the columns for "RecordKey" and "Source".
Other non-essential nice-to-have requests:
Issues and requests for the Standard download format - only the first is urgent:
@burkmarr more apologies from me - I've managed to break the text that pops up to link to the LERC terms. As far as I know all I did was to go into the User Interface and add target="_blank"
to the html link for the LERC terms, in order to force it opening into a new tab. Having done that and saved the page, the terms link is no longer appearing when I select an LERC download.
@kitenetter - this was because the whole HTML string is enclosed in double quote - so any quotes inside it, e.g. around the target attribute value, have to be single quotes. It's working now.
@burkmarr more apologies from me - I've managed to break the text that pops up to link to the LERC terms. As far as I know all I did was to go into the User Interface and add
target="_blank"
to the html link for the LERC terms, in order to force it opening into a new tab. Having done that and saved the page, the terms link is no longer appearing when I select an LERC download.
@burkmarr it appears that verifiers have access to "All records" in the "Records to access" filter - this shouldn't be the case, they should only have "My records" plus their verification role/s and any other specific roles they have been allocated.
I can't think of anyone who ought to have access to an "All records" download (and if they do it ought to be via a specific filter that we assign to them).
Access to that feature is controled on the form (see below). To deny access to all users, I've removed any permission setting as shown in the image.
Apologies, but I've just thought of something else we should add to the columns that we output for the Simple format. This is the vernacular name. It is not needed for MapMate use, but will be wanted by people who prefer the simple format for other uses.
Added, for now, using the 'addColumns' configuration of the download control. This means it will be added to both formats.
I suggest we add it between the columns for "RecordKey" and "Source".
Other non-essential nice-to-have requests:
- Move the "Verified by" column so that it follows the "Query" column
Adding in as part of the 'simple' format will require and update to the rest.api and all the usual leadtime required to get that on the live site unless it's urgent enough to require a 'hotfix' which we would need to talk to John about.
Thanks @burkmarr - that means that the Standard download currently has the vernacular name duplicated in two columns, but that is preferable to it not being included in the Simple download.
That leavs us with one priority issue (how to populate the external key in the Standard download) and a few cosmetic issues to do with renaming and re-ordering columns, which can be done at a later stage.
That leavs us with one priority issue (how to populate the external key in the Standard download)
I've had a look at this and it's due to a typo in the format definition in the REST API (occurrence_external_key
instead of occurrence.external_key
). So that will need an update to the API. Is this urgent enough to require a hotfix? A workaround would be to add 'occurrence.external_key' to the 'addColumns' definition on the form. But this would add to all formats - adding a second column of the same name to the standard format.
I would prefer not to make the new downloads public until this is fixed, because I do regard the ability to make use of external keys to be an important feature of iRecord, and I would be worried if people were downloading versions of the records that did not include them. It's probably only a minority of our users that actually make use of the external keys, but for those who do it is important to maintain the integrity of this data.
So I think this is urgent, and I think the hotfix is preferable if it is not too enormous a piece of work.
I'm reluctant to add another column to the Simple download, but if the hotfix is out of reach before the middle of next week then we will have to use the workaround.
Correction on my part - the field is occurrence.source_system_key
but it still requires fixing in rest.api. Very simple changes for me to make to the code to apply this and the other changes discussed, but I'm not sure what's involved in doing a hotfix. @johnvanbreda - can you comment on that?
@burkmarr I'm currently applying hotfixes relating to the activity summary work so I'll roll this into one of those.
Thanks @johnvanbreda - I've done the changes but I should test on develop warehouse first. I have a short meeting now but will be able to do that in about 30 mins.
The change to the REST API is tiny, and replaces something obviously incorrect, so I don't think there is any risk to deploy that? Will test a download locally first.
Okay - I've pushed to branch download-220121.
The hotfix has now been deployed @kitenetter @burkmarr
Many thanks @johnvanbreda. @kitenetter - I've removed the addColumns field from the form config.
Thanks @burkmarr @johnvanbreda that all looks good and I'll make it available to people on the live site this afternoon.
Thanks everyone, I really hope that this goes down well with our users, I think it is a big step forwards.
Closing.
Thanks for everyone's hared work on this. A real team effort.
Provide options for different download formats for Elasticsearch. This issue has been split off from https://github.com/BiologicalRecordsCentre/iRecord/issues/800.