ES download: replicate functions from existing download page

kitenetter commented 4 years ago

We need to enable an ES download (EDP) that replicates, as far as possible, the functionality of the current download page (CDP), and outputs a csv file containing the same attributes as the current csv file. The following compares CDP (https://www.brc.ac.uk/irecord/downloads) with EDP (https://www.brc.ac.uk/irecord/elastic/download).

CDP: Download type EDP has a "Within" dropdown, but the choices are more limited, and include download of all records (not provided on the CDP). EDP needs to have equivalent function to CDP, e.g. ability to download records for user's verification role, and for user's activity-related records.
CDP: Filter to apply This provides an option for users to apply one of their saved custom filters. This is missing from EDP, although EDP does have a more generalised Search box that would allow (at least some) equivalent filters to be applied on the fly.
CDP: Survey to include In principle this can be replicated by using an appropriate search term in the EDP Search box, but in order to do that one has to know which surveys exist. CDP provides a dropdown.
CDP: Date field/Start date/End date In principle this can be replicated by using an appropriate search term in the EDP Search box, but the syntax required is much less intuitive than the dropdown and data boxes on CDP.
CDP: Select a format to download Not yet available of EDP, which outputs a csv file. csv is undoubtedly the most widely used option, and I don't know how much demand there is for the other options on CDP. CDP's "NBN Exchange" is outdated and ideally needs to be replaced by a Darwin Core format similar to the one used to export to NBN Atlas.
CDP csv file structure CDP provides a csv with 47 columns. There have been requests for renaming, re-ordering and adding to these columns, but to start with it makes sense to provide a csv download structure on EDP that is an exact match for the structure on CDP, in order to maintain compatability with existing user workflows, and to get something up and running while we review enhancements to the download structure.

@burkmarr at out last review telecon I think we agreed that you would look at this issue, is that what you were expecting? I'm not sure what is needed to take this forward; it seems to be a mix of setting up filtering functions on an ES download page plus linking it to an equivalent of the current csv file structure (but generated from the ES table). Is there enough here for you to work on or do we need to do further preliminary work?

I'm adding this to the February milestone but with the recognition that there isn't very much of February left, and this may need more time.

BirenRathod commented 4 years ago

@johnvanbreda Thanks. Great, I will install PG12 on it.

burkmarr commented 4 years ago

The download formats aspect of this issue (item 5 in the list https://github.com/BiologicalRecordsCentre/iRecord/issues/800#issuecomment-638868012) has been moved to a new issue: https://github.com/BiologicalRecordsCentre/iRecord/issues/886.

kitenetter commented 4 years ago

@kitenetter to clarify use cases for different roles within iRecord

burkmarr commented 4 years ago

@johnvanbreda - I'm extending the new ES filterSummary control to incorporate inputs of class es-filter-param which are on the form. I'm a bit hampered by my lack of understanding of how to use a couple of the attributes. Can you give very brief examples here of how you might use the follwoing two attributes?

data-es-query
data-es-nested

burkmarr commented 4 years ago

@kitenetter - as discussed at yesterday's meeting. Here is a page on develop which demonstrates some of the new features of the the ES filtering which can be used for an 'easy download' page: http://test.brc.ac.uk/irecord_dev/es-download-dev

There are notes at the bottom to help evaluate it. Note that not all the controls we've worked on recetly are initially visible on the form, but if you edit the form you can uncomment a couple of them to see how they interact (as explained in the notes). The form uses HTML comments: . So just remove/replace those to add/remove controls from the page.

This is using the ES index implemented on the dev warehouse server. Until recently its performance was okay but I rebuilt it after the recent update to the develop warehouse and data model and now the performance doesn't seem too good - possibly its struggling for resources. In fact it failed completely yesterday and I spent much of the afternoon trouble-shooting it. It seems to work now but you will have to wait longer for it to deliver data than on the live site.

johnvanbreda commented 4 years ago

@burkmarr

data-es-nested allows a filter against data values which are in nested objects within the JSON document (see https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html). Some parts of the occurrence document on ES contain groups of values which relate together as a single object, e.g. a geographical location's ID, name and type are grouped into a nested object. Elasticsearch keeps these as separate mini-documents in the index so when you query the main occurrence document, the nested object data is not in the scope of the query. Setting data-es-nested allows you to define the path to nested objects and switches the context of the query to the nested objects instead of the main document. See https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-nested-query.html.

data-es-query - allows a custom Elasticsearch query to be defined as a JSON string. The token #value# will be replaced by the value of the input control. For example:

  <label>Website ID:
    <input type="number"
      class="es-filter-param"
      data-es-bool-clause="must"
      data-es-query="{&quot;term&quot;: {&quot;metadata.website.id&quot;: #value#}}" />
  </label>

johnvanbreda commented 4 years ago

P.S. I updated the docs but haven't rebuilt them yet.

kitenetter commented 4 years ago

@burkmarr here are my comments on the dev page. Let me know if you want anything split out into separate issues. I really like the way the ES page works, if we can tidy up a few loose ends it will be great.

Download notes.docx

Biggest issue to resolve may be how we can replicate the inclusion of custom attributes for particular surveys. I suspect that's going to need some additional development, but if so it shouldn't prevent us going ahead with making the new download page available.

burkmarr commented 4 years ago

@kitenetter - thanks for your analysis. Below are my thoughts in response to your comments. Most of my questions are for @johnvanbreda, but feel free to comment further if you need to. John any general observations would also be welcome.

Date selection If we are going to include the [standardParams] control I suggest that we don't do any work, for now, on providing a separate date selection control. If we get significant feedback that a date selection control outside of the [standardParams] control is required, we can tackle it then.

Dataset selection and custom attributes Sample and occurrence attributes are encoded in the event.attributes and occurrence.attributes ES files respectively. An example of the occurrence attributes for a record is shown below (sample attributes are similar):

{ 
  "value": "1", 
  "id": "93" 
}, 
{ 
  "value": "Parker, Adam", 
  "id": "18" 
}, 
{ 
  "value": "Certain", 
  "id": "54" 
}, 
{ 
  "value": "female", 
  "id": "105" 
}, 
{ 
  "value": "adult", 
  "id": "106" 
}

The attributes values are good to go, but the attribute names are not present – just their IDs. So we would need to query the DB to translate attribute IDs to their names for the CSV column headers. That wouldn't be so bad because we'd just need a couple of queries to match the IDs to the sample and occurrence attribute names for a given dataset. (@johnvanbreda – what is the reason that the attribute names aren't included in the JSON?) We'd have to add some logic to the Rest.php code responsible for building the CSV downloads to parse the JSON and populate the CSV columns.

Backward-compatible CSV format - RecordKey MH comment - The change of format from "iBRC" to "brc1|" will annoy people who want to be able to use this as a persistent ID - is the change necessary? RB - We could add a special field handler to convert keys, e.g. brc1|15036317 to iBRC:15036317 for backward compatible downloads.

Backward-compatible CSV format - Source MH comment - would prefer to stick to the old format if we can, it is easier to understand. RB - We could may be replicate the old format using a special field handler for backward compatibility – website, dataset & recording group are all available in the ES index, so I think it should be possible.

Backward-compatible CSV format - Determiner and Kingdom RB – I will fix the typos.

Backward-compatible CSV format - Output map ref projection MH comment - this sometimes produces a number code - what does that mean? RB – For some reason the 'Output map ref projection' is sometimes provided as the EPSG code for the projection rather than the usual 'human readable' string. The same appears to be true for the 'Projection' field (which corresponds to the input map reference projection). @johnvanbreda – is this something that should be changed in the ES index generation or should I provide a special field handler to translate EPSG codes to a string?

Backward-compatible CSV format - Date type MH comment - This will be needed for compatibility with LERC Recorder systems? RB - date type is not currently in the ES index. @johnvanbreda – I guess it would be feasible for me to add a special field handler to 'reverse engineer' this from date_start, date_end, day_of_year, week and year?

Backward-compatible CSV format - Input on date & Last edited on date MH comment - Not sure why this comes out in an odd format when opening the csv into Excel - is that under our control? RB - The difference seems to be that in the original download format, the date/time in the CSV is like this '2019-10-09 11:31:29' but in the new ES CSV, it is like this '2019-10-09 11:31:29.000' - I think that the millisecond part ('.000') is added by ES. So removing this before outputting would fix it. Again we could do this with a special field handler.

Backward-compatible CSV format - Verification status 1 & Verification status 2 MH comment - needs to be expanded to the word rather than the code.
RB – could be done with a special field handler.

Backward-compatible CSV format - Query MH comment - What gets output here? Ideally should be “queried” or “answered”.
RB - Looks like currently it is a one-letter code, e.g. 'Q' for queried. We could use a special field handler to output the full word.

Backward-compatible CSV format - Automated checks MH comment - the old system is preferable in that it provides understandable data, rather than the obscure number codes of the ES version; but I'm not sure if this column is seen as critical by many users. RB - I suggest that we leave this out and see if anyone notices.

Backward-compatible CSV format - attr_det_full_name MH comment - I've never understood why we need this and would be happy for it to be dropped.
RB – Happy to do that.

Permissions and filter access for LERCs I'm sure we could add some conditional alert on the page to signpost the terms of use for LERCs when a filter associated with a LERC is selected, but I can't see how filters are associated with LERCs. There's the defines_permissions field, but that's not specific to LERCs. I can see the term 'LERC' used here and there in filter titles and descriptions – is that used consistently?

kitenetter commented 4 years ago

Thanks @burkmarr - two points:

Backward-compatible CSV format - RecordKey For the record ID the conversion should be from "brc1|15036317" to "iBRC15036317" (without a colon)

Permissions and filter access for LERCs LERC downloads are enabled by giving their user account the role of "regional collator". It's possible that some regional collators are not LERCs, but most are, and enabling a conditional alert against the regional collator role should be fine I think. The LERC terms are here: https://www.brc.ac.uk/irecord/lrc-download-2 [node 7106]

johnvanbreda commented 4 years ago

@burkmarr some replies for you:

what is the reason that the attribute names aren't included in the JSON?

The original project which the attrs_json field was created for was multi-lingual. Therefore we would have needed to store multiple versions of the attribute captions to meet this projects needs and as the captions are not going to change, I felt this was an excessive overhead compared to the effort required to map IDs to captions. I do understand the other side of this argument though!

Backward-compatible CSV format - RecordKey

Note that the purpose of adding brc1| to the start of the key from Elasticsearch was disambiguation when records had come from multiple warehouses but were stored on a single ES instance, which was an early requirement that has since been put on hold.

Backward-compatible CSV format - Output map ref projection

It should be the case that EPSG codes are where there is a simple x,y coordinate format (or lat/long), whereas the alphabetic codes are indicating a different notation is being used that needs parsing to map to the underlying coordinates. I'm not sure this should be, or needs to be, changed? Perhaps where the code is numeric adding "EPSG:" to the start might be useful? Have you found cases where a grid notation has a numeric projection code?

Backward-compatible CSV format - Date type

This was left out because I was trying to map to Darwin Core as closely as possible in the ES index design. I think that if it is useful though we should add it to the index rather than reverse engineer it from the dates.

Permissions and filter access for LERCs

There is a role on iRecord "regional collator" and this allows a "Location of collation" field to be filled in on the user account. This method of doing it is legacy though and using assignable download filters would be preferable.

burkmarr commented 4 years ago

Many thanks @johnvanbreda. There is overlap with the work I need to do to address some of these 'backward-compatibility' format with the work I've done for the MapMate format which is currently in this pull-request: https://github.com/Indicia-Team/warehouse/pull/354. I can remove that pull request and address these issues in the same branch before resubmitting the pull request. Is that okay with you or would you rather comment on those proposed changes for the MapMate format first?

kitenetter commented 4 years ago

Permissions and filter access for LERCs @johnvanbreda said "using assignable download filters would be preferable". Am I correct in thinking that no new development is required to make that change? I think I can just assign a "survey downloader" role plus an appropriate dataflow filter for each LERC concerned?

An LERC staff member going to the download page would then see the dataflow filter as another choice alongside any other download access they have. We could then set a conditional alert as suggested by @burkmarr to display the LERC terms of use.

However, the "survey downloader" role is used for a wide range of purposes, not just for LERCs. So If my understanding of the above points is correct, what I would propose is that I create a new role "LERC download", and copy over the same permissions as given to "survey downloader". We could then attach the conditional alert to the "LERC download" role.

Backward-compatible CSV format - RecordKey @johnvanbreda said that the change to "brc1|" related to "disambiguation when records had come from multiple warehouses but were stored on a single ES instance, which was an early requirement that has since been put on hold". Is that requirement likely to come back in future? If so we should probably prepare for it now.

One option might be to retain the prefix "iBRC" as referring to the warehouse "brc1|", and then use a different prefix for any subsequent warehouse added, so that we might end up with a set of prefixes: iBRC, 2iBRC, 3iBRC, 1EBMS etc. etc.

But if we really do need to change the pattern in order to future-proof this then I think we should make the change now, alongside the other changes.

burkmarr commented 4 years ago

@kitenetter - the conditional alert would be displayed depending on the filter selected by the user. Since the 'lerc download' role (or any role) would be associated with the user rather than the filter, I don't see how I could use it to trigger the contitional alert. It has to be something I can detect on the associated filter. If we used a naming convention for such filters e.g. including the text '(LERC dataflow)' or something in the name, I could detect that when the filter was selected and set the conditional alert.

With the backward-compatible keys, don't we just need to maintain consistency for keys that have been used before? So why not just replace 'brc1|' with 'iBRC' and if we ever use other warehouses, we can just use the new ES pattern for those, e.g. 'brc2|', since there's nothing for them to remain backwardly compatible with?

kitenetter commented 4 years ago

@burkmarr assuing that I am correct in my understanding of how we move to using assignable download filters, then it would be fine to use a naming convention for the LERC filters. I suggest I prefix each such filter with "LERC download", e.g.:

LERC download - BMERC
LERC download - TVERC etc.

On your second point I think you are saying the same as me, that we stay with "iBRC" for the existing warehouse and implement new prefixes as and when required. Agreed!

burkmarr commented 4 years ago

@kitenetter - I've addressed the points pertaining to the backward-compatibility format in a pull-request as described in issue #886.

burkmarr commented 4 years ago

@kitenetter I've hived off what I think are the remaining issues with this into separate issues (see above) and suggest that we close this one now.

kitenetter commented 4 years ago

Thanks @burkmarr - closing as suggested.

BiologicalRecordsCentre / iRecord

ES download: replicate functions from existing download page #800