BiologicalRecordsCentre / BSBI-Card-and-PlantPortal-DEPRECATED-

A portal to promote plant recording and analysis of plant data

0 stars 0 forks source link

Details of how the importer will work exactly #5

Closed andrewvanbreda closed 4 months ago

andrewvanbreda commented 8 years ago

Nested in #46

Hi, I just wanted to bring up the topic of how the importer will actually work. Now I have started to really get into programming it, I have a better idea of how it can work. Mainly we need to strike a balance between functionality, and what is feasable. In particular we need to make it functional, but also keep in mind if it becomes too complex it might become less reliable and also more difficult to use. Rather than spend hours going back and forth with ideas and maybe not getting very far. I think if I put an idea to you, then you can tell me what alterations you would like to see.

The importer will ask for CSV file upload.
We will then display a mappings screen which allows the user to select which columns in the CSV file headings map to which fields in the database.
When the user continues from here, the system will examine the existing database.
The user is then presented with a screen displaying the results of that examination split into sections.

The sections will have (roughly) the following headings. Note these may vary as I will probably think of more situations.

"The following samples will be created as new records in the database"
"The following samples appear to already exist in the database, note that if you continue with the import these will not be imported"
"The following samples do not currently exist in the database, however their sample groups do exist. These will be imported as new samples attached to those existing groups"
"The following samples and plots do not currently exist in the database, however the plot groups do exist. These will be imported as new plots and samples attached to the existing plot groups"
"The following samples do not currently exist in the database, however the plots already exist. These will be imported as new samples using the existing plots"

Note that although this might seem confusing, we only need display the ones that are applicable at the time. So it is unlikely a user would see all the sections above at the same time.

The user will be advised to continue with the import if all is OK.

Now there are two possibilities of how we could deal with corrections.

We could get them to make edits on screen using the rows displayed in the various sections. The problem with this is that this makes things more complicated, also I think we would want to write those results back to the original file otherwise their import CSV file will be out of date. I think the combination of giving the user full edit capability on the screen and the options to write back to the original file might make life very difficult, espiecially if we want to get something working by the end of august. We also might want to provide options to opt-out of the auto-correction of the original file (or output a copy of it anyway)
The simpler approach would be to simply ask the user to make corrections to the original file. Then get them to click a Re-upload button, at which point we re-examine the file and present them with the results screen again. This process of corrections could continue until they are entirely happy. We could extend this functionality if we have time to dump the text CSV file into a big box on screen, then get them to make the corrections on screen (this is different to what I was thinking for point 1 above, where the rows displayed in the different setions would be editable)

Then when they click import, we import the data into the database.

Let me know what you think.

I also have one further question. Do we need the ability to limit users so they can only use their own groups/plots, currently they could attach data to anyone's group

andrewvanbreda commented 8 years ago

Having had another day's worth of thinking about this, I think my above comments are still valid, however I think we would need to extend the suggestion to possibly include sections on the page which MUST be corrected before the import can continue. For instance, if the sample's identifier name match's another sample's but there is a mismatch for the other data for the sample such as its spatial reference for example. In this case, the data would flagging as being a definite issue. However the correction procedure becomes less clear in this scenario, as it might be the sample already in the database that needs correcting, not the one being imported. Will need to think about this.

andrewvanbreda commented 8 years ago

Actually I think the correction procedure would have to just be to force the import data to use a different sample identifier name. But the point still stands to have different sections. Some for corrections that must be made, and some for the user to confirm that the correct import is about to happen.

sacrevert commented 8 years ago

Regarding your first question, I prefer simply flagging up the (potential) errors, and simply asking them to make their own changes and re-upload the CSV. I think this is better for the user's personal audit trail in general, rather than editing on screen.

I also have one further question. Do we need the ability to limit users so they can only use their own > > groups/plots, currently they could attach data to anyone's group. I think we will probably want to restrict people to their own groups, but, as we discussed elsewhere (#4), the groups of users formed for sample/plot editing permissions could also be used to designate which users could add samples/plots to existing groups. I think generally this will be most useful for plot groups (= projects); sample groups will be much smaller (as mentioned, they will be primarily for linking groups of samples to habitat patches and or analysis), so it seems less likely that multiple users will want to add new samples to existing sample groups.

On a related topic, if we open up other users groups for plot/sample additions, won't this make the issue of group identifiers quite complicated? We will be assignining unique database IDs behind the scenes, but users will be assigning and using their own (almost certainly simpler) identifiers - so, doesn't this mean that we will have to restrict grouping options to a users' own group, or to that set for which they have permissions (and therefore hopefully know the set of identifiers in use)?

Finally, I agree that we will need to distinguish between (i) the set of minimal required fields on upload; and, (ii) those errors that absolutely need correcting and those that the user should be encouraged to double check. As I said, I am not too concerned about the upload being complicated, as long as it is clear to an expert after reading the documentation/following the on-screen guidance.

andrewvanbreda commented 8 years ago

OK. I think my answer is in danger of getting so complicated it might end up being useless. So I will split it into several much shorter answers

First I think yes we can have plot groups and sample groups and the user is only allowed to assign to ones they have permissions for. That permission system could work a bit like NPMS with some differences.

So, the group name is held as a termlist term in a sample or location (plot) attribute.

The groups a user has permission to can be held in two person attribute types One for plot group permissions. One for sample group permission.

Given the above model, a user can be assigned permission to any number of groups. Also in theory, a sample/plot could be assigned to any number of groups. This is good for the backend model even if we apply restrictions in the user interface.

We could use the following rule, you can use plot/samples providing they are in a group that you are a permission member of or the sample/plot is one you created.

There wouldn't have to be any restriction on the size of a group, for example, in theory permissions could be given to use an existing plot by creating a plot group with just 1 plot in it. Then the original creator would have permission, and anyone with permission to that single item group.

This grouping and permissions can work in exactly the same way for samples and plots for simplicity.

andrewvanbreda commented 8 years ago

Now thinking about your point about simplicity of the identifier system for individual plots and samples

Now thinking about this. Plot identifiers are used if users are importing new samples to existing plots in the database. Again, looking at sample identifiers, these are used when importing occurrences that they go onto the same sample.

Now we only need the permissions system at the grouping level, as you will have permissions to samples/plots you create, and any ones you are a group permission member of (even if the group is only 1 in size).

Do we even need individual identifiers for samples and plots? It would be simpler if we can avoid that altogether. Problems will only occur if the user's own plots/samples they have permissions for are identical across the Name/Spatial Reference/Group Name etc. This probably won't happen too often. Given the situation of an identical match, can't we just report the problem on screen to the user (using the same screen with different sections I mentioned before), display the timestamp of when the samples were created (as these would be different) and report the database ID number on screen and get them to alter the import file to include the database ID number of the one they want to use on the import file row.

Don't forget we are using the screen with different sections suggestion I made, the user would always see exactly what is going to happen to all of the data, so it shouldn't be the case that an assignment would be made incorrectly

I think the advantages of what I have suggested in these 2s post would be A. The samples/plots are treated the same way B. We only need to worry about permissions at the grouping level. C. An elaborate identifier system isn't used at the individual sample/plot level which might actually be used very little in practice

Let me know your thoughts

sacrevert commented 8 years ago

OK, yes, that makes sense. I had neglected the fact that that the match would need to be across all of Name/Spatial Reference/Group Name etc.

One thing that does occur to me, which is related to this, is the display and accessibility of the unique database ID number. If academics use a certain set of plots/samples for an analysis that will be published in the academic literature, then it would be useful for them to list the database IDs (many journals now require authors to archive data, or provide links to data in an existing public database). Being able to list the database IDs of plots and samples would be the clearest way of doing this (another would perhaps be to allow users to retrieve plots/samples as queries, and then archive the query result with a unique URL, stored in perpetuity. This is not high priority, but I mention it as something that may be desirable in the long term.

andrewvanbreda commented 8 years ago

I think this situation is OK. Don't forget, using my suggestion a sample or plot could be assigned to any number of groups. There is no reason why a group could not be made for a particular set of plots/samples and then the set of samples/plots would be viewed on a grid like usual in Indicia. That grid would show the IDs if we want it to If that grid has a download option enabled then that that set can be downloaded immediately including the IDs.

I think something I might need to think further about is who can administer the groups.

andrewvanbreda commented 8 years ago

There is a complication with my model that I just want to address so you know what is going on before I progress further. That is the problem of the uniqueness of the group names.

There are two issues here.

We enforce a unique group name across all groups, but this causes a problem as it means I have to examine the database more before the import is made to make sure they aren't reusing a group name. We don't want to make the importer slower and slower the more data there is in the database.
If we don't enforce a unique name for a group, then if someone specifies it on the import, how do we know which one to use. This issue is eased because we are only going to allow people to attach to existing groups if they have permission to do so. However in theory, they could be assigned permission to different groups with the same name if they are not unique.

I believe the answer is to allow non-unique group names. However although the name is not unique, when they are displayed on the system's pages we can display the database ID in brackets next to the name to have a kind of Pseudo-uniqueness for display purposes e.g.

Andy's Group (Group No 252)

During the import, the user would not be required to enter the number (we could make it transparently optional), but in the unlikely situation they are assigned permission two groups with the same name, the importer can flag this and ask them to include the group number in their CSV file to resolve any ambiguity

sacrevert commented 8 years ago

Yes, I agree with your solution. Does this mean we would not be checking the other fields on import (e.g. spatial ref?) - what happens if they put in a new group number by accident, but meant to attach to an existing group (for which they had the correct spatial ref)?

andrewvanbreda commented 8 years ago

Situation: If they enter a new group name/number by accident.

A: If you remember we have a page displayed before the final import confirmation button is pressed. That page would display the record in the section labelled for New Samples With New Sample Groups

Providing the user checks this page, they would see the data is about to be treated incorrectly and would be able to amend it to use the existing group in the CSV upload.

They can then press the re-upload button and the data in the CSV file would be re-examined at which point the record would be displayed in the correct section New Samples Using Existing Sample Groups

andrewvanbreda commented 8 years ago

For occurrence uploads. A match for an existing sample would be made using if the spatial reference, date, sample group etc matched exactly

Only if an exact match was found for a sample that the user has permission to attach to would it be flagged in the "Occurrences to attach to existing samples" section

Otherwise it would flag in the "Occurrences to attach to new samples" section and the user could see the mistake

andrewvanbreda commented 8 years ago

Obviously all these possible different scenarios are complicated behind the scenes, however don't forget the user would probably not usually encounter all the different situations during one import, so what they see on the screen hopefully won't be too complex.

sacrevert commented 7 years ago

(i) I'm adding our recent email chain here for archival purposes: Discussion post Skype 11 11 2016.docx

(ii) However, I realised that I did not review this github issue before writing my responses, and there is one significant conflict. On your long post above (Jul 21) you said:

Do we even need individual identifiers for samples and plots? It would be simpler if we can avoid that >altogether. Problems will only occur if the user's own plots/samples they have permissions for are identical across >the Name/Spatial Reference/Group Name etc. This probably won't happen too often. Given the situation of an identical match, can't we just report the problem on screen to the user (using >the same screen with different sections I mentioned before), display the timestamp of when the >samples were created (as these would be different) and report the database ID number on screen and >get them to alter the import file to include the database ID number of the one they want to use on the >import file row.

I realise this somewhat contradicts what I wrote in the above document, where I noted that vague spatial refs (e.g. vice-counties) might argue for having unique plot/sample IDs. What is your opinion on this?

(iii) Finally, I also wondered where plots/samples need always need groups. I couldn't see that we had discussed this -- would it be permissible to have a plot/sample that was not a member of any grouping?

andrewvanbreda commented 7 years ago

Adding some more email comments made by me and replied to be Oli to this thread, I will then comment on the points below and above

A. We need to nail down exactly what the groups in the system are used for.

I think I have nailed down what was making me uneasy about what has currently been implemented. I think that we were ultimately aiming for the following fields: spatial reference (best available, could be grid ref or, if vague, a boundary); plot ID (created by the user); sample ID (created by the user); vice-county (could duplicate info in spatial ref field); country (likewise); plot group; sample group.

If we don't separate out spatial ref and plot ID, then there will be issues defining plots. For example, if the spatial ref is actually vague ('Scotland', e.g.), then this cannot be used as a unique plot identifier for use in plot groupings. The implication being that a unique plot ID is required to link to the plot groups.

Specifically relating to your question, I now think that the groups will just be used by the plot/sample owners to organise them - I think that using them for anything else (e.g. choosing groups for analysis) will be too complicated, and that, if/when a time comes that a link to another web program is required, then the user will manually select plots for analysis at that stage, not create new plot/sample groups. The same can be said for general users adding plots/samples to their workspaces (if this kind of this is implemented).

B. During the user importer, we need to discuss what the user will be expected to do to correct data, will they need to correct just the file, or are their situations where they will need to correct data in a user interface and how to make this accessible to them if needed.

To an extent this depends on the resource available, but I am happy to go with whatever is easier. Presumably this would be an onscreen or downloadable (text) report - the onus is then on the user to make the corrections in their original file and reupload.

C. Perhaps need to discuss associated species and how they fit into things, or are they a "To Do" for future.

I think this is definitely a desirable (rather than essential) to do item.

D. Confirmation on how we are handling Stratum? My thought is an Occurrence Attribute to indicate the layer, then have different occurrences if species present in more than one layer. Confirm? I think we discussed previously that it would work like this. The occurrences are linked because it is the same sample.

Yes, this sounds sensible, and seems to be inline with how different strata are typically handled (i.e. as separate occurrences). This also seems to be what I had in my schema.

E. You had a query about how the system will make sure that when a user types in a species in an importer row, how we can be sure that it is assigned to the correct species, in particular if the species is not typed exactly. Currently it works the same way as the standard importer......but I haven't examined this area of the standard importer in detail, so I actually don't know what it does to resolve this problem! I think related to above but an idea that we had discussed was a "dangerous names list", where I think we can display a list of alternatives to the user if the species name is detected as being liable to error. However I would like to discuss how we can make this practical, for instance how can we possibly enter all this information into the system in a practical sense.

I'm not exactly sure what you are asking here -- is it not practical to tag the existing UKSI with flags denoting danger, and then report these on screen or in a downloadable text file on upload?

F. I think we need to confirm the data structure for the classifications. I think it was going to be a termlist per classification, but need to revisit. Need to take into account relationships between classifications to be stored as a term attribute. Note we need to confirm that no support is required for data about the relationship (in a similar way that synonyms have no information about the relationship)

I think we agreed that classifications would have attributes, and that this general extension for Indicia termlists was going ahead. I think the relationships between separate classifications should be seen as a desirable feature only, as I doubt we will have the resource available at this time to define these relations.

G. Need to talk through support for community support in the import, and how the users can actually import community information in a practical sense.

Unless I’ve overlooked it, I don’t think we previously discussed allowing users to import community classification data (I think previously all discussion was around generating these memberships within the website, e.g. through MAVIS (which is low priority). That said, I can see that it would be useful to allow this – however, it would make the importer even more complex, particularly as a particular sample could have multiple possible memberships (each with its own goodness-of-fit value). I wonder whether we should leave this for the moment, and address it through an alternative importer in the future (with required columns: sample ID, community, goodness of fit, classification type etc.)

H. Need to talk about the possible requirement for an import specifically for plots/plot groups without necessarily importing samples/occurrences at the same time. Is this needed?

I don’t think this is needed. I think we should assume that a plot will be defined with at least one attached sample.

I. I am not sure whether there is a fix for the following issue, it might be elsewhere in my notes. Need a way to store attributes such as Maximum Abundance Species as these are held against a particular species/community combination. It had been talked about storing these as JSON encoded text in the text_value field of the taxa_taxon_list_attribute, although this is far from ideal

I note that you said that you had investigated this thoroughly, and it seemed the best option available at the mo. Therefore I think we go with it.

J. Need to also clarify how the communities are related to the species. Are they to be imported onto the preferred_tvk then assumed to be associated with the synonyms?

Yes, although I think I was going to provide a preferred_tvk for the NVC community members.

K. I think we need to confirm how we are going to handle the plot shapes and how these will work with the importer. Also how it is going to work to have plots that are larger than 1km represented as a boundary in a practical sense. We talked about this, but I think it needs further though to make sure out ideas are practical in this area.

My thinking was that, for any display purposes, spatial refs under 1 km would be represented at the resolution provided (i.e. as squares, irrespective of dimensions and shapes specified independently). My naïve thinking on the other, vague elements, would be either to (i) allow only vice-counties or countries, and use standard spatial polygons to represent these or, (ii) allow other areas (e.g. parishes, metropolitan areas, nature reserves etc.), which would be geocoded by a web service.

andrewvanbreda commented 7 years ago

Firstly, to deal with the points made in the first comment from today

Point "(ii)" The issue of how vague spatial references would affect the need for plot and sample IDs, I will deal with this when replying to point "A." below.

Point "(iii)" Yes the scenario of having no group is already permitted in the code, although without a grouping, the importer will always create a new sample/plot, otherwise we can't guarantee the existing sample/plot selected by the computer is the intended one. For example, to import an occurrence onto an existing sample, that sample must have a sample group that you have permission to use.

andrewvanbreda commented 7 years ago

Point "A." I think plots are OK, plots are just Indicia locations. Indicia locations have a name, and this can just be used as the plot ID. This is effectively how the importer currently works, I just need to rename Plot Name in my importer as Plot ID However samples are more problematic, as they don't have a similar field, I think at one stage we decided not have a sample ID after all, but at that point I don't think I was considering vague spatial references. However I don't think this will be a problem. I can create a sample ID sample attribute and use it for matching. I think it should still be the case that we still only look for a match in groups you have permissions for, unless you disagree.

Point "B."

OK, I think what I will do is make a list of all the correction scenarios and make sure that they can all be corrected easily by the user. We don't want a situation where the importer blocks an import from a user because corrections are required and then the user is flabbergasted about what to do next. Currently a report about the import is given to the user on screen, I will consider what needs to happen next. If it is obvious that the import file is at fault there is a simple facility for re-upload, the situation becomes more difficult if it isn't clear that the import file itself is at fault, I will review this as I am not sure how many (or any) of those potential difficult situations actually exist in practice.

Point "C."

Yes OK

Point "D."

Yes OK

Point "E." Yes it isn't practical, but we did discuss it in one of our meetings. I suppose the idea would be instead of tagging every single item in the UKSI, we would have a separate lookup list of "dangerous" species names that would be deemed as likely to be misinterpreted by the importer. This lookup list would obviously have to be manageable. So if the importer found the user had entered a taxon name that was liable to be misinterpreted (e.g. it is very similar to something else), then a list of similar items could be displayed to the user to choose from. It was your though, so I am either misunderstanding, or if you don't know what I am talking about and you don't think we need to do anything in this area then would can scrap this idea. Up to you.

Point "F."

Mostly OK. However I have noticed something subtle in what you said that could cause an issue. Currently the community terms will be able to have attributes because I am adding term attributes to Indicia. However you have referred to Classifications as having attributes. The classification is a termlist, does this mean the termlist as well as the terms inside it should have attributes, this might be a natural extension to make, but at the same time we don't want to make Indicia complicated in areas which are not going to be used.

Point "G."

Yes that is why I raised it. I think I saw it somewhere in the spec or notes, but I am not sure it is manageable with the current importer. If/When required, I agree, a separate importer would be the way to go. As you say one we get into the area of samples having an indefinite number of memberships then things would really get messy with the current importer, possibly to the point of being impossible to program. Separate importer would be way to go.

Point "H."

Yes OK. I think it is complex enough anyway, so I agree.

Point "I"

I am still uneasy about this one so I will check my notes further.

Point "J"

Yes I have an import file with the preferred_tvks on. I think I need to look at the importer again to see if I need to comment further on this.

Point "K"

I don't think this would be too complicated in theory. If it is bigger than 1km do "something", if smaller do "something else" is simple logic. I think actually the main issue might be it is going to be that easy for a computer to tell if it is bigger than 1km,? As you say we could go with if the spatial reference square < 1km. However with all the different spatial reference systems (some only being single points) is that going to to work? Point spatial references don't say anything about the plot size, it is just a point, and we wouldn't want to work out the size from widths, lengths diameters. For instance, couldn't we have a plot at a single point, but with a width and length taking it over 1km, Then we need to take into account the system could change in the future to support more things like different shapes. So am wondering if a flag on the row during import would indicate if the plot is smaller or bigger than 1km. Either than or just rely on the spatial reference square, or is my idea more flexible?

sacrevert commented 7 years ago

Point "A" OK, I see. Just to clarify, I presume that using the Indicia location field for Plot ID won't create confusions around having an additional free text location field (e.g. a user might also want to record a nearby landmark as a general location). I am now wondering whether we need sample ID; even with a vague spatial ref, as long as Plot ID is provided, and the importer is only searching the users' groups, then I was wondering whether the sample ID was necessary. However, I then thought that you might have the situation where multiple samples were made of a plot in the same time period (e.g. two surveyors quality checking each other's plots), so then I thought that sample ID probably is useful, even if it is redundant in most instances (because of plot ID + spatial ref + date).

sacrevert commented 7 years ago

Point "E" I think this is necessary, but I'm not sure whether I am going to get around to creating the list this side of Xmas, so, perhaps we can work on the basis that there will be a list that may often result in an error correction type response from the user.

Point "F." Probably just me misusing terminology. As you say, I can see why a classification might need attributes (e.g. bibliographic details about the classification), but I don't think this is necessary, particularly if it is a bit piece of extra work on an already complex system.

Point "K" Ah, I see, I was only thinking of British National grid refs that describe a grid cell depending on the length (e.g. SK3426 etc etc). What about requiring precisions? So, if a numeric (as opposed to vague) spatial ref is entered, then the importer checks for a precision - we could also have a warning that suggests that users enter vague spatial refs (e.g. vice-county names) rather than centroids with an arbitrary precision?

No comments on any of your other points, except agreement.

andrewvanbreda commented 7 years ago

Point "A" I don't there would be confusion as a general landmark could have a different location type. The only confusion might come if we decided plots needed a name as well as an ID. However if needed we can just use an attribute for the ID. I am half thinking of that, because the sample will work like that. However if we do that then we have nothing to put in the location name field. So my preferred method currently is store plot ID in location name, then use a location type of plot.

Point "E" I will have a check what the current "normal" importer does, as surely this has same issue.

Point "F" OK agreed

Point "K" Yes we can do it like that, if you want to only allow British National Grid rather than single points as well, that suggestion is OK to do. You know what you need, so just let me know what you think.

sacrevert commented 7 years ago

Point "A" Fine, I didn't think it would be a problem, just thought I'd raise it (as an extra field for imports)

Point "K" Yes, I did think that only allowing BNG might also be a solution -- it does kind of force people to think more explicitly about precision, but I'm sure there are still who would misuse it (e.g. giving a 10 fig reference for a centroid of a much larger area). The lazy side of me likes going with BNG, but I think best practice really demands a precision (particularly as this is becoming more common anyway with phone GPS), and most surveyors would be recording it. So, I think allowing the standard spatial ref systems but demanding a precision for numerical entries is probably best in the long run.

The only question this leaves in my mind is for the BNG - e.g. we could autodetect precisions for the BNG based on the grid ref, but the fact that someone enters a 1 m precise BNG ref doesn't necessarily mean that that was the true precision, so, perhaps demanding manual entry is best.

andrewvanbreda commented 7 years ago

Yes, that is true, I didn't think of that, although the BNG reference is of a certain size, that doesn't mean that is an accurate representation of the plot. So the decision is.... A. Allow entry in the "usual" indicia allowed grid reference systems. B. Get the user to enter the width, length, diameter of plots as attributes and use these as the basis of the plot size.

So the question remains how do I determine if a plot counts is larger than 1km. Would the best rule be if any of the width, length or diameter is at 1km or more? So in theory a long thin plot would still count as meeting the 1 km criteria

sacrevert commented 7 years ago

I agree with A (although we will recommend BNG), but I'm not sure that B is relevant. My idea was to certainly record length, width, diameter, shape, but these would not be used in any way for the actual display of plot locations, or calculation.

Ultimately the spatial ref provided, along with the precision, give an indication of where the plot is (obviously this is more accurate as the precision increases). To be honest, the >/< 1km idea now seems redundant and over-complicated (sorry!). We can just plot based on the spatial ref - this is easy if it is a BNG or lat long with a precision, or if it corresponds to a vice-county/country. If they enter a name (e.g. a parish) that fails to be georeferenced, then this could be another a reason to reject the import. Or, if you would rather keep it simple, reject anything that is not a grid ref or a vicecounty or country (i.e. mismatches in (fuzzy?) lookup against standard list).

As an addition, we could also ask the user to specify whether a grid ref is a 'centroid', 'corner (NW/SW/SE/NE)' or 'vague' (i.e. anything above the level of the actual plot). But again, this would not be used for anything spatial or in any calculations.

andrewvanbreda commented 7 years ago

Yes, I agree with all of this. I think I will have to see about further validation, as the validation and calculating of whether the samples/plot/group or existing is already very complex. I need to keep in mind the maintainability of the code, if it gets too complex we might end up in a situation where is expensive to change because it would need so much retesting...or it is too difficult to change. Recording the corner won't be a problem, might need to talk to you about the vagueness at some point. It is possible the code might already do some things like reject non-grid references anyway......actually likely rather than possible.

sacrevert commented 7 years ago

OK, no problem. If it is easier, rather than validation, we could provide a standard list of spellings for vice-counties, or demand vice-county numbers instead.

andrewvanbreda commented 7 years ago

Yes I think we will just have to see what is best when I reach that bit of code.

andrewvanbreda commented 7 years ago

Email I sent to Oli to confirm how Vice County logic will work is below. Oli's reponses in italics. My further responses to Oli are in Bold

My thinking is as follows

A column is provided to specify the grid reference type, for each row either "corner", "centroid" or "vague" is specified

I think we can simplify this to “plot location” or “vague location” (as a required field), and then have a free text field named “Location details” (in which people write things like centroid or NW corner). Partly because distinguishing between corners and centroids might be pointless in the presence of GPS location errors, and because we don’t want to then have to ask people which corner (if they choose corner). Related to this, I think we mentioned previously about capturing actual GPS precisions for “true” plot locations (as opposed to the precision of the supplied grid reference (see github). AVB: No problem. I think I might just need to talk to you about GPS precisions and how they work. The last project I did that deal with any GPS stuff was quite a while back.

If "corner" or "centroid" is provided, then a normal grid reference will be expected that will signifying the plot position. A missing grid reference in this circumstance would be flagged by the importer and the import would not continue until corrected by the user Fine, with the above simplification. AVB: Fine
If "vague" is provided in the spatial reference type then Check the grid reference column and if this is filled in we will use this spatial reference under the assumption that is not an accurate spatial reference for the real plotposition. Fine. AVB: Fine

If the spatial reference is missing check the vice county and use this as the grid reference. If the vice-county is missing then check the country and use a grid reference for this as the grid reference. Fine. AVB: Fine

The "corner","centroid","vague" would be held as an attribute against the plot (location) record rather than the sample. Fine. AVB: Fine

I do still have a couple of questions if the above is right:

Do you want the vice county/country information retained, or is it just purely for calculating a missing grid reference and discarded by the importer. I think retain and store in the database. I think all plots should have a stored VC association. Presumably this will make queries against VC much quicker (e.g. for users interested in plots in their area?) AVB: Fine. The only issue is that if you are wanted to do things with the location I will need to do some coding because this would probably need to be stored as an sample attribute (as the location field is in use by the plot) and I don't think attributes currently have location facilities. Should be possible though.
In theory we could still have the grid reference/vice county/country in one column if you prefer, but I think it would greatly aid simplicity if we could still get the user to tell the importer what the field contains, maybe change the options to "corner", "centroid", "vague spatial reference", "vague from vice-county", "vague from country" (or words to that effect). I think it would be problematic, particularly for performance to work out what the user had entered in the grid ref/county/country column manually. Or we can stick with separate columns we originally planned. I think your first idea of checking the columns in sequence until something intelligible is found is better than complicating the labelling of a single column, particularly as this would not work for datasets containing data of mixed accuracy anyway. AVB: Fine

sacrevert commented 7 years ago

Thanks Andy. Regarding GPS precision, this is basically the radius within which a GPS reading is accurate, and is normally between 1 and 20 m (although this depends on GPS unit and satellite availability). I know that Karolis and John have started to use this information directly from phones in the iRecord app -- some records coming through the app are displayed on a map with a centroid and then a surrounding circle displaying the reported precision of the GPS from the phone. In the current case, we would be expecting that surveyors would have made a note of this precision manually, and would sometimes have it available to enter for a "plot location" GPS (although obviously not always).

Regarding VCs, what happens with normal iRecord data? As far as I understand it records have VC information attached to them, and this can't normally be stored as the location -- or is it just the case that VC is reported by a spatial intersect when it is requested by the user (e.g. for verification?) If that is the standard, then perhaps we do not need to store it separately. As long as every plot can be linked to a VC as and when it is necessary (only question I have is what happens when a vague plot location (e.g. a 10 x 10 km square) overlaps two vice-counties -- how does iRecord deal with this? Things that John has recently coded for the BSBI square-based entry page might be useful here (see issue #7).

andrewvanbreda commented 7 years ago

Hi Oli,

That is fine about the GPS precisions, just sounds like an attribute. I just want to check it wasn't something very complex with a learning curve.

As far as the vice county goes I think I understand the situation. You remember I made the suggestion we don't need to save the Vice County as it can be calculated, but then you said we do need to save it. I think actually we are both right. I will double check with John when we have the meeting what iRecord does, but I am pretty sure iRecord would do this. A. Save the location (in our case the plot) against the sample. B. Indicia then builds a quick access list of sample IDs against relevant location Ids for quick access by reports (spatial index builder module).

This latter activity occurs automatically on the Warehouse.

So we don't need to save the vice county explicitly against the sample in order to report on it quickly.

This makes sense, we don't want to save every single location a sample is in against a sample apart from inside that quick access index. e.g. a pond, a street, a suburb, a town, a county, a country etc Samples don't have all these fields in the database.

However, Plant Portal is different because the user enters the Vice County and County explicitly so we use these if there is no grid reference. Essentially they are not "dumb" fields and have code associated with them.

So I think actually we do ideally need an attribute for these to use during the import, however the index builder might also be put to use during reporting.

andrewvanbreda commented 7 years ago

Just a quick note. Am testing the importer with your large import file (6000 or so records). My initial observation is that it would be advantageous to eventually implemented (at low priority, for now anyway) your suggest of collapsible sections on the importer warnings screen, perhaps defaulting to collapsed over a certain number of records. Currently that screen shows the user exactly what it is going to do with each of the 6000 rows, and while it appears to do this fine, it makes the screen very long!

andrewvanbreda commented 7 years ago

Just some comments about performance and the loading of the test file you supplied. I have now been able to run through the large import file. Note that I did have to add a plot ID/name column. These are my observations:

Loading the import description/warnings screen - 40 seconds (no user feedback is currently provided here)

Create everything apart from samples/occurrences (e.g. groups/plots/group permissions) - Also 40 seconds (no user feedback is currently provided at this stage either)

Creating samples/occurrences - 18 minutes (user feedback bar is provided).

Note this later number might vary if it needs to talk to the live warehouse on another server rather than my local computer.. This later number should also be inline with the normal importer because currently it is effectively the same code, shout if you think this is excessive given your expectations/experience (6500+ records), although not sure what I would be able to do to speed this process though.

Note that there were 2591 rejected records, I didn't examine all the reasons, but looking I think it would mostly be rejected spatial references, which is to be expected as there is no code yet for handling vague ones. This does mean there are 1000s of records correctly created, and looking at the results it looks like it successfully created the plots/groups/permissions (whether these are all correct is not know, but they were correct for a smaller sample I loaded). So that is a good sign. If the plots in particular had failed creation, the whole import would have failed.

andrewvanbreda commented 7 years ago

Am just looking at practical solutions to the issue of bringing the occurrence checks forward in the import wizard. There are issues here, one is difficulty of implementation, and another is speed. The problem is we can't get our results without consulting the warehouse......possibly 6000 times, then we want to show the results to the user. Given that effectively we are required to ask the warehouse about the records (just without the database commit) to get the species errors, I am wondering if we can just take advantage of the existing code.

After the warnings screen I display to the user, the user and the records are passed to the warehouse and errors are reported on a row by row basis. There is an issue here, and that is if only some occurrences fail we have no way of correcting. We are left with a half-done import and occurrences that need attaching to imported samples, very confusing.

I am just wondering if a potential solution is to do this. Keep the interface (roughly) as before but only do the commit to the database if there are no errors at all, otherwise report them back to the user in a file as per currently happens.

In a way it would look very similar to the code I have now to the user, except some errors would cause no commits at all, and there would be a re-upload option.

This would effectively mean there are two warnings screens for the user, the one I have written specifically for plant portal about how the import with be processed (e.g. informing the user a sample might using an existing sample group). Then afterwards the warehouse is checked and species errors are returned to the user.

This does have the advantage that the user can make quick corrections from the first screen first without us having to worry doing a huge request to the warehouse as one of the the very first things in the import process. What do you think?

andrewvanbreda commented 7 years ago

Further to my message above, I put in some code on my machine to impede the submit but still report the errors, and then ran it, the performance of this was about 4 times faster than a full submit (5 minutes instead of 20). This suggests to me it would be possible to run that check, then auto rerun with full submit (if no errors) without a massive reduction in speed. So 20 minutes for upload would become 25 with 6800 records (assuming performance is same as my machine, which it wouldn't be in practice, although I don't know if it would be faster or slower). Another note on my suggestion is that it also has the advantage that it will report all errors the warehouse would normally report during import, not just species ones. One big disadvantage I can see at the moment is the code that handles this is quite a core indicia file (my_orm) that we wouldn't want to change, so I would need to find a way round that issue

sacrevert commented 7 years ago

In principle I think these ideas are good, although part of me wonders whether, if ultimately a user with errors will get a report and then have to correct species anyway, we should just provide the Indicia (UKSI) species list as a download, and advise them to check against it manually (e.g. Excel, we could provide guidance) before even attempting upload.

I wonder whether we should leave this decision until the TC with David next week - or will that mean that your time is not well spent between now and then?

andrewvanbreda commented 7 years ago

Hi Oli,

Don't worry about my time before next week as I can get on with the auto-generation of spatial references bit e.g. from Vice Counties.

You are correct in that the user will receive a report of some variety of the problem and they would have to act on it. This could be a file, or we can display it on screen.

As you say, the only way round this would be to try and resolve the problem before the user starts using the importer by providing them with a list f species to use or something.

Obviously you know what your requirements are and you will need to decide what you want, but maybe I can point out a couple of things that influence your decision.

A. The method I suggested would allow the system to catch mistakes the user makes (e.g. typos)

B. If you upload a large amount of historical data, it would catch problems in the data. Presumably your suggestion would only apply to users uploading their data as doing it manually for 6000 records would not work. Given that it would therefore be advantageous to do it for historical data, then the code may as well be left in for the user data (as user and historical data I had envisioned going through the same importer).

Anyway I am happy to wait for a decision on how to proceed.

sacrevert commented 7 years ago

Thanks, I will continue to think about it in advance of the meeting. I was thinking that the guidance provided, would, for example, instruct the user how to compare two lists using VLOOKUP in Excel, allowing them to, in theory, catch all errors before uploading. However, on relection, perhaps this is not that smart, because remaining errors would still need to be checked for and caught anyway (although it might minimise these and so speed the checking step?)

andrewvanbreda commented 7 years ago

Hi Oli, I am not that familiar with Excel apart from the opening a file with it once in a while, so anything you think would help please go ahead with. Just one comment, I do not believe the actual lack of any errors would have any computing performance advantage, as the check would still need to be made by the computer to know there wasn't an error. However given that users won't be uploading 6000 records (I am guessing!) then I don't think the check step will impact on performance to any kind of level that would cause it to be an issue.

OLP I expect the best thing will be the provision of a checklist that the user can check against in Excel, followed by the Indicia check. At the very least it will let the user know what is coming. But we can discuss next week.

andrewvanbreda commented 7 years ago

In regards to the processing of rows for the calculation of a spatial reference from a vice county/country, how many rows do you expect would need to be processed as a maximum? There is a performance overhead with this, but I need to work out what can be done practically. If instance we have a file with 6000 rows and everyone needs this calculating, then this could become an issue

OLP perhaps relatively rare, I expect most historical data complete enough to upload would be resolved to 10 km at least. I doubt there would be many plots with only VC/country level refs.

andrewvanbreda commented 7 years ago

Just a further thought on this issue, I am now thinking actually what I am going to do is simply have a list of vice counties and countries as part of the configuration on the edit tab configuration along with the grid reference we want to use for each. Then I can very easily get at the values without bothering the warehouse with trying to get these values. A little bit of bother for me to initially setup, but I think that is worth the gain.

OLP sounds good!

sacrevert commented 7 years ago

From AvB email (OP reply in bold)

I spoke to John straight after the meeting about detecting errors within the importer. Here are some notes from that discussion. Be aware that he may have a slightly different view depending on how well I was able to capture our discussion. I also only thought of some things after the discussion.

John suggested that a good way would be to have something similar to Recorder where an initial scan before import would detect errors. These could be displayed on screen (for example as red rows). Only once errors are corrected is the data committed by the user. This idea of an initial scan is roughly same as what I suggested so we are on same track. Sounds good.

John suggested that corrections could be made on screen, although I am now noting this as being different to Oli's desire to have corrections made in the original file for Plant Portal before re-uploading. John was not aware of that during our discussion as I only just thought of it. I think I would stick by this, just to save on development time, and to ensure that users amend the original data file.

-John suggested the importer could be modular with new sets of rules plugged in to the importer as required. This may be able to theoretically accommodate Plant Portal specific validation rules. No comment

The other problem with the importer is that it currently assumes a Sample is the same as another row if it looks the same (e.g. same place and date). So the suggestion is to include a sample external_key to identify uniqueness. This would replace the attribute used for this purpose in Plant Portal, however I don't think the change would cause major issues, it is just a different place to store the same thing. In fact it would make it easier if the main importer knew about the unique sample IDs. Yes, sounds good. I suppose this is essential where sample locations are vague to some extent. I agreed with John that these changes would be a significant development. So you would want to think about what you want to do and what you wanted to have implemented. I think I would need to talk to John further to get a proper idea of how long things might take, but I am thinking the error correction pre- import would be a matter of several days.

My own personal opinion would be this (this is my own opinion and hasn't been discussed with John)

Extend the existing importer to include warnings and only commit after warnings are fixed. Extend also to include the suggested external key.

Think about extending importer to include modular rulesets depending on budget.

The Plant Portal Importer is based on the existing importer so could pick up new functionality easily. My own thought would to also keep the existing Plant Portal warnings screen rather than rely on the importer's modular rule sets.

My main concerns are if we rely entirely on the generic importer for Plant Portal, it may be problematic if we suddenly decide we need a big change to how the Plant Portal importer works. If we have Plant Portal specific screens these can be changed. The other issue is that Plant Portal currently works by doing its own custom importing before relying on some of the existing importer. An example is that plots and groups are created and assigned to a user using the Plant Portal Warehouse module, and only after that does the module use the normal import code to import the samples (which can at this point use the newly created plots). The generic importer doesn't understand that Plant Portal needs to assign things to users. This is a point I thought of after discussing with John, so he wouldn't have been aware of the need for this during our discussion.

So my suggestion for Plant Portal is to keep its existing import screen, but allow it to use the enhancements for the generic importer. The disadvantage to this is that the Plant Portal importer will need to have two warnings screens, one for Plant Portal specific warnings, then one for more generic warnings that the warehouse needs to be called for. However two separate screens would allow us to display the first one more quickly to the user without waiting for validation from the warehouse. I think this is fine. Let me know your thoughts, there is no super rush to decide, as I mentioned in the call there are plenty of code tweaks to make.

andrewvanbreda commented 7 years ago

Email that was sent by John that isn't logged in Github:

Briefly summarising from my perspective - I don't get the impression that much of the code written for the Plant Portal importer with regards to error handling could be reused in a generic fashion. Therefore there is little benefit to the Plant Portal project of changing tack and we may as well continue with the custom importer Andy has been working on. However, some of the ideas that Andy and Ollie have discussed and started implementing could be borrowed in a future development of the standard importer. For example we could implement an extra step before the actual import where the code would iterate through rows in the CSV file and try to spot obvious errors (e.g. grid ref format or missing required fields). This could be done much more quickly as we aren't trying to save anything to the db. Any errors reported would need to be fixed before running the actual import. As you've already set up the mappings etc, I could see the advantage of fixing things using an online editor for the already uploaded rows, though I can also understand Ollie's viewpoint that forcing the user to fix the original file is a good thing, so I don't really have an objection either way. Note that once you've fixed the errors running the actual import might still cause additional errors to occur that weren't spotted by the pre-parse, though hopefully we'd minimise the likelihood of this happening. As Andy says, this change would be new code and would take several days to implement. Once done we could potentially make use of it in the Plant Portal importer to tidy things up, but other than code tidiness there are no other real benefits of making the generic change to the Plant Portal project.

andrewvanbreda commented 7 years ago

OK having reviewed all of the above. I think the plan should be

I'll get the Plant Portal test site running for you before doing the mentioned amendments.
Then change the existing generic importer to support the following:

A. Warn the user of issues before committing to the database, requiring the user to correct the original file. Possibly include an option to skip failing rows and continue with import anyway. OLP: Yes, OK. Presumably if some occurrences within a sample fail, then the existence of a user-provided sample ID can be used to relink these through a subsequent import? Currently user is only warned once occurrences are written to the database, this is not desirable as it can lead to confusion and also problems if only some occurrences for a sample have failed.

B. Introduce the concept of a sample ID into the generic importer. This would be used to allow occurrence rows to be linked to specific samples without ambiguity. Perhaps include option to use old method of linking occurrences to samples based on similar consecutive rows.

The Plant Portal would then be able to make use of these changes to the generic importer.

I think we should leave out the modular rulesets for now. I don't see these would be useful for Plant Portal specifically and I think it is better at this stage to get something useful working rather than being overly ambitious and not delivering anything. Perhaps if we keep modular rulesets in mind for the future? OLP: Fine.

The Plant Portal would still need to keep its initial import warnings page as there are things like Plot Groups that are specific to Plant Portal. Also we don't want a situation where we want to make specific changes for Plant Portal and can't because it is entirely reliant on the generic importer. So once the Plant Portal Import warning screen has been passed, the generic importer is run and another warnings screen is shown. This actually happens already, the difference will be that the 2nd screen will now be shown before writing to the database. OLP: Fine.