Indicia importer changes - observations

andrewvanbreda commented 7 years ago

I am going to be changing/improving the indicia importer to incorporate changes identified for the Plant Portal project but would generally be useful (as previously discussed on skype).

The main changes will be

Include an external_key field on the sample which can be provided as an import column and will allow the user to specify exactly what sample the occurrence rows from the import file will go into.
To allow the importer to flag issues with the import file and allow fulls corrections to be made before the import is committed, currently the system may just import some of the rows, leaving some samples half imported

My plan for the importer is as follows:

If the external key isn't provided, then the importer works as it used to, where concurrent rows that have the same sample details are put into the same sample
If the external key is provide, but the other sample details do not match, then flag as error and don't import, allow user to correct
User is required to correct original file before proceeding (there is no onscreen user interface for corrections as agree)
System has an option where the user can choose whether they want the importer to partially commit in case of errors, or whether want to make all corrections first. System administrators can hide this option from the user and have it set in the configuration file.

Question I do have a query about these changes though having thought about them, can the occurrences be attached to an existing sample already in the database or is it just to organise where the occurrences go into new samples? If we do allow this, how do we make sure it is only an existing sample they have "rights" to. One idea might be to limit attachment to only samples they have already created themselves? Also how does the user know if the occurrences are getting attached to the right sample, for instance if the user uses an external key "123", how do they know that isn't already in use unknowingly in the database. Given these problems am I correct that attachment onto existing samples isn't going to be allowed. Just thinking aloud here, am all ears if anyone has something to add.

Let me know if you disagree with any of this.

I will put anything else I notice as I code and test it into this thread.

Nested in #46

sacrevert commented 7 years ago

I do have a query about these changes though having thought about them, can the occurrences be attached to an existing sample already in the database or is it just to organise where the occurrences go into new samples?

OLP: In theory, that fact that users are being asked to make any corrections prior to submission should mean that there is no need to attach occurrences to existing samples. However, we can still imagine a situation where a genuine error is not picked up by the importer, e.g. a user not including data in the file for submission. If it keeps it simpler, could this situation not simply be dealt with by allowing users (with permissions) to delete samples? This way, if there was an error that had not been picked up, a user could just delete the samples and start again.

If we do allow this, how do we make sure it is only an existing sample they have "rights" to. One idea might be to limit attachment to only samples they have already created themselves?

OLP: If we follow the above strategy this issue should not arise.

Also how does the user know if the occurrences are getting attached to the right sample, for instance if the user uses an external key "123", how do they know that isn't already in use unknowingly in the database.

OLP: Ditto.

Given these problems am I correct that attachment onto existing samples isn't going to be allowed. Just thinking aloud here, am all ears if anyone has something to add.

OLP: Dealt with by first answer above.

andrewvanbreda commented 7 years ago

Yes they could certainly be allowed to delete samples. The Plant Portal Importer submits records at status "I" until the user selects the ones they actually want to be submitted "properly" from a list. I would imagine this list would have checkboxes for removal. However that is specific to Plant Portal, so my only query is your request for Plant Portal only, or do you mean the generic importer should include deletions? I would imagine the deletion functionality should be a part of any system that uses the importer rather than the actual importer itself

sacrevert commented 7 years ago

I meant that, post import, the Portal website would allow deletions. So, the deletion functionality would not be a part of the importer at all. Indeed, in might be months before someone realised that they had missed occurrences off of a sample, so, my thinking was that, at that point, a user with appropriate permission would be able to log in, delete the sample, and then re-import it.

andrewvanbreda commented 7 years ago

Yes that is fine, we can do that no problem

sacrevert commented 7 years ago

Great. I'm assuming that deleting things (or flagging things as deleted) doesn't introduce problems that will only become apparent when one tries to re-import samples/occurrences.

johnvanbreda commented 7 years ago

@andrewvanbreda I saw you were questioning the import_guid field. This is a unique identifier given to each bulk import operation. I.e. if I import 1000 records which get split across 500 samples, then the import_guid will be the same for all of them. This allows us to do things like bulk-undo an entire import. If you get errors and download, fix and upload them, they get tagged with the same import_guid so are kept together.

andrewvanbreda commented 7 years ago

Yes I vaguely worked out what the field did after I wrote the comment, then I pressed the wrong button and posted. Although thanks for the extra detail. Oli: If a sample is deleted with the deleted flag then effectively it can be thought of as completely deleted (we just set a flag in database as it allows undos). So this means the user can import the same sample again as if it were never imported in the first place. So I don't envisage any issues here.

andrewvanbreda commented 7 years ago

Just to note I have now got a version of the generic indicia importer working on my box that has an option that puts the importer into a mode which only commits to the database once all errors have been corrected by the user. This works pretty well (it isn't perfect yet because it doesn't work with every importer option combination e.g. it doesn't currently work with the "Skip mapping stage where possible" option). I will merge this new import behaviour into the plant portal importer which shouldn't be too difficult

I still need to implement the change involving the sample external key, however I have done some tests and there is already a sample external key which already works in a similar way to what we want. The sample external key must match across occurrences for the same sample to be used during import (providing the external key is provided). I think actually, the only thing I need to change is to warn the user if the external key matches on rows but other fields such as date don't, currently a new sample is created automatically in this case. Am I correct in thinking that it is OK to still require occurrence rows going into the same sample to be clustered together in the import file, I assume that is OK. If clustering is required, I will also need to warn the user if rows not clustered together match each other, otherwise we might end up with multiple samples in the database with the same external key. I assume the clustering requirement also reduces the likelihood of errors

sacrevert commented 7 years ago

Am I correct in thinking that it is OK to still require occurrence rows going into the same sample to be clustered together in the import file, I assume that is OK.

I don't think we have ever discussed a clustering requirement in the import file. Although it would be sensible for the user to arrange stuff like that, I don't think we need make it an absolute requirement. So, as far as sample keys go, I don't think it is a problem if non-clustered rows match, we will just have to make it clear that the sample key is the main group indicator for samples, regardless of how the rows in the file are organised.

andrewvanbreda commented 7 years ago

Hi Oli, Sorry for my slow reply, it seems I didn't get an email when you replied like I normally get from Github for some reason. The current importer requires that rows are clustered together to go into the same sample, so changing that does make the programming more complex, however I think it is probably more user friendly to remove this requirement when a sample key is used for matching so I will do that. If the user chooses to import with the old rules (it will be an option) then clustering would be required as usual Note that this change will also affect the generic importer not just the Plan Portal one as was discussed on Skype with David and John a while back.

Just so you know, the Plant Portal situation is currently as follows:

I have a version of the new importer that doesn't commit if there are any errors mostly working on my box, however I am in the process of fixing (difficult) bugs in this as the importer has several modes.
Those changes to the generic importer need merging into the Plan Portal code also
Awaiting coding of the sample key changes to the importer as discussed.
The test site was updated a while back and was theoretically ready to go (with older code), but I found it gave errors, so am awaiting a response from Biren on this (I know what the problem is)

sacrevert commented 7 years ago

OK, thanks Andy. I assume that you will have consulted with @johnvanbreda and @DavidRoy regarding these proposed changes to the generic importer (and the resulting choice that a user would then need to make re clustering in the input file)?

andrewvanbreda commented 7 years ago

Not recently. Only with what we discussed the Skype meeting because I think everything I am implementing is the same as what we agreed in that meeting, the only thing is it was a while back and they might of forgotten what I am doing. A reminder is it is fundamentally two new modes for the generic importer which will also be used by the plant portal importer. A. Provide a mode where the user can choose not to commit anything at all to the database if any errors are found at all B. A mode whereby instead of the system guessing when occurrences are to go in the same sample by comparing dates/grid refs on rows next to each other, we use the sample external key. As noted above, when I tested this I found the importer is actually already fairly close to doing this, but requires changing to allow occurrence rows not to be clustered together for the same sample, but also changes to warn the user if sample keys match but the other data doesn't

The only thing we didn't discuss in the original meeting is the removal of the clustering requirement when the sample external key is used for matching, but that would seem to make sense not to limit it in that way. When I refer to "clustering" I am referring to the current importers need for occurrences rows to be next to each other going to the same sample, however if the situation is we are use a sample key to do the matching we can remove this need. If @johnvanbreda and @DavidRoy have any comments to make let me know.

andrewvanbreda commented 7 years ago

Hi. How important is the need for rows not to be clustered together when we are using a sample external key to match occurrences rows into samples during import. I am currently looking to implement this change, but I think it will be quite a lot of work because the current importer relies heavily on the occurrence rows to going into the same sample being clustered together on consecutive rows. So I want to check we definitely need this bit, perhaps it would be OK to still have the clustering requirement as it would be more usual to have this arrangement of the data in the import file?. I know we discussed this above, but I have now determined this would be quite a big change having looked at the code.

I have currently coded a check during the import which detects that the import sample data is consistent between occurrences when the same sample external key is used, then it warns the user and shows the failing rows if they are inconsistent. For instance, a warning is given if there are two rows, both with sample external key "12345" but the dates are different, the warning is given in this case because the user is effectively telling the system I want these two occurrences to go into the same sample but the dates on the 2 rows are different. This extra warning screen could easily be extended to warn if the data is not correctly clustered together if that is the way we decide to go. This would be a much easier approach than changing the importer to accept non-clustered data when a sample external key is used.

Note that this change affects the "normal" importer as well as the plant portal one, although on the normal importer all the new features I am programming are currently coded to be optional (optional too on plant portal if we so wish)

sacrevert commented 7 years ago

Hi. If the removal of the clustering requirement is a pain, then I think we should abandon it. It is not exactly a strange thing to enforce, and so I have no problem enforcing the clustering of occurrences within samples on users.

andrewvanbreda commented 7 years ago

OK, I will do that then as I think it would be hard to implement and provide good value for money. I will implement a warning where if two rows are found to have the same sample external key but are apart in the file then the user should cluster them together (otherwise separate samples will be created).

andrewvanbreda commented 7 years ago

These changes are coded now into the normal importer (the changes required for using Sample External Key as the identifier, and also the importer mode that doesn't commit unless all errors are fixed). I will merge these changes into the Plant Portal importer and get Biren to update the Plant Portal dev site. I think at that point I will need to discuss the project further with you as I will have implemented the bits we have discussed in detail so far.

sacrevert commented 7 years ago

OK, fine - many thanks. I will be away (work, then holiday) from next week, back on the 8th May, but I'm sure I'll be on email.

andrewvanbreda commented 7 years ago

OK no problem. David might want to check it so I will keep him informed if the code gets onto the dev site, it might depend on when Biren is available to put the code on the site. Just fixing a bug I found now first

DavidRoy commented 7 years ago

I'd be interested to test when ready, thanks

andrewvanbreda commented 7 years ago

Hi David,

No problem. If you want to keep an eye on how things are progressing I have a thread here

https://github.com/BiologicalRecordsCentre/BSBI-Card-and-PlantPortal/issues/47

Those are the requests I make to Biren so you can prioritise them. For now I told him to ignore the previous request until I get the new code to him. I will put the new requests in there when I am ready

andrewvanbreda commented 7 years ago

Just to give you a quick update. The changes to the "normal" importer are ready and committed, that includes support for the different import modes I tried (such as auto-skipping the mapping stage, using occurrence associations, using the different models such as locations). I have tested this, and everything I tried seems to be happy. I have code merged these changes into the Plant Portal importer and that is coded, however I am still testing that merger. I hope that testing will not take too long, but there are a lot of option combinations to test. Also just to clarify the position as I realise this statement was a bit vague. "I think at that point I will need to discuss the project further with you as I will have implemented the bits we have discussed in detail so far."

By "the bits we have discussed in detail so far" I mean the bits I have discussed on the phone in detail with Oli. So that is the importer such as the use of Plot Groups, assigning plot groups to particular people, the screen where the user is informed if there are any issues with the data and in what way it will be importer (e.g. it will tell the user of an existing plot group or plot will be used) Noting this used to be a bit more complex, but as agreed the sample group function has been removed. That removal has been coded.

Any other Plant Portal screens have not been coded yet, but I would expect there to be some other general report screens available when you test such as obviously being able to see your records, hopefully view your groups too.

sacrevert commented 7 years ago

Thanks for the update - let me know if you want to talk on the phone at some point.

andrewvanbreda commented 7 years ago

Hi Oli,

It is prob best to talk once I have passed the code to Biren, until that point I am super-busy working on it anyway. The situation as of this morning is I have got the Plant Portal Importer to now merged and working with the new importer code. So this code is now working with both the "normal" importer and the Plant Portal one. I have also committed the code for all of this, however I would like to simplify the code a little and perform a few further tests (such as your data file) before I passing it to Biren to put on the site. I won't be able to work too much this afternoon or monday, so I unsure when I will reach that point yet.

andrewvanbreda commented 7 years ago

Sorry for the delay, I am trying to fix a bug I found in the more complex uploads I have tested. The problem is the spatial references that are auto-generated and shown on the information/warning screens fine but are not being passed to the warehouse. This creates problems when it is trying to compare the spatial references between samples to see which sample the occurrences should go into..oh..complications....complications

andrewvanbreda commented 7 years ago

Hi, I have now instructed Biren on the required updates, once he has done that I will configure the site for testing. Sorry for the delay on this latest release, the problem is really that whenever a test fails there is always the risk of other breakages. The nature of an importer is that it tends to be completely useless unless it is almost entirely working, unlike a website which can get still be useful with bugs in different place.

andrewvanbreda commented 7 years ago

Hi,

Just to let you know I have just attempted to configure the site but it does not appear to be running the latest code even though an attempt was made to update it. I will work with Biren to resolve that.

andrewvanbreda commented 7 years ago

@sacrevert @davidroy @johnvanbreda the changes to the "normal" importer that have come from the discussions on the Plant Portal project are now available here if you wanted to test/check them.

http://test.brc.ac.uk/plantportal/normal-importer

Perhaps it might be a good idea if you take a look without me explaining in detail because obviously it needs to be understandable to a user without my input, so if you can't work out what is going on something needs changing with the user interface. Two things to note.

These are just the changes to the normal importer, that importer isn't specific to Plant Portal and probably won't be on the final site, however I have to put it on that dev site so you can see it. The Plant Portal specific importer should be ready, but I haven't given it a check and the all clear yet. Will let you know once I have checked it is running OK on that site
There are 2 new checkboxes you can see. These checkboxes can... A. Both be visible and editable by the user (as seen on the current page configuration). B. Hidden on screen by the administrator and set on the Edit Tab. C. Be controlled independently e.g. One checkbox could be user controlled, and one could be hidden by admin D. When the importer is created, the default is that both checkboxes are hidden and set to the way the traditional importer worked i.e. unless an administrator deliberately adjusts their Edit Tab settings you won't notice any difference with the old importer.

johnvanbreda commented 7 years ago

@andrewvanbreda The test server linked above seems to have a config problem with MySQL. However it might be a good idea if I could review the code changes as well - can you point me to the commits or relevant branches?

andrewvanbreda commented 7 years ago

@johnvanbreda Is it this you are getting? that is the error I am getting "Error

The website encountered an unexpected error. Please try again later. Error messagePDOException: SQLSTATE[HY000] [1862] Your password has expired. To log in you must change it using a client that supports expired passwords. in lock_may_be_available() (line 167 of D:\web sites\drupal\v7\multisite\includes\lock.inc)."

Biren can fix that

andrewvanbreda commented 7 years ago

try feature-new-importer-mode both a client_helper and warehouse branch are called this

In the warehouse there is a variable throughout the code call allowCommitToDB, I think most of the changes relate to that.

johnvanbreda commented 7 years ago

@BirenRathod see the error at http://test.brc.ac.uk/plantportal/normal-importer.

BirenRathod commented 7 years ago

@andrewvanbreda & @johnvanbreda, Yes, The expired password error has been resolved now.

johnvanbreda commented 7 years ago

Thanks Biren.

Andy - comments on the UI:

Prevent commits on error - this needs better wording, most users won't understand the word commit. Perhaps "Reject entire import if there are any errors" vs "Reject only rows with errors".
Use sample external key to match occurrences into sample - also needs better wording for the same reason. Perhaps "Auto-detect samples" vs "Samples determined by a provided sample key field" or something like that.
Should the above 2 settings not go on the "Import Settings" page rather than the file selection page?

Code comments:

Apply Drupal standards? Lots of stuff where there are no spaces between things.
You've updated the ORM code in system/libraries. Although we can do this as we are maintaining our own Kohana effectively, I think it should only be done when necessary. Could we not change My_ORM so that when it calls the validate method, it sets the $save parameter to false when not committing? Or, perhaps better would be to edit the overridden validate method (line 317 of My_ORM) to force $save to FALSE when not committing.
In the MY_ORM class, the submit method (called for each row I think) will do a BEGIN, then only do the COMMIT if $allowCommitToDb is true. This means you effectively get a BEGIN for every row and no commit and no rollback either (no necessary because nothing gets saved). Should the BEGIN be skipped as well if $allowCommitToDb is false?

andrewvanbreda commented 7 years ago

I will try to make some of the simpler changes, but I am not sure we have budget at the moment for any changes that would require significant retest. I will let you know what I manage to do

andrewvanbreda commented 7 years ago

Wording changed to Reject entire import if there are any errors: with helpText "Select this checkbox to prevent the importing of any rows if there are any errors at all. Leave this checkbox switched off to import valid rows."

and

"Samples determined by a provided sample key field:" Select this checkbox to import occurrences onto samples using the sample external key field to determine which sample the occurrences are placed into. If you leave this checkbox off, the system will group similar consecutive occurrences into the same sample.

Can't use "Auto-detect samples" wording, as both modes auto-detect samples, just in different ways. I have moved the settings onto the import settings tab and it still seems to be ok.

I can't really change the warehouse at the moment, I spent a while finding a way to get this to work. Will have to leave for later. On the plus side the current code is known to work well. I think I tried altering the $save variable originally and I couldn't get it to work, if I remember it stopped it reporting the validations. Although I can't remember clearly, I will try another time. I will put this to try again in my low priority To Do list.

andrewvanbreda commented 7 years ago

Again in reference to MY_ORM, I am not sure off the top of my head, again it is something I would need to try again later. However I remember clearly the main gripe I had with getting it to work was the validation. Getting it to not save was easy, but I remember having various issues with getting it to report back the correct errors to the user. In the end I was having to let it perform a normal import, but just cut out the commit line. It should be noted this error checking stage is significantly faster that the committing stage, maybe 2 or 3 times faster. So the performance the user sees is acceptable for them, although I accept we might be able to improve the speed for warehouse performance if we can

BiologicalRecordsCentre / BSBI-Card-and-PlantPortal-DEPRECATED-

Indicia importer changes - observations #49