CrisisCleanup / crisiscleanup-1

[OLD] Legacy Crisis Cleanup on GAE/Python
https://sandy-disaster-recovery.appspot.com

Import Work Orders Feature #133

Open aarontitus opened 11 years ago

aarontitus commented 11 years ago

Original author: v...@aarontitus.net (January 13, 2013 21:56:21)

There needs to be a method to import hundreds or thousands of records using a combination of a standard, downloadable .csv or .xls/.xlsx file (which updates as the database schema is updated) and a wizard to fix records with errors in addresses, claimed organization, status, etc. The feature must be very fault-tolerant if it is going to be available to all participants.

Alternatively, a less fault-tolerant import feature could be implemented if available ONLY to the administrator. The system would rely upon the administrator to fix any logical errors in the combination of status and claim.

This is a very high priority.

Original issue: http://code.google.com/p/sandy-disaster-recovery/issues/detail?id=133

aarontitus commented 11 years ago

From v...@aarontitus.net on January 25, 2013 04:21:10 From an email: Just to be clear, the requirements are: Q1: You have an Excel/CSV file that contains data in a standard structure based on the database schema. This structure is not standardised yet, and after it is standardised, we should update it whenever we change the schema.

A1: Yes, but it is more likely that each incident will have a dynamic schema, so we shouldn’t wait for a standardized schema. Instead, we should probably create a tool that allows you to download a template for the “current” schema, fill it out, and upload it. Andy Gimma should be able to provide a little more detail about that.
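The "download a template for the current schema" idea could be sketched as follows. This is illustrative only: the field names, example values, and function are hypothetical, and it is written in modern Python rather than the app's GAE/Python 2 codebase.

```python
import csv
import io

# Hypothetical field list, as might be pulled from the "current" schema
# of an incident's assessment form (names and examples are illustrative).
EXAMPLE_FIELDS = [
    ("name", "John Smith"),
    ("address", "123 Main St."),
    ("city", "Anytown"),
    ("state", "PA"),
    ("zip_code", "12345"),
]

def make_template(fields):
    """Render a downloadable CSV template: a header row taken from the
    current schema, plus one dummy example row that the importer can
    later recognize and skip."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([name for name, _ in fields])
    writer.writerow([example for _, example in fields])
    return out.getvalue()
```

Because the template is generated from whatever fields the incident currently defines, it stays in sync with a dynamic schema rather than requiring a standardized one up front.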

If you upload this file to the server, then: Q2: All the records in the file are added to the database if they pass validation; otherwise a wizard that walks the user through fixing each error is shown. Is this correct?

A2: Yes. That sounds correct. Validation consists of: 1. Is it a valid address that Google Maps can locate? 2. If so, is it a duplicate? See #83. If it fails validation, then give the user the opportunity to fix the addresses, or merge duplicates.

aarontitus commented 11 years ago

From v...@aarontitus.net on January 26, 2013 19:46:11 Process:

  1. Org End User clicks "Download Incident Template"
  2. System queries the relevant Incident Assessment Form to get required/relevant fields.
  3. System creates a blank CSV template based upon the fields in step 2. Each field should contain dummy example information, e.g. "John Smith, 123 Main St., Anytown, PA, 12345", etc.
  4. Org End User downloads blank template locally.
  5. Org End User manually normalizes data he/she intends to import into the blank template.
  6. Org End User sends the unvalidated CSV to the Super Admin and/or Incident Admin. 6.1 "Send" could mean email, upload, etc.
  7. Admin manually validates info (e.g. visually checks to make sure that dates aren't entered into the wrong field).
  8. Admin uploads CSV
  9. System validates the CSV and handles exceptions.
     9.1 Validations:
       9.1.1 Address validation: use the Google Maps API to validate the address and calculate lat/long.
       9.1.2 Duplicate detection: use the algorithms developed for #83.
       9.1.3 Data type validation: is it a date/time, Boolean, etc.? Note: "Yes" (case-insensitive) and "True" (case-insensitive) should both be interpreted as Boolean TRUE. "No" (case-insensitive), "False" (case-insensitive), and NULL should be interpreted as Boolean FALSE.
       9.1.4 Ignore dummy example information (see step 3 above).
     9.2 If a record passes validation, the System enters the record as a new Work Order.
     9.3 Exception handling:
       9.3.1 If a record fails a validation in 9.1, add a column on the extreme right with the error message.
       9.3.2 Compile all failed records into a new CSV file with the error messages in the extreme-right column. Send the file back to the Super Admin.
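The Boolean rule in 9.1.3 is small enough to pin down in code. A sketch (the function name is illustrative; the '1'/'0' forms were agreed in a later comment on this thread):

```python
def parse_boolean(value):
    """Interpret a CSV cell as a boolean per the import rules:
    "Yes"/"True" (any case) are TRUE; "No"/"False" (any case) and
    NULL/blank are FALSE.  '1'/'0' were added in a later revision."""
    if value is None:
        return False
    normalized = value.strip().lower()
    if normalized in ("yes", "true", "1"):
        return True
    if normalized in ("no", "false", "0", ""):
        return False
    raise ValueError("not a recognized boolean value: %r" % (value,))
```

Raising on anything unrecognized (rather than guessing) lets the importer surface the cell in the error column instead of silently storing a wrong value.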
aarontitus commented 11 years ago

From cpw...@gmail.com on February 07, 2013 13:20:53 I am currently looking at this (and planning for duplicate detection - #83 - to come afterwards).

aarontitus commented 11 years ago

From cpw...@gmail.com on February 07, 2013 18:28:38 Please see attached screenshot for a partially complete administrator tool for this.

In this approach, the Admin will have the option at the foot of the page to (a) edit (replace) any erroring values before submission, (b) download a CSV file with the errors annotated, or (c) submit only the valid entries and download the remaining invalid rows as CSV.

This is a mix of #1 and #2 above - is this preferable, or should the form UI be dropped in favour of CSV input/output only?

aarontitus commented 11 years ago

From v...@aarontitus.net on February 07, 2013 19:24:27 This approach is GREAT! I'd like to see the edit feature, to make sure that it's user-friendly enough.

aarontitus commented 11 years ago

From cpw...@gmail.com on February 07, 2013 19:50:51 To edit a row (prior to submission), the idea is to use the form controls as shown in the screenshot - controls appear next to the fields with errors (in red).

Better than a separate UI? This is in-place.

aarontitus commented 11 years ago

From v...@aarontitus.net on February 07, 2013 21:40:03 Cool. Not pretty yet... but it doesn't need to be! Looks good. Excited to test.

aarontitus commented 11 years ago

From v...@aarontitus.net on February 07, 2013 23:09:49 Also, we would probably only want to display results with errors; rather than all of them.

aarontitus commented 11 years ago

From cpw...@gmail.com on February 08, 2013 10:40:50 Perhaps that should be an option when viewing the page? i.e. show all / show only errors

Showing all the fields allows manual validation by the Admin, meaning you wouldn't need to check the CSV before upload (as in #2 above).

aarontitus commented 11 years ago

From cpw...@gmail.com on February 25, 2013 19:20:47 I have submitted a version for testing - r186 - and it is available on the testbed app.

It is missing the following features discussed above:

Also, it handles only the current model of Sites (i.e. without phases).

aarontitus commented 11 years ago

From cpw...@gmail.com on February 25, 2013 19:22:47 It's available to the Global Admin only in the Admin interface.

aarontitus commented 11 years ago

From v...@aarontitus.net on February 25, 2013 23:53:26 Attempted to upload a single record and failed. On upload, received the following message (in total): https://sandy-helping-hands.appspot.com/admin-import-csv

500 Server Error

Error: Server Error

The server encountered an error and could not complete your request.

If the problem persists, please report your problem and mention this error message and the query that caused it.

I think we need a lot more exception handling. I know I was confused about how to fill it out.

  1. Created Org, "MHH-Cherry Hill Stake"
  2. Created Org, "NJ 2-1-1"
  3. Uploaded attached file
  4. 500 Error.
aarontitus commented 11 years ago

From v...@aarontitus.net on February 26, 2013 00:09:15 Questions about fields:

case_number: What happens if I don't leave this blank? E.g. what if I write "$295Mt" as the case number?
name: Are special characters allowed?
reported_by: Is a name, like "NJ 2-1-1", permissible?
claimed_by: Is a name, like "NJ 2-1-1", permissible?
request_date: Date format?
address:
city:
county: What happens if I leave this blank?
state:
zip_code:
cross_street:
landmark:
phone1:
phone2:
time_to_call:
rent_or_own:
work_without_resident:
older_than_60:
disabled:
special_needs:
electricity:
standing_water:
tree_damage: I don't think this field is used. We should eliminate it.
tree_damage_details: Is this field used? If not, we should eliminate it.
habitable:
work_requested:
others_help:
debris_removal_only:
work_type: Does "Trees or Wind" need to be spelled precisely?
flood_height:
floors_affected:
carpet_removal:
ceiling_removal:
debris_removal:
broken_glass:
drywall_removal:
heavy_item_removal:
standing_water:
pump_needed:
num_trees_down: What happens if I put a non-integer here? e.g. "15-20"
num_wide_trees: Same question.
roof_damage:
tarps_needed:
goods_and_services:
tree_diameter: Is this used? If not, then remove.
electrical_lines:
cable_lines:
cutting_cause_harm:
other_hazards:
insurance:
notes:
latitude: I assume I can leave this blank, right?
longitude:
priority:
inspected_by:
status:
assigned_to:
date_closed:
total_volunteers:
hours_worked_per_volunteer:
event: Can this be left blank, or at least be pre-filled in the template?
initials_of_resident_present:
member_of_assessing_organization:
first_responder:
hardwood_floor_removal:
mold_remediation:
appliance_removal:
do_not_work_before:
prepared_by:
status_notes:
derechos_work_type: What is this, and can we remove "derechos" from the database?

aarontitus commented 11 years ago

From cpw...@gmail.com on February 26, 2013 12:06:25 Re #12: the 500 error is due to a couple of unhandled exceptions - will be fixed.

Regarding the confusion: instructions need to be added to the first admin page - will do, following agreement on the below.

Re #13: general questions:

Blank fields are.

All of these (in the next revision), plus '0' and '1'.

Open question: Should dates be American-format only, or forced to international/ISO (e.g. 2012-02-26)?

Initially, yes, but this could be configured away.

Yes, I have added this as #204.

More to follow.

aarontitus commented 11 years ago

From cpw...@gmail.com on February 26, 2013 12:16:46

Should this only be assigned by the system?

They should be.

Only if it can be looked up in the Organization table - though this is not implemented yet.

This is ok as long as the address geocodes as a whole (and the Google geocoder is very forgiving).

Moved to #204.

The allowed work_type values should be in the instructions - they currently appear in the error messages (see screenshot in this thread).

I am not sure about this. The field needs to be an integer at present.

We could parse for the biggest integer given?

Moved to #204.

These should probably be hidden.

This should definitely be hidden.

Moved to #205.

aarontitus commented 11 years ago

From v...@aarontitus.net on February 26, 2013 14:05:41

Any fields that are: A) not referenced in a particular Incident, or B) Assigned solely by the system during import should be hidden. Category B) includes:

case_number latitude longitude (latitude_blur, when we add it) (longitude_blur, when we add it) event

aarontitus commented 11 years ago

From cpw...@gmail.com on February 26, 2013 14:52:35 Agreed - the four existing fields you named will be excluded/hidden.

aarontitus commented 11 years ago

From cpw...@gmail.com on February 26, 2013 17:11:43 Blurred field names also now excluded as in patch on #148 (r192).

aarontitus commented 11 years ago

From cpw...@gmail.com on February 26, 2013 17:25:49 A new version (100) that deals with everything here is now available on the testbed.

aarontitus commented 11 years ago

From cpw...@gmail.com on February 26, 2013 17:29:23 Still todo: determine an approximate maximum number of rows that can be processed this way (before GAE intervenes).

aarontitus commented 11 years ago

From cpw...@gmail.com on February 26, 2013 17:29:54 There is also no duplicate detection (#83).

aarontitus commented 11 years ago

From cpw...@gmail.com on February 26, 2013 18:07:42 Re #20: I have tested it on localhost with up to 500 valid rows - it took more than one minute, which will cause a request timeout on the appengine platform.

The main cost is bulk geocoding of addresses.

It will be possible to use the Task Queue and the 'deferred' library to run the geocoding and, secondly, saving in the background - but this will mean storing the file and having the Admin user wait for a background process to complete.
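Under the deferred approach, the upload would be split into small chunks, each handed off to its own background task. A minimal sketch of the splitting step (the chunk size is a guess; `deferred.defer` is the GAE call referred to above, omitted here so the sketch stays self-contained):

```python
def chunk_rows(rows, chunk_size=50):
    """Split parsed CSV rows into fixed-size chunks so that each
    background task (e.g. one deferred.defer() call per chunk on GAE)
    finishes well inside the request-time and memory limits."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]
```

Each chunk would then geocode and save its own rows, with the Admin polling a status page rather than waiting on the request.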

aarontitus commented 11 years ago

From cpw...@gmail.com on February 26, 2013 19:11:31 Blocking on removal of Derechos special case handling.

aarontitus commented 11 years ago

From v...@aarontitus.net on February 27, 2013 14:22:35 This seems to be working, except that I keep getting an error related to #21, Comment #27. We'll have to fix that error and try again.

aarontitus commented 11 years ago

From v...@aarontitus.net on February 27, 2013 14:28:24 Importing 500 work orders or more will be common. I had anticipated that there would be a Task Queue, and the admin could see a list of batches "In process", and "Completed," then click on the results links once the imported batches are complete.

This would mean storing the imports somewhat permanently. The Admin should have the option to "Delete Completed Imports," and "Stop and Delete All In-Process Imports."

Does that make sense? Can you visualize this?

...and I know that duplicate detection is on the way.
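The batch list described here implies that each upload is stored with a lifecycle state. A toy sketch of such a record (all names are hypothetical; the real app would persist this in the datastore):

```python
# Forward-only lifecycle for a stored import batch (names illustrative).
IMPORT_STATES = ("queued", "analysing", "saving", "completed", "failed")

class ImportBatch(object):
    def __init__(self, filename, event):
        self.filename = filename
        self.event = event
        self.state = "queued"

    def advance(self, new_state):
        # Refuse backward transitions so a finished batch cannot be
        # reverted to an in-process state by a late background task.
        if IMPORT_STATES.index(new_state) < IMPORT_STATES.index(self.state):
            raise ValueError("batch cannot move backwards from %s to %s"
                             % (self.state, new_state))
        self.state = new_state
```

"Delete Completed Imports" and "Stop and Delete All In-Process Imports" then become simple queries over the stored batches by state.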

aarontitus commented 11 years ago

From cpw...@gmail.com on February 27, 2013 22:41:39 Agreed regarding the need for CSV processing to run as a background batch, with admin control available.

In the absence of general detection of duplicates, it may be necessary to mark which rows of a CSV file have been accepted or not (yet).

aarontitus commented 11 years ago

From v...@aarontitus.net on March 01, 2013 22:22:31 Just to be clear: http://code.google.com/p/sandy-disaster-recovery/issues/detail?id=21#c27 is blocking implementation of this. I have been unsuccessful at importing because organizations have not been properly associated with the Incident.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 05, 2013 05:11:54 Error: Imported the attached .csv to Derecho Storm event in sandbox. No errors reported during import. The results page read, "Geocode OK". However:

  1. No lat/lon was entered. lat/long both read "0.0".
  2. The total number of work orders did not increment up. It still says "2 work orders," and should say "3 work orders."
  3. No Work Order Number was entered. https://sandy-helping-hands.appspot.com/sites shows the imported work order, but the work order number reads, "None:"
  4. No county entered. It looks like the interaction with the Google Maps API is just faulty. All other information appears to have been entered correctly.
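A 0.0/0.0 result suggests a failed or unparsed geocoder response being stored as though it were a real fix. A defensive sketch of the extraction step, matching the shape of the public Google Geocoding API JSON (the function is illustrative, not the app's actual code):

```python
def extract_latlng(response):
    """Pull (lat, lng) from a decoded Google Geocoding API JSON
    response, or return None when nothing matched -- so a failed
    lookup is never silently stored as 0.0/0.0."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    location = response["results"][0]["geometry"]["location"]
    return (location["lat"], location["lng"])
```

Returning None forces the caller to treat the row as a geocode failure instead of saving a bogus coordinate.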
aarontitus commented 11 years ago

From cpw...@gmail.com on March 05, 2013 11:28:59 I confirm all of these and am working on them now.

Please note: testing against Derechos will not work fully because of the existing blocker, #205.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 05, 2013 11:33:28 I think that the work order total is a separate issue - it's not always updated in other cases. Added as new #213.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 05, 2013 12:39:10 Background importing and saving + fixes for #29 above released as r201, version 104 on the testbed.

I am testing with large imports on the testbed now.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 05, 2013 12:59:09 100 row CSV file (of identical records): analyses and saves ok

500 row CSV file: causes memory error - "Exceeded soft private memory limit with 156.738 MB after servicing 5 requests total"

aarontitus commented 11 years ago

From cpw...@gmail.com on March 05, 2013 18:14:37 Outstanding bugs:

  1. Tracking down the memory leak has proved difficult. Could it be in the SDK..?

Until fixed, there is protection against failure: import tasks are not automatically retried by the platform (although it really wants to) and a failure is shown to the Admin in the table.

  2. Exceeding the quota of the Google (Maps) geocoder causes rows to fail validation. At present, they cannot be retried.
aarontitus commented 11 years ago

From cpw...@gmail.com on March 05, 2013 18:15:34 Some numbers again:

100 row CSV file (of identical records): analyses and saves ok

400 row CSV file: analyses and saves ok

500 row CSV file: causes memory error - "Exceeded soft private memory limit..."

aarontitus commented 11 years ago

From v...@aarontitus.net on March 07, 2013 15:00:03 I've been testing for the past couple of days. Is there any way we can automatically do the following whenever I'm uploading a large batch (e.g. 1000 people in batches of ~350):

...this way, I can upload a ton of .csv files, and forget about them for a day or two until they're done.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 07, 2013 15:10:57 I really like the import feature! Working well.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 07, 2013 15:14:11 Strange error on sandbox:

Rockaways_crisiscleanup.org_import.csv Hurricane Sandy Recovery Saving... 274 5 279
Rockaways_crisiscleanup.org_import2.csv Hurricane Sandy Recovery Saving... 270 12 275
Rockaways_crisiscleanup.org_import3.csv Hurricane Sandy Recovery Saved all valid rows 273 8 273

With the first two, all rows were "Saved," including the invalid rows, but for the third one, only valid rows were saved.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 07, 2013 15:16:27 Related error: invalids1.csv Hurricane Sandy Recovery Error: Analysis failed - too large 6 0 0

This is a very small file, but has failed twice. Reason unknown.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 07, 2013 15:17:30 ...oh, and invalids1.csv is a collection of CORRECTED work orders that were invalid.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 07, 2013 16:21:30 I will look in to the errors in the individual files. If they are confidential and thus can't be posted here, please could you email them to me?

aarontitus commented 11 years ago

From cpw...@gmail.com on March 07, 2013 16:26:58 Re handling batches:

I have looked at the Google Maps geocoding limits today.

Surprisingly, given that Google Maps now limits per IP address (rather than per API key, as it used to), there is no special handling for GAE, meaning that GAE apps share a common quota by virtue of being on shared IP addresses (without proxies in place).

Given this, and the memory issues, it may be better to rewrite/refactor the import code to work in a row-by-row fashion - i.e. don't do anything in bulk and do all analysis in single row chunks. This will support a more robust flow closer to #36.
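A row-by-row rewrite could be as simple as streaming the parse and validating each record independently (a sketch; `validate` stands in for the real address and data-type checks):

```python
import csv

def analyse_row_by_row(csv_lines, validate):
    """Analyse a CSV upload one row at a time.  csv_lines is any
    iterable of CSV text lines (an open file or a list); validate
    returns an error message or None.  Nothing is processed in bulk,
    so memory stays flat and a geocoder-quota failure can mark just
    the affected row for retry."""
    valid, invalid = [], []
    # Data starts at line 2 of the file (line 1 is the header).
    for line_no, row in enumerate(csv.DictReader(csv_lines), start=2):
        error = validate(row)
        if error is None:
            valid.append(row)
        else:
            invalid.append((line_no, error))
    return valid, invalid
```

Because each row is self-contained, the same loop body could later run inside a background task per chunk without holding the whole file in memory.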

aarontitus commented 11 years ago

From v...@aarontitus.net on March 07, 2013 16:31:33 BUG: There seemed to be an error in invalids1.csv. I'm not sure what it was, but I assumed it occurred on Row 7, since only 6 rows were processed... twice in a row. I placed Row 7 at the end, and re-submitted. It then failed on record 25... and I figured out the error. That particular record has no address. So instead of a geocode error, it throws a "file too large" error.

BUG: After successful import of valid rows into the system, they do not appear on the map, nor in the map search.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 07, 2013 16:47:53 Re the map bug, was this using the Derechos or Hattiesburg incidents on the testbed? I have noticed this happening sometimes - as mentioned in #30 above.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 07, 2013 16:47:59 POTENTIAL BUG: The status of both "Rockaways_crisiscleanup.org_import.csv" and "Rockaways_crisiscleanup.org_import2.csv" are "Saving..." even though it's obvious they're done. Are these two processes tying up system resources/costing money?

aarontitus commented 11 years ago

From cpw...@gmail.com on March 07, 2013 16:53:23 Re #45: the current testbed Task Queue is empty, so it's not currently using any resources.

There are lots of failures present in the logs, some of which I have not seen before, so some of the processes will be failing and not updating the state correctly as it stands.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 07, 2013 16:54:19 Note: relevant from the logs:

    admin_handler/admin_import_csv_handler.py", line 282, in validate_row
      del(validation_row_copy[field_name])
    KeyError: 'address'
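The traceback points at an unconditional `del` on a working copy of the row. A defensive variant (a sketch mirroring the logged line, not the actual source) tolerates rows that are missing the column entirely:

```python
def drop_field(validation_row_copy, field_name):
    """Remove a field from the working copy of a row if present.
    pop() with a default avoids the KeyError: 'address' seen in the
    logs when an uploaded row lacks that column altogether."""
    validation_row_copy.pop(field_name, None)
    return validation_row_copy
```

A missing required column should then surface as a validation error on the row rather than a crash in the import handler.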

aarontitus commented 11 years ago

From cpw...@gmail.com on March 07, 2013 16:55:23 Note: lots of transaction collisions - unknown cause (atm).

aarontitus commented 11 years ago

From cpw...@gmail.com on March 07, 2013 16:55:52 re #44: I believe not - all tests are on the Sandy and Test incidents.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 08, 2013 02:50:15 FYI- we're going to have a data dump of ~800 work orders from Georgia that may come in as early as tomorrow (Friday).

aarontitus commented 11 years ago

From cpw...@gmail.com on March 11, 2013 15:12:31 When downloading invalid rows to an invalid CSV - would it be better to have the column of error messages prepended to the start of each row rather than the end? Easier to see / harder to miss?
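Prepending the error column could look like the following (a sketch; names are illustrative):

```python
import csv
import io

def annotate_invalid_rows(header, failures):
    """Write failed rows back out as CSV with the error message as the
    FIRST column, so it is prepended rather than trailing off-screen
    to the right.  failures is a list of (row_values, message) pairs."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["error"] + header)
    for row, message in failures:
        writer.writerow([message] + row)
    return out.getvalue()
```

Keeping the original columns intact after the error column means the corrected file can be re-uploaded once the first column is deleted.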