CrisisCleanup / crisiscleanup-2

[OLD] This version of the codebase was retired on March 27, 2020.
Open Source Collaborative Disaster Recovery and Cleanup
https://www.crisiscleanup.org

Case Number Collision #406

Closed aarontitus closed 6 years ago

aarontitus commented 7 years ago

About 5% of the cases have duplicate case numbers. The duplicates are always created within a few milliseconds of one another. This is obviously a serious issue.

arroyoDev commented 7 years ago

Biggest concern: did we lose 5% of our work orders because they were overwritten by the collision, or is the system creating duplicates?

aarontitus commented 7 years ago

We do not lose any data; it's just that John Smith and Jane Jensen share case number "J5027." I'm sure it's just sloppy code. If I had to guess, something like this is happening: the Ruby/Rails code queries the database for all of the existing case numbers, then parses them in the application. It finds the last number, adds one, and uses it. That operation takes a while, and in the meantime someone else is doing exactly the same operation on the same (now outdated) data. Again, I haven't looked at the code; that's just my hunch. I think most of the solution would be to perform the entire operation in a single SQL statement. Here is a hacky example that should give you the gist. I have used this code thousands of times, and I don't think it has produced a collision yet:


```sql
SET new.event_id = '61';
SET old.event_id = '60';  -- missing from the original snippet, but read in the WHERE clause below
SET old.case_num = 'J21351';
SET new.case_label = 'K';

UPDATE legacy_sites
SET legacy_event_id = CAST(current_setting('new.event_id') AS integer),
    case_number = current_setting('new.case_label') || subquery.next_case_number
FROM (
  SELECT MAX(CAST(nullif(substring(case_number from 2 for char_length(case_number)), '') AS integer)) + 1 AS next_case_number
  FROM legacy_sites
  WHERE legacy_event_id = CAST(current_setting('new.event_id') AS integer)
) AS subquery
WHERE case_number = current_setting('old.case_num')
  AND legacy_event_id = CAST(current_setting('old.event_id') AS integer);
```

TimBanks commented 7 years ago

One solution to this would be to have the case number be given back to the application by the database instead of the other way around. The case number should be an auto-incremented primary key in postgres, so that upon performing an insert, the case number can be returned as part of the response.

Similar to: https://stackoverflow.com/questions/2944297/postgresql-function-for-last-inserted-id

INSERT INTO legacy_sites (lastname,firstname) VALUES ('Smith', 'John') RETURNING case_number;

TimBanks commented 7 years ago

@aarontitus btw, that's exactly what the code is doing:

      if self.legacy_event_id && self.case_number.blank?
        # Read all existing case numbers, take the max, add one -- this
        # read-then-write window is where two requests can race.
        count = Legacy::LegacySite.where(legacy_event_id: self.legacy_event_id).pluck(:case_number).map { |x| x[/\d+/].to_i }.max
        count = 0 if count.nil?
        event_case_label = Legacy::LegacyEvent.find(self.legacy_event_id).case_label
        self.case_number = "#{event_case_label}#{count + 1}"
      end
aarontitus commented 7 years ago

@TimBanks Cool. I'm suggesting that because the Next Case Number calculation is being done on the Ruby/Rails layer, rather than the PostgreSQL layer, it introduces a few-millisecond delay during which two Rails operations can return the same result. I think the collision would be solved if the operation were moved to the PostgreSQL layer in a stored procedure. Does that make sense?

aarontitus commented 7 years ago

@TimBanks While I agree the case number should be auto-incrementing, it cannot be a primary key, because (for the sake of ease of human use) the number resets for each disaster. Field experience from volunteers has demonstrated that the Alpha + Number format works very well. And during a small disaster with 100 work orders (like a tornado), starting the first case number as "M74802" is disorienting to local volunteers. With 70 disasters, we've now cycled through the alphabet a few times, so there are several "A132" case numbers, for example. This is by design. We try to mold the system to the local disaster, rather than molding the disaster response to the tool.

TimBanks commented 7 years ago

Absolutely - I think we're suggesting the same thing in different ways. An auto-incrementing primary key is a native function in postgres to allow it to generate the number on insert, and using the RETURNING function will allow the application to read it from postgres on successful insert. This would eliminate the race condition from the application. :)
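
In sketch form (assuming, purely for illustration, that case_number were a plain integer column rather than the "A132"-style text it actually is, and using an invented sequence name):

```sql
-- Hypothetical: let Postgres assign the number at insert time.
CREATE SEQUENCE legacy_sites_case_number_seq;

ALTER TABLE legacy_sites
  ALTER COLUMN case_number SET DEFAULT nextval('legacy_sites_case_number_seq');

-- Two concurrent inserts can never draw the same value from a sequence,
-- and RETURNING hands the generated number back to the application.
INSERT INTO legacy_sites (lastname, firstname)
VALUES ('Smith', 'John')
RETURNING case_number;
```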

aarontitus commented 7 years ago

@TimBanks Generating the case number from the primary key might be a possibility during relative calm, when only one disaster is going at a time. But those times are very rare. There are usually at least two disasters, and sometimes four, operating at all times. Let's say the prefix is "M," and the first case is legacy_sites.id = 72075. Then legacy_sites.case_number = 'M'||legacy_sites.id-72074. This arrangement would nicely increment M1, M2, M3, M4, etc. Until a few days later we open up a new disaster, "N," in a different part of the country. Suddenly N takes off and adds thousands of work orders in a few days, following the same pattern: N1, N2, N3, N4, all the way up to N1000, in the period of about 48 hours (it happens). Meanwhile, small disaster M's case numbers suddenly lurch from M4 to M1001, causing some serious concern and confusion on the ground.

TimBanks commented 7 years ago

That does make sense. I wonder if it would be better to have a table per disaster?

aarontitus commented 7 years ago

Ha! I was writing that very suggestion as you commented: Alternatively, we could introduce a new numbering table for each disaster, each with its own unique, auto-incrementing primary key. That would solve the problem, but create a pretty ugly schema. It seems that storing the case_number calculation as a PostgreSQL stored procedure will solve the race problem without an exploding schema.
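
A rough sketch of that stored-procedure approach, using a single per-event counter table rather than one table per disaster (event_case_counters is hypothetical, and legacy_events is assumed to be the table behind Legacy::LegacyEvent):

```sql
CREATE TABLE event_case_counters (
    legacy_event_id integer PRIMARY KEY,
    last_number     integer NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION assign_case_number() RETURNS trigger AS $$
DECLARE
    n     integer;
    label text;
BEGIN
    IF NEW.case_number IS NULL THEN
        -- The row lock taken by this UPDATE serializes concurrent inserts
        -- for the same event, so no two of them see the same counter value.
        UPDATE event_case_counters
           SET last_number = last_number + 1
         WHERE legacy_event_id = NEW.legacy_event_id
        RETURNING last_number INTO n;

        SELECT case_label INTO label
          FROM legacy_events
         WHERE id = NEW.legacy_event_id;

        NEW.case_number := label || n;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER set_case_number
BEFORE INSERT ON legacy_sites
FOR EACH ROW EXECUTE PROCEDURE assign_case_number();
```

Opening a new disaster would then mean inserting one counter row for it, and numbering starts over at 1 per event, matching the "A132"-reuse behavior described above.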

aarontitus commented 7 years ago

I know that @zenjyn has done some work on an extensible schema, even to the point of treating each disaster as its own app with its own database. That full-on abstraction may be necessary at some point when we are gathering gobs of real-time data through mobile apps, but it's probably overkill at the moment.

tsanders commented 7 years ago

The fundamental problem here is that the Rails app is not using DB transactions (along with virtually no DB constraints). What we are seeing here with case number generation is a standard transactional problem that already has solutions. Creating tables per disaster would be difficult to manage.
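
For the constraints half of that point, a minimal sketch (constraint name invented): a unique constraint makes a collision fail loudly instead of silently storing a duplicate, and an advisory lock can serialize the existing read-then-write per event:

```sql
ALTER TABLE legacy_sites
  ADD CONSTRAINT legacy_sites_event_case_number_key
  UNIQUE (legacy_event_id, case_number);

-- Serialize the existing MAX()+1 logic per event for the duration
-- of the transaction:
BEGIN;
SELECT pg_advisory_xact_lock(61);  -- 61 = the event id, as in the example above
SELECT COALESCE(MAX(CAST(substring(case_number from 2) AS integer)), 0) + 1
  FROM legacy_sites
 WHERE legacy_event_id = 61;
-- ...application inserts the new row with the returned number...
COMMIT;
```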

TimBanks commented 7 years ago

@tsanders aren't transactions limited to a single DB connection? If so, at scale, is the app still using a single connection or multiple connections? I imagine ActiveRecord db pools would work, but I've never used them with Rails 2.x.

cyphrsonic commented 6 years ago

I wonder how simple it would be to move this case assignment logic into a stored procedure? Something like this:

1. `CALL sp_CreateCase(all_args_needed_for_create, foo, bar, baz)`
2. This procedure can return the case number or the primary key so that any returns from the creation can be easily called via a select.

I have no experience with Postgres, but in MySQL this stored procedure would be automatically transactionalized, and writing it as an atomic insert would prevent duplicate ids from being generated since the engine would handle the inserts sequentially.