Removing duplicates from XML import

josiahseaman commented 9 years ago

Problem: Import: "no airborne delay" imported many times. Possible duplicates in the xml

Solution: Use get_or_create wherever possible. Only assign a name in the cases where an object does not already exist.

@ndh2 I think I can do this but I could definitely use a second pair of eyes on testing.

ndh2 commented 9 years ago

No problem, I'll watch this thread and test the changes when they go in.

josiahseaman commented 9 years ago

Well I started off by just importing the NAADSM sample scenario without changing anything and it infected everything.... so that's clearly not right.

josiahseaman commented 9 years ago

So I enabled detection on both... This clearly isn't going well

Critical Simulation Error: Correct the following error and try again: \ ERROR:../modules/detection_model.c:791:set_params: assertion failed: (errno != ERANGE && errno != EINVAL) This application has requested the Runtime to terminate it in an unusual way. > Please contact the application's support team for more information.

Maybe we should pair on this issue.

ndh2 commented 9 years ago

Sure, can you email the XML file?

josiahseaman commented 9 years ago

Yes, and I'm on Skype.

Okay, I found the first discrepancy. The two detection fields are marked as optional, but the simulation crashes if they're blank:

    detection_probability_for_observed_time_in_clinical = models.ForeignKey(RelationalFunction, related_name='+', blank=True, null=True,
        help_text='Relational function used to define the probability of observing '+wiki("clinical signs", "clinically-infectious")+
                  ' in units of this ' + wiki("production type") + '.', )
    detection_probability_report_vs_first_detection = models.ForeignKey(RelationalFunction, related_name='+', blank=True, null=True,
        help_text='Relational function used to define the probability of reporting '+wiki("clinical signs", "clinically-infectious")+
                  ' in units of this ' + wiki("production type") + '.')

ndh2 commented 9 years ago

Those should be mandatory. If detection is enabled for a production type, neither one can be blank.

josiahseaman commented 9 years ago

Dial in the wayback machine to early 2014. We had to make all these fields optional because they are not required when use_detection is unchecked... Now I remember. I had intended to add another validation layer which I never put in.

ndh2 commented 9 years ago

I'm in the Hotfix branch, I cd to src and type python3.4 ./xml2sqlite.py export_pop.xml parameters.xml debug.sqlite3. Then I run the simulator on debug.sqlite3 and it runs. What's the setup where you're seeing errors?

josiahseaman commented 9 years ago

It runs, but it infects everything. So then I turned detection on and tripped over those two optional fields. I just regenerated the parameters.xml. I'll send you a new one. This seems to work better. So I guess we're in the clear. I just learned that I need a new validation layer for Control Protocols because of the optional requirement.

population_map

If the current state is working without modification, then I can begin to change this and use the sample scenario to check for validity.

josiahseaman commented 9 years ago

@ndh2 I noticed one problem with the test that you ran. You should have crashed on

src/xml2sqlite.py", line 28, in shutil.copy(workspace_path('settings.sqlite3'), workspace_path('settings.bak'))

Which probably means you have two old scenario dbs laying around. We've moved those into the ADSM Workspace folder. Checkout ADSM/settings.py:35 I'm fixing the xml2sqlite source, but you should delete your old settings.sqlite3 file just to be safe.

josiahseaman commented 9 years ago

I now have a useful test for this issue. I've modified xml2sqlite to count all entries of every ScenarioCreator model. Right now, it's creating 76 entries. If we see that number go down and the scenario still runs correctly, then we know we're eliminating duplicates.

josiahseaman commented 9 years ago

Neil, can you check if this is still valid? xml2sqlite: 692

                protocol = ControlProtocol(
                    name = typeName + " Protocol",
                    use_detection = True,
                    detection_probability_for_observed_time_in_clinical = observing,
                    detection_probability_report_vs_first_detection = reporting,
                    test_delay = zeroDelay # placeholder for now, needed because of NOT NULL constraint
                )

My DB makes no such restriction on NOT NULL but perhaps your simulation does. It turns out this workaround shows up 19 times in the code, it might be faster just to make it optional.

ndh2 commented 9 years ago

Maybe there used to be a NOT NULL constraint before the various use_X switches were added? If there is no such constraint in the DB then that line in xml2sqlite can be deleted.

josiahseaman commented 9 years ago

@ndh2 I removed the default lines. Would you please test this new version from HotFix? I don't think I have the facilities to test it properly.

ndh2 commented 9 years ago

@josiahseaman I re-created all the test suite files using the changed xml2sqlite. The only tests currently failing are tests I have to re-work now that the averagePrevalence output is gone (issue #466). So the changes to xml2sqlite seem to be good.

ndh2 commented 9 years ago

May be worth rethinking this because I think it's behind some confusion in #562 and #576

ndh2 commented 9 years ago

I think create_no_duplicates is a good function but not applicable everywhere.

Where it would be beneficial:

People sometimes use identical sets of airborne parameters from all production type pairings, with the intention to create an all-to-all "uncharacterized local spread".
Where people used named PDFs in old scenario files to indicate a pdf that was common across multiple types.

Where it would be harmful:

When applied to PDFs at the lowest level. For example, I might use "fixed value=1" for latent period for one type, subclinical period for another type, and test delay for another type. It's unlikely a modeler would intend those to all be linked together behind-the-scenes so that changing one would change the others.

ndh2 commented 9 years ago

Looking at the code I realized that create_no_duplicates works in a bottom-up fashion: having create_no_duplicates recognize that two DiseaseProgression objects are identical relies of having used create_no_duplicates to create the individual PDFs the DiseaseProgression links to.

My thought is that create_no_duplicates should not be used for now, until its scope can be reduced to situations where it makes sense. It might be annoying for a modeler to have to change similar parameters in multiple places, but it's safer than a database in which changing one parameter will unexpectedly also change unrelated parameters.

josiahseaman commented 9 years ago

I think this issue is already addressed in the user interface portion. Every PDF has a list of 'backlinks' that show what models are referencing that PDF. I believe the only reason you ran into this is that you were editing the database manually, which isn't a typical use case. As a counterexample, pick a complicated NAADSM scenario you have and run it through xml2sqlite and look at how many PDFs it creates in the UI.

ndh2 commented 9 years ago

If this is something that's already come up with the testing group, and they've thought about it and they're happy with the solution of showing backlinks, then there's no need for a new issue from me. Closing.

josiahseaman commented 9 years ago

Well... I wouldn't say the feature has been user tested yet. But it was requested and implemented.

ndh2 commented 8 years ago

Comment from user testing:

Names of the disease states (latent, subclinical, clinical and immune periods) were repeated over and over. For example, the NAADSM name of a latent period was listed as 'chx latent' and in ADSM, it appeared as 'chx latent, chx latent, chx latent, chx latent, chx latent, chx latent'.

ndh2 commented 8 years ago

Will verify in next release candidate build.

ndh2 commented 8 years ago

I ran the updater (Aug 6 2015) and re-tested this import. Now the PDF names do not repeat: I rechecked the specific example above that imported as 'chx latent, chx latent, chx latent, chx latent, chx latent, chx latent' previously, and now it imports just as 'chx latent'.

missyschoenbaum commented 8 years ago

@ndh2 TIm is emailing his example with duplicates. Thanks for taking a look.

ndh2 commented 8 years ago

In email from T. Boyer Aug 31: "Also, the transport delay parameter has the name - No airborne delay, 0 day delay, 0 day delay, 0 day delay 0 day delay etc..."

ndh2 commented 8 years ago

Seen again from M. Meyer:

long probability function name

missyschoenbaum commented 8 years ago

I am looking at an example where there are numerous functions that call one relational function. This still causes some duplicates, but I think this causes a different situation that the one in this ticket. I am going to call this one good.

NAVADMC / ADSM

Removing duplicates from XML import #459