LBNL-ETA / BEDES-Manager

Other
1 stars 0 forks source link

mapping manager should be using v2.4 of BEDES dictionary, not v2.3 #180

Closed TravisWalterLBNL closed 3 years ago

robHitchcock commented 3 years ago

Need to clear all Composite Terms and Application Mappings from the current database, then update the Dictionary of Atomic Terms to V2.4.

TravisWalterLBNL commented 3 years ago

@wizonesolutions Please prioritize this over all other issues.

wizonesolutions commented 3 years ago

@TravisWalterLBNL The existing mappings include an xslx file, but V2.4 is not on https://bedes.lbl.gov/technical-documentation (which appears to be where BEDES V2.2.xlsx came from). Do we have the process used to update the terms documented anywhere? I would assume it involves the scripts in the scripts/ts directory and that there is a built-in reconciliation process (since I don't think it deletes existing terms prior to importing). Probably via the UUID?

TravisWalterLBNL commented 3 years ago

@wizonesolutions Instead of using the .xlsx files at https://bedes.lbl.gov/technical-documentation, you should instead use https://bedes.lbl.gov/bedes-online/all-terms.csv for the atomic terms and https://bedes.lbl.gov/bedes-online/all-lo.csv for the constrained list options. These files should always be the most recent version (currently v2.4).

I'm not sure if the whole process is documented (see issue #106), but there might be different pieces in different places. It might be that the some of the stuff in scripts/ts was used initially by our first developer (mspear-lbl), and something else was used by our second developer (phgupta). Here are some notes I got from our second developer in June 2020: phgupta-mapping-manager-notes.txt. It might be easier for you to just do it from scratch, rather than adapting what previous developers used.

I'm not sure what you mean by reconciliation. I don't think this is necessary. We should first be deleting everything all atomic terms and constrained list options, and everyone's composite terms, application terms, and application mappings, but keeping user accounts. Then we should recreating the atomic terms and list options using the two .csv files mentioned above. We should end up with atomic terms and list options only (no application mappings, no application terms, and no composite terms).

wizonesolutions commented 3 years ago

@TravisWalterLBNL Ah! It wasn't clear to me that we wanted to delete all of those things. That makes it a lot simpler. The existing import scripts use XML; those should be equivalent to the CSV files, right?

TravisWalterLBNL commented 3 years ago

Yes, we want to delete everything except user login info. All the current atomic terms, composite terms, mappings etc. are using v2.3 of the BEDES dictionary, which is not compatible with v2.4, so for now, the only way to upgrade versions is to just delete everything.

The .xml versions of those two files should be equivalent to the .csv versions, so feel free to use those.

wizonesolutions commented 3 years ago

@TravisWalterLBNL Status-wise on this: I have to enhance the XMLTermLoader to be able to create new data. As it is, it functions as a term updater. Only the XLSX importer can create new data, but we don't have a v2.4 version of that.

We were probably going to have to figure out a sustainable approach to this eventually, anyway, so may as well do it now.

By the way, v2.2 of the XML included an Application property with each term/list option, even if it was N/A. This seems to be removed in v2.4. Should I remove it from the importer altogether or make it optional?

TravisWalterLBNL commented 3 years ago

In general, I think a term creator that will be used to replace the entire version X dictionary with version X+1 will be used way more commonly than a term updater, so I would focus much more on a creator than an updater.

Agreed with doing it now the right way so we don't have to do it later (see issue #106).

I'm not sure I know what an Application property for a term or list option would be. Other than N/A, what were some typical values of this Application property? My guess is that if it is not in the v2.4 .xml or .csv downloads, then we probably don't need it.

robHitchcock commented 3 years ago

The Application property was originally used to indicate where the term came from (e.g., ESPM), or where it applied. We decided to drop this property since it was not that useful and not being updated.

wizonesolutions commented 3 years ago

Yes, it's synonymous with Definition Source. Great; I'll drop the handling code.

On Tue, May 18, 2021, 19:23 Rob Hitchcock @.***> wrote:

The Application property was originally used to indicate where the term came from (e.g., ESPM), or where it applied. We decided to drop this property since it was not that useful and not being updated.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/LBNL-ETA/BEDES-Manager/issues/180#issuecomment-843378495, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACUANHOX62MYAPZ5XK52S3TOKPA3ANCNFSM44TKEQ7Q .

wizonesolutions commented 3 years ago

Note: The site seems to generate invalid XML for empty fields, e.g.

  <node>
    <Content-UUID>b672bdf7-f4ac-4938-81f4-5fa872ecfb80</Content-UUID>
    <URL>https://bedes.lbl.gov/node/b672bdf7-f4ac-4938-81f4-5fa872ecfb80</URL>
    <Term>Preferred Contact</Term>
    <Definition></Definition>
    <Data-Type>Constrained List</Data-Type>
    <Unit-of-Measure>None</Unit-of-Measure>
    <Category>Global Terms</Category>
    <Sector></Sector>
    <Updated-Date>2020-09-17 19:07:20</Updated-Date>
  </node>

They should be omitted. I'll fix these manually, but we should fix those at source too at some point.

TravisWalterLBNL commented 3 years ago

@robHitchcock, correct me if I'm wrong, but I'm pretty sure both those two .xml files (and the two .csv files) are generated automatically by the Drupal site, which is maintained by @caseycobb. So, if there is an error with the .xml creation, he would need to fix it.

I don't know .xml, but these empty fields do seem to be handled correctly in the .csv files.

wizonesolutions commented 3 years ago

Actually, the previous import worked this way too, and it imported alright. I'll see if I can tweak things to work.

robHitchcock commented 3 years ago

Correct, the scripts that generate the CSV/XML docs are related to the Drupal site that Casey and friends have been maintaining.

wizonesolutions commented 3 years ago

I got the XML import working. There were no blank definitions in the past, so I loosened the validation.

I have to get the list option importer fixed now, and then I'll get a staging copy set up for you to poke through and double-check before we update prod.

On Wed, May 19, 2021, 19:00 Rob Hitchcock @.***> wrote:

Correct, the scripts that generate the CSV/XML docs are related to the Drupal site that Casey and friends have been maintaining.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/LBNL-ETA/BEDES-Manager/issues/180#issuecomment-844293263, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACUANGUAI6TNBGMRDIPPIDTOPVC3ANCNFSM44TKEQ7Q .

wizonesolutions commented 3 years ago

@TravisWalterLBNL I'm encountering a new issue with the 2.4 dataset: several list options have the same UUID.

This doesn't appear to have been the case in 2.2, and the database structure isn't designed for it. And I think that's for good reason; if the semantically same constrained list option can be associated with multiple terms, then it is not "universally-unique." That is, if we look it up by UUID, we don't know if we are getting Term A's list option or Term B's, so it's not really unique.

Can you fix things at source so each term-list-option has a unique UUID? If not, we'll need a longer discussion.

The source data is large enough that I didn't catch this issue until I thought I was done and started encountering duplicate unique key errors from Postgres.

I confirmed that the issue exists in both the CSV and XML exports, so it's at source.

Let me know your thoughts.

TravisWalterLBNL commented 3 years ago

@robHitchcock, let me know if I am not remembering correctly.

I think we intentionally chose to allow a particular list option to be an element in more than one list. For example, there is a term called "State" and another term called "Credential State", and both should just be a list of all the states. Similarly, there are some list options like "Other" and "Unknown" that should be a list option in virtually every list. It seemed silly to us to keep hundreds of copies of the same list option, and instead elected to allow multiple constrained lists to contain the same list option.

So, it wouldn't be easy to make the list options have unique UUIDs. We could consider doing it if it really needs to be done, but we would have to consider the side effects. I'd be happy to get on a call to discuss. Let me know.

You said the UUIDs were unique in v2.2, and that they aren't unique in v2.4, but what about v2.3? Isn't v2.3 using the non-unique UUIDs? Things seems to be running fine with v2.3 and its non-unique UUIDs, so why can't we do whatever we did with v2.3 for v2.4? Am I misunderstanding something?

robHitchcock commented 3 years ago

We should loop @caseycobb into this discussion. We did change some handling of list options on the Drupal site. For example, we included the generic options like "Unknown" and "Other" in all constrained lists. I believe we also made it so that a list option could be referenced by multiple constrained lists. So, this would likely lead to the non-unique UUIDs.

robHitchcock commented 3 years ago

So this complicates checking for duplicate composite terms that use the same list option from different constrained lists. It means that you need to check the constrained list as well as the list option to detect this case.

wizonesolutions commented 3 years ago

The setup scripts for the app are using 2.2. I'm not sure it ever used 2.3 (unless there are no traces of the importer work on that), so I'm not sure the code is modified for that.

What Rob says is right. We will have to store the constrained list ID in composite terms if we don't already and always check list ID + option name. We may as well remove the UUID field from list options.

That makes the scope of this bigger. We may want to think about it and explore the implications first. I'll send an email to discuss.

On Fri, May 21, 2021, 19:57 Rob Hitchcock @.***> wrote:

So this complicates checking for duplicate composite terms that use the same list option from different constrained lists. It means that you need to check the constrained list as well as the list option to detect this case.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/LBNL-ETA/BEDES-Manager/issues/180#issuecomment-846136873, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACUANHST3ZU3DVKP5BEYHDTO2NHPANCNFSM44TKEQ7Q .

wizonesolutions commented 3 years ago

@TravisWalterLBNL Deployed to live and ready for you to validate. app_term.description column fixed, missing Other unit added, BEDES dictionary 2.4 imported without errors. At a cursory glance, things look OK to me. If all checks out, I think we're done with this issue.

TravisWalterLBNL commented 3 years ago

There is still something wrong with the description for the "Project Name" term:

Screen Shot 2021-06-23 at 11 25 50 AM

Since the description looks fine at https://bedes.lbl.gov/bedes-online/project-name, maybe something is going wrong during import?

TravisWalterLBNL commented 3 years ago

Same problem with "Control Communication Protocol". Is there a systematic way to find these? I'm just finding them one at a time by chance, and am not confident I have found them all.

TravisWalterLBNL commented 3 years ago

Same problem with "Diversion Rate".

wizonesolutions commented 3 years ago

I think I saw that the XML importer is expecting a specific set of opening tags in the underlying HTML. Some descriptions might not match what it expects, since they are HTML fields in Drupal. Do we expect the formatting to match in BEDES Manager, or do we just want the text? That might be something we should fix on the Drupal export side if we can.

Do you want to roll back live while we fix this or just leave it as is and delete/re-import the dictionary once fixed? Since no one can really use it meaningfully until we have the dictionary imported to our satisfaction. But let me know. I'll leave it for now.

On Wed, Jun 23, 2021, 22:13 Travis Walter @.***> wrote:

Same problem with "Diversion Rate".

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/LBNL-ETA/BEDES-Manager/issues/180#issuecomment-867126543, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACUANBJNZVCS3XYETBXAG3TUI55HANCNFSM44TKEQ7Q .

TravisWalterLBNL commented 3 years ago

The mapping manager should only care about the text exported from Drupal, not the formatting. Whether that formatting should be removed in the Drupal content, or ignored in the Drupal export, or ignored by the mapping manager import is up to you.

No need to roll back live. If we make these fixes on live, will it affect existing composite terms or application mappings? I'm guessing no, since we would only be modifying the descriptions?

wizonesolutions commented 3 years ago

It would be easiest to wipe the terms again if you can reimport them, yes. Then I won't have to modify the import script too much, so we'll save time. We might want to support updating in the future for things like this, but it's probably not a frequent occurrence yet.

On Thu, Jun 24, 2021, 04:04 Travis Walter @.***> wrote:

The mapping manager should only care about the text exported from Drupal, not the formatting. Whether that formatting should be removed in the Drupal content, or ignored in the Drupal export, or ignored by the mapping manager import is up to you.

No need to roll back live. If we make these fixes on live, will it affect existing composite terms or application mappings? I'm guessing no, since we would only be modifying the descriptions?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/LBNL-ETA/BEDES-Manager/issues/180#issuecomment-867275675, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACUANECWNHQHJAHSOOJXO3TUKHETANCNFSM44TKEQ7Q .

TravisWalterLBNL commented 3 years ago

OK, go ahead and wipe and I will reimport. Let's worry about allowing updating later.

wizonesolutions commented 3 years ago

@TravisWalterLBNL Can you validate on stage? I'll do the same changes on prod if it looks good. Note that we switched to Heroku Postgres to make staging faster, so it should be nicer to use now.

https://bedes-mapping-manager-staging.herokuapp.com/bedes-term/8bfb2292-d621-422f-86ae-407ac895ea67 looks good to me now, as well as the other 19 (I was able to find 20 instances with a regex search in the XML export.)

I have a workaround/hack in the code to handle all the permutations of tags now. But we should fix it on the export side and then remove the hack. Once this is validated, let's open a follow-up to remove the hack in BEDES Manager once the export is fixed.

TravisWalterLBNL commented 3 years ago

Staging looks good to me.

wizonesolutions commented 3 years ago

Live should be good now. Give it a look when you get a chance.

wizonesolutions commented 3 years ago

(accidental closure)

TravisWalterLBNL commented 3 years ago

Looks good on production, so closing this issue. @wizonesolutions, please make sure that any temporary workarounds either make it into the codebase, or are documented in another issue.

wizonesolutions commented 3 years ago

Yeah, the workaround for the file format is in the codebase.

On Fri, Jun 25, 2021, 20:53 Travis Walter @.***> wrote:

Looks good on production, so closing this issue. @wizonesolutions https://github.com/wizonesolutions, please make sure that any temporary workarounds either make it into the codebase, or are documented in another issue.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/LBNL-ETA/BEDES-Manager/issues/180#issuecomment-868767582, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACUANBQC2YR4J6WLPW5BXTTUTGBJANCNFSM44TKEQ7Q .