clulab / reach

Reach Biomedical Information Extraction
Other
97 stars 39 forks source link

Decouple kbs #750

Closed enoriega closed 3 years ago

enoriega commented 3 years ago

This change is complete. The KB files are now loaded on the fly, no need to pre-process the files any more.

enoriega commented 3 years ago

@MihaiSurdeanu This is ready. The tests should be passing. The lexicon builder change in processors is still pending. I propose handling that change in a new PR when it is implemented.

kwalcock commented 3 years ago

I'm looking at this closely in order to attach it to the modified processors that has some of this already built in. In KBGenerator.convertKB, the information is read from a file and converted to an in-memory representation. That was probably a necessity before, but in order to account for both file- and memory-based representations, some differences have been abstracted away in the meantime. It seems like I should be able to make a variation on the standard KB so that the information can be processed without the intermediate representation. Is this reasoning faulty?

enoriega commented 3 years ago

@kwalcock as long as KBGenerator.processKBFiles is able to read the information from the configuration file (bioresources/src/main/resources/application.conf), the signature of convertKB doesn't matter much.

The main idea is to have a single point of configuration for the KB files (the application.conf file), which is parsed each run

kwalcock commented 3 years ago

IIRC, at the point that .groupBy(_._1) gets called, there are multiple KBs per label, something like 24 total with up to 4 per label and ones with the same label aren't adjacent. They do need to get grouped, but they also need to be in some order, an order which does not seem to be the same as the one achieved by code like this, where the first new label determines the order:

   val actualLabels = kbEntries.map(_.neLabel).distinct

They seemed somehow to be in the right order when they were taken from the map before. Was that by chance?

enoriega commented 3 years ago

@kwalcock you recall correctly. The map was a convenient way for me to iterate over (label, seq of kbs), in the end the InlineLexiconBuilder would iterate over 24 tuples (label, entries) by flattening the map.

The order must be dictated by the priotity property of the entry object. Your observation is right, since the entries are not adjacent, iterating over the map such as that messed up the order. This was masked when I enforced the specific order by hardcoding the sequence of labels (not by chance) and appeared when I removed it.

The latest commit replaced the map by a Seq[(label, entry)], where the client code calls the build method passing the sequence with the tuples already ordered and preserved by the Seq instance. This fixed some of the issues but there still seem to be some others I have yet to fix.

kwalcock commented 3 years ago

The tests below are failing. They kind of look like the tests that failed for me when the KBs weren't in the right order.

[error] Failed tests:
[error]     org.clulab.reach.PolaritySuite
[error]     org.clulab.reach.TestTranscriptionEvents
[error]     org.clulab.reach.TestEntities
[error]     org.clulab.reach.TestOverrides
enoriega commented 3 years ago

Thanks @kwalcock. I will fix them soon

kwalcock commented 3 years ago

Let me know if I can/should help.

kwalcock commented 3 years ago

Thanks. I'm not sure of the syntax of the entries, but some have priority twice. Is that expected?

enoriega commented 3 years ago

There should be zero or one priority fields on each KB block. If one has two priorities, one will override the other, not sure which, but it should be fixed.

I fixed the entry with double priority, let's hope the tests still pass after this

kwalcock commented 3 years ago

If I didn't mess it up at some point in the past, what is now labels used to be derived from a filename, so everything with the same label would have had the same priority, or at least an adjacent priority. (And that is about all the rest of the code supports.) I see now a Simple_chemical with priorities of 5, 14, 15, 16. It seems like their effective priorities would end up being 5a, 5b, 5c, and 5d. Will that be OK?

enoriega commented 3 years ago

With this change, the priority is explicitly set by the priotity field. If two entries have the same priority (priority X), which ever was read first by the code will be Xa and the later Xb.

In this case I didn't update gendChemical, but it happens that the tests still pass.

enoriega commented 3 years ago

@kwalcock I looked again into processKBFiles in your branch and I see how entries with the same label that are not adjacent might end up having a different order than the one specified in the priority field. I don't think this is a very big deal, but it might create confusion in the future and I think it is worth not risking that. I see two ways to avoid it:

  1. Refactor processKBFiles such that it creates an instance of StandardKBSource per KBEntry. I don't know if having multiple sources per label would break the code you have already written
  2. Refactor the config file to make the priorities per label instead of per entry.

I prefer the first, as the second would make it harder for the end user, but I am not sure if it would require a lot of changes in your port. What do you think?

kwalcock commented 3 years ago

@enoriega, I merged your changes into mine and just completed running the unit tests locally and they passed. In the end, each KBEntry does turn into a single ReadSingleStandardKbSource, but it is essential for now that each label has only one priority, and that is guaranteed by the ReachMultiStandardKbSource. So, I think it's good for now.

If one label had two priorities, what would happen if an override KB used that label? It wouldn't know which priority to assign.

enoriega commented 3 years ago

If an entry with any arbitrary label is found in the override KB, it will always have priority over the standard kb. Although the code seems prepared for more than one override kb file, it seems like it is hardcoded to use only one.

kwalcock commented 3 years ago

It doesn't matter much, but IIRC it does get more complicated. The override trumps the standard KB of the same label, but not necessarily those of other labels. Entries can appear in multiple KBs or multiple times in a single KB, etc. In the end it is important that the standard ones have a single priority. After I get processors republished with the changes, the test in the other branch should pass.