clulab / reach

Reach Biomedical Information Extraction

Added support to extract section names #775

Closed enoriega closed 1 year ago

enoriega commented 1 year ago

This PR adds preliminary support for extracting section names from the XML structure and storing them in the Document object. It relies on an as-yet-unpublished version of processors to support serialization.

FriesOutput has been updated to export section names to the JSON files.
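
For readers unfamiliar with the idea, here is a minimal sketch of pulling section names out of an XML article and exporting them alongside the text as JSON. It is not the Reach or FriesOutput code: the tag names (`sec`, `title`, `p`), the sample document, the `extract_sections` and `to_json` helpers, and the JSON field names are all assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical input: a tiny JATS-like article. Tag names are assumptions.
SAMPLE_XML = """<article>
  <body>
    <sec><title>Introduction</title><p>RAS activates RAF.</p></sec>
    <sec><title>Results</title><p>MEK phosphorylates ERK.</p><p>ERK binds ELK1.</p></sec>
  </body>
</article>"""


def extract_sections(xml_text):
    """Return (section_name, paragraph_text) pairs in document order."""
    root = ET.fromstring(xml_text)
    pairs = []
    for sec in root.iter("sec"):
        # A section without a <title> gets an empty name.
        title = sec.findtext("title", default="").strip()
        # Only direct <p> children, so nested sections keep their own names.
        for p in sec.findall("p"):
            pairs.append((title, "".join(p.itertext()).strip()))
    return pairs


def to_json(pairs):
    """Serialize the pairs as a JSON array (field names are illustrative)."""
    records = [{"section": name, "text": text} for name, text in pairs]
    return json.dumps(records, indent=2)


pairs = extract_sections(SAMPLE_XML)
print(to_json(pairs))
```

Each paragraph carries the name of the section that directly contains it, which is roughly the information a downstream JSON consumer would need to filter extractions by section.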

enoriega commented 1 year ago

@kwalcock This is the code that uses the new features from processors.

MihaiSurdeanu commented 1 year ago

This LGTM, but I'll defer to @kwalcock.

enoriega commented 1 year ago

@kwalcock We will have to work out a processors release with the necessary changes before this PR can move forward.

kwalcock commented 1 year ago

I'm looking at the addition of sections to Sentences now.

kwalcock commented 1 year ago

@MihaiSurdeanu, @enoriega, please see #776 or #777. I still wanted to try adding the custom information after END_OF_DOCUMENT for serialization. It looked feasible.
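
As a rough illustration of that idea (not the processors serializer or its actual format), the sketch below writes application-specific lines after an end-of-document marker. A generic reader stops at the marker and never sees the extras, while an application-aware reader keeps reading. The marker string `EOD`, the `SEC` line prefix, and all function names are assumptions made for this example.

```python
import io

END_MARKER = "EOD"    # placeholder end-of-document marker (assumed)
SECTION_TAG = "SEC"   # hypothetical prefix for the custom trailing lines


def write_document(out, sentences, sections=None):
    """Write one sentence per line, the end marker, then optional extras."""
    for sentence in sentences:
        out.write(sentence + "\n")
    out.write(END_MARKER + "\n")
    # Application-specific data goes after the marker.
    for name in sections or []:
        out.write(SECTION_TAG + "\t" + name + "\n")


def read_core(inp):
    """Generic reader: stops at the marker and ignores anything after it."""
    sentences = []
    for line in inp:
        line = line.rstrip("\n")
        if line == END_MARKER:
            break
        sentences.append(line)
    return sentences


def read_with_sections(inp):
    """Application-aware reader: also consumes the trailing section lines."""
    sentences = read_core(inp)
    prefix = SECTION_TAG + "\t"
    sections = [
        line.rstrip("\n")[len(prefix):]
        for line in inp
        if line.startswith(prefix)
    ]
    return sentences, sections


buf = io.StringIO()
write_document(buf, ["RAS activates RAF.", "MEK phosphorylates ERK."],
               ["Introduction", "Results"])
buf.seek(0)
print(read_with_sections(buf))
```

The appeal of this layout is that the core format stays unchanged for existing readers. The sketch does not escape sentences that happen to equal the marker, which a real format would have to handle.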

enoriega commented 1 year ago

This PR now contains the code from #776, which removes the application-specific changes from processors by moving them into Reach.

enoriega commented 1 year ago

@kwalcock I restored the Converter import that I shouldn't have removed in the first place. I tested this locally and it works well. Can you take a quick look? If everything checks out and the tests pass, I'll merge to master.

enoriega commented 1 year ago

@kwalcock Thanks. There is only one place where that document serializer is being used, so we are safe there. Let's see whether the changes break anything.