clulab / reach

Reach Biomedical Information Extraction
Other
96 stars 39 forks source link

This is one way to add section names without changes to processors #776

Closed kwalcock closed 1 year ago

kwalcock commented 1 year ago

This latest change brings in the serialization. Just now it expects the document being saved to be a ReachDocument and the document being loaded from a file to be a ReachDocument rather than a plain Document in either case. It isn't type checked. That can be improved if need be. For example, if you are reading in old files, one would need to take into account that the sections aren't there. I haven't yet.

kwalcock commented 1 year ago

With the latest update (assuming tests pass), this might suffice. Since the sections aren't calculated by processors, it seems to me even more justified that processors not keep track of them.

enoriega commented 1 year ago

@kwalcock I like this one. My only concern is that this is not compatible with the serialization format of the original PR and I have a corpus of 122k serialized papers using it. I am still thinking if it would be better to change the serialization code here to make it work as in the original PR or changing the serialized files to be compatible with this format

kwalcock commented 1 year ago

I would deserialize them with the branch you were using and reserialize them by adding a reserialize method that calls serialize and then tacks on the sections much like this branch does--or just stick with the branches.

enoriega commented 1 year ago

@kwalcock I figured that it is trivial to transform the "OG" serialization format to the new one. I tested it and it works. So I think it is safe to go ahead with this PR. I will handle the conversion offline.

I also changed some signatures in FriesOutput and a previously undiscovered bug.

kwalcock commented 1 year ago

I'm not sure all those ReachDocuments in the signatures are necessary. Did it crash at runtime?

kwalcock commented 1 year ago

@enoriega, the import org.clulab.reach.ReachSentence.Converter is supposed to take care of it. Nothing needs to know that a Document is really a ReachDocument or that a Sentence is really a ReachSentence until something tries to access the sections at which time the implicit conversion in object ReachSentence.Converter takes place. That was the idea, at least, to spare you from having to make changes all over the place. I hope it works.

enoriega commented 1 year ago

It didn't crash, but it wasn't getting section names. At the beginning I thought it was because some polymorphism I got wrong, but then I realized there was a lingering overloaded method being used that didn't take the section names. I will use the implicit converted on the section_names branch