Fixed issues in the spacy component

m0baxter commented 2 years ago

The spacy component can fail in a few ways.

If the head or tail text contains special regex characters such as ".", the matching can be unpredictable. Fixed by using str.find instead.
Sometimes the model capitalizes words in the head or tail. Fixed by lowercasing all text before searching.
Character encodings of different sizes can cause doc.char_span to return None in certain cases. Fixed by using expand for alignment_mode to minimize the problem and by catching the cases where the span is still None.
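
To illustrate the first two fixes, here is a minimal sketch rather than the code in the pull request. The helper name find_span and the sample sentence are assumptions made up for this example. It shows why an unescaped regex is fragile and how a lowercased str.find avoids both problems.

```python
import re


def find_span(text, mention):
    """Return (start, end) character offsets of mention in text, or None.

    Hypothetical helper for illustration. It does a literal,
    case-insensitive search, so "." and other regex metacharacters
    in the mention are matched as plain characters.
    """
    # Assumes lowercasing keeps the string length unchanged, which holds
    # for ASCII but not for every Unicode character (e.g. "İ").
    start = text.lower().find(mention.lower())
    if start == -1:
        return None
    return start, start + len(mention)


text = "Officials said artifacts are U.S. government property."

# Used as a regex pattern, each "." in "U.S." matches any character,
# so the pattern also matches an unrelated string.
regex_false_positive = re.search("U.S.", "UPSX") is not None

# The model may change capitalization; the literal search still finds it.
span = find_span(text, "u.s. Government")
missing = find_span(text, "John F. Kennedy")
```

The offsets returned here are the kind of values that would then be passed to doc.char_span(start, end, alignment_mode="expand"). As noted above, that call can still return None, so the result needs to be checked before it is used.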

LittlePea13 commented 2 years ago

Thank you very much for the pull request. However, I am not sure about:

            if ((head_index == -1) or (tail_index == -1)):
                continue

Since the model is auto-regressive, there are cases (like the capitalization you mention) where the text suffers small modifications and cannot be found in the input. However, those triplets may still be interesting as part of the output, and they would be lost this way. So I am not sure that simply discarding them is a good idea.

m0baxter commented 2 years ago

I think it depends on whether you are interested in precision or recall.

As an example, I am interested in finding triples in news articles. One article I tried to parse:

Artifact from Space Shuttle Challenger found on ocean floor, NASA confirms,  WMUR reports. Discovery comes nearly 37 years after tragedy.

Discovery comes nearly 37 years after tragedy.

A piece of the Space Shuttle Challenger was recently found off the coast of Florida, NASA announced in a news release Thursday.

The artifact was discovered by a History Channel documentary crew diving for wreckage of a World War II - era aircraft, NASA officials wrote.

Divers found a large object on the seafloor, and given its proximity to Florida's Space Coast, members of the documentary team decided to contact NASA, whose leaders reviewed the footage and confirmed the object came from the Challenger.

"This discovery gives us an opportunity to pause once again, to uplift the legacies of the seven pioneers we lost, and to reflect on how this tragedy changed us. At NASA, the core value of safety is - and must forever remain - our top priority, especially as our missions explore more of the cosmos than ever before," NASA administrator Bill Nelson said in a statement.

The discovery comes nearly 37 years after the shuttle exploded, killing New Hampshire teacher Christa McAuliffe and six other crew members: Dick Scobee, Mike Smith, Ronald McNair, Ellison Onizuka, Judith Resnik and Gregory Jarvis.

"NASA currently is considering what additional actions it may take regarding the artifact that will properly honor the legacy of Challenger's fallen astronauts and the families who loved them," officials wrote in the news release.

Officials said all space shuttle artifacts are U.S. government property and anyone who finds any such objects should email ksc - public - inquiries@mail.nasa.gov to "arrange for return of the items."

NASA officials said a show about the so - called Bermuda Triangle will depict the crew's discovery of the Challenger artifact. It airs on Nov. 22.

produces this triple:

{'head': 'John F. Kennedy', 'type': 'significant event', 'tail': 'September 11, 2001'}

which is neither true, nor supported by the text.

If the goal of this exercise is to create something that extracts information from text, I would argue that it is better to produce only things that are directly supported by that text. While losing facts to slight spelling differences is unfortunate, producing facts that cannot be verified without an existing database of information makes this far less useful as a system for extracting information.
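
The trade-off can be sketched as a filter with a switch. This is only an illustration under stated assumptions: the function name supported_triplets, the keep_unmatched flag, and the "grounded" key are hypothetical and not part of the component, and the second triple below is a made-up example rather than real model output.

```python
def supported_triplets(text, triplets, keep_unmatched=False):
    """Keep triplets whose head and tail both occur literally in text.

    Hypothetical helper. With keep_unmatched=True nothing is dropped
    (favoring recall); each triplet is only annotated with whether it
    was found in the text.
    """
    lowered = text.lower()
    kept = []
    for triplet in triplets:
        head_index = lowered.find(triplet["head"].lower())
        tail_index = lowered.find(triplet["tail"].lower())
        grounded = head_index != -1 and tail_index != -1
        if grounded or keep_unmatched:
            kept.append({**triplet, "grounded": grounded})
    return kept


article = "NASA administrator Bill Nelson said in a statement."
triplets = [
    # The unsupported triple reported above.
    {"head": "John F. Kennedy", "type": "significant event",
     "tail": "September 11, 2001"},
    # Made-up triple for illustration; not actual model output.
    {"head": "Bill Nelson", "type": "employer", "tail": "NASA"},
]

precise = supported_triplets(article, triplets)
everything = supported_triplets(article, triplets, keep_unmatched=True)
```

With the default setting only the triple grounded in the text survives, while keep_unmatched=True returns both and leaves the decision to the caller.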

LittlePea13 commented 2 years ago

I do not necessarily agree in general, as one may want to explore all the predictions made by the model. However, for the spaCy component it is true that most users have a use case similar to the one you describe, and anyone who wants raw predictions can use the HF pipeline or PyTorch directly. Thanks for the pull request, and I hope REBEL has been useful for you.

Babelscape / rebel

Fixed issues in the spacy component #48