CentreForDigitalHumanities / tscan

T-scan: an analysis tool for Dutch texts to assess the complexity of the text, based on original work by Rogier Kraf
GNU Affero General Public License v3.0

Frog output isn't received: Empty result for FoLiaParsing #85

Open oktaal opened 11 months ago

oktaal commented 11 months ago

With some (probably complicated) texts, T-Scan doesn't receive the Frog output, which results in an empty document.

It starts here:

https://github.com/UUDigitalHumanitieslab/tscan/blob/1ae5af3f22929366b3d58a66dddb324153e76d2c/src/tscan.cxx#L3126-L3130

A connection is made with Frog, and the plain text is sent to Frog over this connection.

The text is then closed using EOT and Frog starts processing it. The response should then be read and written to result.

  client.write( "\nEOT\n" );
  string result;
  string s;
  while ( client.read( s ) ) {
    if ( s == "READY" )
      break;
    result += s + "\n";
  }

https://github.com/UUDigitalHumanitieslab/tscan/blob/1ae5af3f22929366b3d58a66dddb324153e76d2c/src/tscan.cxx#L3194C6-L3201

result will then contain the FoLiA XML as plain text, which can be loaded using:

    doc = new folia::Document();
    try {
      doc->readFromString( result );
#ifdef DEBUG_FROG
      cerr << "finished" << endl;
#endif
    }

https://github.com/UUDigitalHumanitieslab/tscan/blob/1ae5af3f22929366b3d58a66dddb324153e76d2c/src/tscan.cxx#L3210C1-L3216

The problem is however that the result is empty! Now this could be a problem with Frog.

So let's use Telnet:

$ telnet localhost 7001
Dit is een hele lang zin die niet gepakt wordt, dit is niet de zin, vraag mij om de zin en dan kan ik hem doorsturen.

EOT

And you get the response as expected:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="tscan" generator="libfolia-v2.15" version="2.5.1">
  <!-- BIG BLOB OF XML -->
</FoLiA>
READY

It may take a little while so it could perhaps be some race/timing issue.

I've tried giving it some more time, e.g. by adding a sleep before reading. I've also tried sleeping and then reading again after an initial attempt came back empty. Here's my attempt at extending the line while ( client.read( s ) ) { into this:

  client.setNonBlocking();
  // wait at least a little bit for the response
  usleep( 100 );
  while ( true ) {
    if ( !client.read( s ) ) {
      if ( !result.empty() ) {
        // done reading
        break;
      }
      else {
        cerr << "giving Frog some more time... ";
        sleep( 30 );
        if ( !client.read( s ) ) {
          // done reading
          cerr << "still nothing" << endl;
          break;
        }
        else {
          cerr << s.size() << endl;
        }
      }
    }

Anyway, it doesn't work, or I wouldn't be writing this issue. I'm going on leave for a month and I'm not quite sure how to fix this. Possibly it's a bug in the socket implementation; a work-around could be to simply use a different socket library. A super low-tech approach would be to call Frog from the command line, which should also work but is really slow because Frog has to start up for every document. Maybe this could be moved to the Python service, which would then deal with it (and prepare all the input as FoLiA XML files).
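For comparison, here is a sketch of waiting for the response with a deadline instead of fixed sleeps. It is an assumption-laden illustration, not the ticcutils code: it works on a raw POSIX file descriptor with poll() and read(), the function name read_response is hypothetical, and it assumes the server sends READY on a line of its own, as in the telnet session above.

```cpp
#include <poll.h>
#include <unistd.h>

#include <cerrno>
#include <chrono>
#include <string>

// Reads from `fd` until a line "READY" arrives, the peer closes the
// connection, or `timeout` expires. Returns true only if READY was seen;
// `result` then holds everything before it, otherwise whatever was received.
bool read_response( int fd, std::string& result, std::chrono::milliseconds timeout ) {
  using clock = std::chrono::steady_clock;
  const auto deadline = clock::now() + timeout;
  const std::string sentinel = "READY\n";
  std::string buf;
  std::size_t scan_from = 0;  // start of the part of buf not yet searched
  char chunk[4096];
  while ( true ) {
    const auto left = std::chrono::duration_cast<std::chrono::milliseconds>( deadline - clock::now() );
    if ( left.count() <= 0 ) break;  // deadline passed
    pollfd p{ fd, POLLIN, 0 };
    const int rc = poll( &p, 1, static_cast<int>( left.count() ) );
    if ( rc < 0 && errno == EINTR ) continue;  // interrupted, retry
    if ( rc <= 0 ) break;                      // timeout or poll error
    const ssize_t n = read( fd, chunk, sizeof chunk );
    if ( n < 0 && errno == EINTR ) continue;
    if ( n <= 0 ) break;  // peer closed the connection, or read error
    buf.append( chunk, static_cast<std::size_t>( n ) );
    // Accept the sentinel only at the start of a line.
    for ( std::size_t pos = buf.find( sentinel, scan_from );
          pos != std::string::npos;
          pos = buf.find( sentinel, pos + 1 ) ) {
      if ( pos == 0 || buf[pos - 1] == '\n' ) {
        result = buf.substr( 0, pos );
        return true;
      }
    }
    // A sentinel split across two reads can start in the last few bytes.
    scan_from = buf.size() > sentinel.size() ? buf.size() - sentinel.size() + 1 : 0;
  }
  result = buf;
  return false;
}
```

A false return with a non-empty result would point at a response cut off midway, while a false return with an empty result would mean nothing arrived before the deadline, which are the two cases the sleep-based attempt cannot distinguish.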

oktaal commented 11 months ago

Work-around: fc28a395ad2271ff4e02fe28a69437bff5523661

proycon commented 11 months ago

Does this issue occur with the latest ticcutils & frog releases (or even the latest development versions)?

Compiling tscan with DEBUG_FROG might give some further insights, if you haven't tried that yet.

> So let's use Telnet:
>
> $ telnet localhost 7001
> Dit is een hele lang zin die niet gepakt wordt, dit is niet de zin, vraag mij om de zin en dan kan ik hem doorsturen.
>
> EOT
>
> And you get the response as expected:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <?xml-stylesheet type="text/xsl" href="folia.xsl"?>
> <FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="tscan" generator="libfolia-v2.15" version="2.5.1">
>   <!-- BIG BLOB OF XML -->
> </FoLiA>
> READY

Ok, so that at least confirms the server-side works fine.

> It may take a little while so it could perhaps be some race/timing issue.

I think the default methods are blocking, so that shouldn't be the case.

> I've tried giving it some more time e.g. adding a sleep before reading. I've also tried reading again after the initial attempt was empty after a sleep. Here's an attempt of me complicating the line while ( client.read( s ) ) { into this:
>
>   client.setNonBlocking();
>   // wait at least a little bit for the response
>   usleep( 100 );
>   while ( true ) {
>     if ( !client.read( s ) ) {
>       if ( !result.empty() ) {
>         // done reading
>         break;
>       }
>       else {
>         cerr << "giving Frog some more time... ";
>         sleep( 30 );
>         if ( !client.read( s ) ) {
>           // done reading
>           cerr << "still nothing" << endl;
>           break;
>         }
>         else {
>           cerr << s.size() << endl;
>         }
>       }
>     }
>
> Anyway, it doesn't work or I wouldn't be writing this issue down. I'm going on leave for a month but I'm not quite sure how to fix this. Possibly it's some bug in the socket implementation, maybe a work-around could be to simply use a different socket library.

The current socket implementation is in ticcutils by @kosloot, and is basically just a higher level API wrapper around low-level socket functionality in the standard library. The last major changes seem to have been in v0.24 (2020). But he's also been doing some refactoring in later releases.

@kosloot: Do you have a hunch where things might have gotten broken?

> A super low tech (slow) approach could also be to call Frog from the command line which should also work, but is really slow because Frog will have to start up for every document.

Yeah, you definitely don't want that.

kosloot commented 11 months ago

> @kosloot: Do you have a hunch where things might have gotten broken?

I see no reason to believe the problem lies within ticcutils. Almost all changes since 2017 or so were cosmetic.

The example input line isn't the cause either: it would have to be much longer to scare Frog off. Only the built-in dependency parser has a limit of 500 word tokens per sentence, but T-Scan uses Alpino. (Could the problem be related to Alpino?)