delph-in / pydelphin

Python libraries for DELPH-IN
https://pydelphin.readthedocs.io/
MIT License

AttributeError: 'NoneType' object has no attribute 'data' #325

Closed arademaker closed 3 years ago

arademaker commented 3 years ago
$ delphin process -g erg.dat -o "-n 1 --timeout=60 --max-words=150 --max-chart-megabytes=4000 --max-unpack-megabytes=5000 --rooted-derivations --udx --disable-generalization" -s dataset.a dataset.ap 
...
Traceback (most recent call last):
  File "/u/alexrad/venv/bin/delphin", line 11, in <module>
    sys.exit(main())
  File "/u/alexrad/venv/lib64/python3.6/site-packages/delphin/main.py", line 42, in main
    args.func(args)
  File "/u/alexrad/venv/lib64/python3.6/site-packages/delphin/cli/process.py", line 46, in call_process
    gzip=args.gzip)
  File "/u/alexrad/venv/lib64/python3.6/site-packages/delphin/commands.py", line 663, in process
    target.process(cpu, **process_kwargs)
  File "/u/alexrad/venv/lib64/python3.6/site-packages/delphin/itsdb.py", line 881, in process
    response = cpu.process_item(datum, keys=keys_dict)
  File "/u/alexrad/venv/lib64/python3.6/site-packages/delphin/ace.py", line 284, in process_item
    response = self.interact(datum)
  File "/u/alexrad/venv/lib64/python3.6/site-packages/delphin/ace.py", line 258, in interact
    result = self.receive()
  File "/u/alexrad/venv/lib64/python3.6/site-packages/delphin/ace.py", line 234, in _tsdb_receive
    response = _tsdb_response(response, line)
  File "/u/alexrad/venv/lib64/python3.6/site-packages/delphin/ace.py", line 688, in _tsdb_response
    for key, val in _sexpr_data(line):
  File "/u/alexrad/venv/lib64/python3.6/site-packages/delphin/ace.py", line 679, in _sexpr_data
    if len(expr.data) != 2:
AttributeError: 'NoneType' object has no attribute 'data'

Trying to obtain more info with the -v option...

goodmami commented 3 years ago

It looks like the S-Expression parser at delphin.util.SExpr.parse() can return None if it consumes the whole line of input without parsing anything properly, but it's hard to say what situation would trigger that. I'd start by isolating the item in the profile that causes the issue (to make a minimal test case), then print out the lines sent to the S-Expression parser to see which input causes the problem. One possibility is that one of your ACE options changes what the output looks like, and this causes the S-Expression parser to fail. I can't say for certain without a reproducible test case.
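
One way to do that "print out the lines" step without getting an opaque AttributeError is to wrap the parse call. This is only a sketch: `checked_parse` and `fake_parse` are hypothetical names, not part of PyDelphin, and it assumes the real parser is passed in as a callable that returns None on failure.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def checked_parse(parse: Callable[[str], Optional[Any]], line: str) -> Any:
    """Call parse(line) and raise a descriptive error if it returns None.

    `parse` is whatever S-Expression parser is in use; it is passed in as
    a parameter so this helper makes no assumption about its location.
    """
    # Log every input so the last line before a failure is visible
    logger.debug("S-Expression input: %r", line)
    result = parse(line)
    if result is None:
        raise ValueError(
            f"S-Expression parser returned None for input: {line!r}"
        )
    return result


# Stand-in parser for illustration only (hypothetical): it returns None
# for blank input, mimicking a parser that finds nothing to parse.
def fake_parse(line: str) -> Optional[str]:
    return line.strip() or None
```

With something like this in place, the traceback would show the exact line ACE sent, which is what a minimal test case needs.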

arademaker commented 3 years ago
delphin process -v -g erg.dat -o "-n 1 --timeout=60 --max-words=150 --max-chart-megabytes=4000 --max-unpack-megabytes=5000 --rooted-derivations --udx --disable-generalization" -s dataset.a dataset.ap

This is my command line. I can edit the Python code to add a print, but is there a way to increase the verbosity so I can identify the sentence causing the error?

arademaker commented 3 years ago

OK. Trying now with -vv. Now I have the number of the item being processed printed ...

goodmami commented 3 years ago

@arademaker were you able to isolate a sentence that reproduces the error?

arademaker commented 3 years ago

Yes, but since I am running the process on a computer cluster, my hypothesis is that some memory limit killed the ACE process. The same sentence run on my local machine didn't produce a parse, but it didn't produce any strange output either. I have increased the job memory limit on the cluster and hope to have the processed profiles by tomorrow. Should we close this issue or keep it open?

arademaker commented 3 years ago

A very weird behavior:

INFO:delphin.itsdb:Processed item            27017         1 results
INFO:delphin.itsdb:Processed item            27018         1 results
INFO:delphin.itsdb:Processed item            27019         1 results
INFO:delphin.itsdb:Processed item            27020         1 results
INFO:delphin.itsdb:Processed item            27021         1 results
INFO:delphin.itsdb:Processed item            27022         1 results
INFO:delphin.itsdb:Processed item            27023         1 results
INFO:delphin.itsdb:Processed item            27024         1 results
INFO:delphin.itsdb:Processed item            27025         1 results
INFO:delphin.itsdb:Processed item            27026         1 results
NOTE: parsed 2498 / 2698 sentences, avg 359845k, time 8983.25098s

How can a profile with 27026 sentences be summarized as only 2698 sentences? Which tool is responsible for this NOTE line? ACE or PyDelphin?

arademaker commented 3 years ago

The produced profile has

[alexrad@dccxl001 dataset.ap]$ wc item parse result
    27026    461165   3479648 item
    27026  16418746  95129936 parse
    24909  70578741 420916849 result

goodmami commented 3 years ago

That NOTE line is from ACE. I suspect the profile may have ill-formed or ignored sentences, and that ACE is only reporting those marked well-formed (i-wf = 1). Can you perform the following query with PyDelphin?

$ delphin select 'i-id where i-wf = 1' dataset.ap | wc -l

arademaker commented 3 years ago
% delphin select 'i-id where i-wf = 1' dataset.ap | wc -l
   27023

Does it confirm your hypothesis? If so, what would be an ill-formed or ignored sentence? Why are they not reported as not parsed?!

arademaker commented 3 years ago
% delphin select 'i-id where i-wf = 0' dataset.ap
3577
5329
12803
% awk -F "@" '$1 == 3577 || $1 == 5329 || $1 == 12803' dataset.ap/item
3577@@@@1@@@@@@0@0@@@
5329@@@@1@@ Earth has one moon.@@@@0@4@@@
12803@@@@1@@ Superconductor: A type of electrical conductor that permits a current to flow with zero resistance.@@@@0@15@@@

arademaker commented 3 years ago

Hmm, in the original files, all three cases start with *:

 *
 * Earth has one moon.
 * Superconductor: A type of electrical conductor that permits a current to flow with zero resistance.

arademaker commented 3 years ago

BTW, the data I am using is https://allenai.org/data/scitail. I took the tsv_format files and put the first column in one profile and the second column in another profile.

goodmami commented 3 years ago

Ok, so that disproves my hypothesis. Sometimes items are skipped if the i-length field is -1 or 0, but I don't think PyDelphin does that, and I'm not sure about art. I think this is a question for Woodley.

For those 3 "ungrammatical" items, the mkprof command looks for * in the first column and, if it's there, interprets that as an ungrammaticality star, removing the star and setting i-wf to 0. The easiest way to disable this is probably to remove those stars from your text file before running mkprof.

$ cat txt
* one two
* * three four
five six
$ sed -r 's/^(\* ?)*//' txt 
one two
three four
five six
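
If the text is already being prepared in Python, the same cleanup could be sketched as below. The function names are made up for illustration; the regular expression mirrors the sed pattern above and only touches stars at the very start of a line.

```python
import re
from typing import Iterable, Iterator

# Same pattern as the sed command: zero or more "*" characters, each
# optionally followed by one space, anchored at the start of the line.
_LEADING_STARS = re.compile(r'^(?:\* ?)*')


def strip_leading_stars(line: str) -> str:
    """Remove leading ungrammaticality stars from a single line."""
    return _LEADING_STARS.sub('', line, count=1)


def clean_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each input line with its leading stars removed."""
    for line in lines:
        yield strip_leading_stars(line)
```

For example, `list(clean_lines(["* one two", "* * three four", "five six"]))` gives `["one two", "three four", "five six"]`, matching the sed output.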

goodmami commented 3 years ago

I tried looking a little further into this but I'm not sure if there's any problem in PyDelphin. To recap:

  1. Error during parsing

    You're getting an error when PyDelphin attempts to parse the S-Expression output of ACE, and this happens on one machine but not another; possibly a memory error or something similar.

    Conclusion: unless this can be reproduced reliably with a particular input, I cannot proceed to debug it.

  2. Fewer items processed than in profile

    You have N items in a profile but processing reports X/Y parsed where Y is < N. This can be the case if ACE (or art) skipped some items, which it does for empty inputs or when i-wf is not 1. Another possibility, which I now think may be more likely, is that ACE crashed during processing and PyDelphin was able to recover by restarting ACE. After recovering (possibly multiple times), it processed Y more items until the end of the profile. The restarted ACE process is not aware of any work previous processes performed, so it only reports that Y were processed.

    Conclusion: the number of items reported as parsed by ACE should not be considered reliable.

  3. Items starting with *

    I tested this out and there is no problem. PyDelphin does not have any issue creating profiles from sentences starting with *, even if the line is just *, or with processing those profiles.

    Conclusion: preprocess those items to remove the * if you don't want them interpreted as ungrammaticality markers.
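
Regarding point 2, since ACE's own summary may only cover the last ACE process, one alternative is to count from PyDelphin's log lines. The following is a rough sketch, assuming the `-vv` output was saved and that every item produces a line in the "Processed item ... N results" format shown earlier in this thread (including "0 results" for unparsed items, which is an assumption). `count_processed` is a hypothetical helper, not a PyDelphin function.

```python
import re
from typing import Iterable, Tuple

# Matches log lines such as:
# INFO:delphin.itsdb:Processed item            27017         1 results
_PROCESSED = re.compile(r'Processed item\s+(\d+)\s+(\d+) results')


def count_processed(log_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (items processed, items with at least one result)."""
    processed = 0
    with_results = 0
    for line in log_lines:
        match = _PROCESSED.search(line)
        if match is None:
            continue  # skip ACE NOTE lines and anything else
        processed += 1
        if int(match.group(2)) > 0:
            with_results += 1
    return processed, with_results
```

These totals do not depend on how many times ACE was restarted, because they are taken from the PyDelphin side of the pipeline.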

If you have any further issues feel free to open a new issue, or reopen this one if you can reproduce item (1) above reliably.