Tokenizing - Githubissues

matildaminerva commented 1 year ago

Hi,

and thank you for sharing your code. However, I ran into a problem when running it:

In your snowball.py file, the function "parse_sentence" (row 445) takes the arguments 'self' and 'tokens'. But this function is called from chemdataextractor2/doc/text.py, where the sentence is not tokenized and the function is called with only 'self' (row 817).

Now when I run extract.py, it gives me an error because the sentence has not been tokenized. In the original snowball.py script that is installed automatically with chemdataextractor2, the function "parse_sentence" also contains a tokenizing row (tokens=sentence.tokens) in row 499.

Should I also add tokens=sentence.tokens to the "parse_sentence" function in your updated snowball.py in order to get the tokens?

QingyangDong-qd220 commented 1 year ago

Thanks for reaching out. Yes, please try adding tokens = sentence.tokens or tokens = sentence.tagged_tokens; both should work (if they don't, try if isinstance(tokens, Sentence): tokens = tokens.tagged_tokens). This is most likely caused by a version conflict. In ChemDataExtractor2.1 and later, which is very likely the version you are using, the parse_sentence method from BaseSentenceParser no longer requires the tokens to be passed in as input and uses the sentence object directly, whereas in ChemDataExtractor2.0, which is the one I used when writing this code, tokens is a required input. A sentence object is different from a list of tokens, which I assume is the cause of this error. Please let me know if there are more issues.
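A minimal sketch of the isinstance fallback described above, for readers who want to see it in context. The Sentence class here is a hypothetical stand-in so the snippet runs without ChemDataExtractor installed, SnowballLikeParser and ensure_tokens are illustrative names that do not appear in the repository, and the (word, tag) tuple shape of tagged_tokens is an assumption.

```python
class Sentence:
    """Hypothetical stand-in for ChemDataExtractor's Sentence class.

    Only the tagged_tokens attribute used below is modelled; the
    (word, tag) tuple shape is an assumption for illustration.
    """

    def __init__(self, tagged_tokens):
        self.tagged_tokens = tagged_tokens


def ensure_tokens(tokens):
    """Accept either a Sentence-like object or a plain token list."""
    if isinstance(tokens, Sentence):
        # Newer call sites pass the sentence object itself.
        return tokens.tagged_tokens
    # Older call sites already pass the list of tokens.
    return tokens


class SnowballLikeParser:
    """Illustrative parser; not the actual Snowball implementation."""

    def parse_sentence(self, tokens):
        tokens = ensure_tokens(tokens)
        # Placeholder for the real parsing logic: return the words only.
        return [word for word, _tag in tokens]


parser = SnowballLikeParser()
sentence = Sentence([("GaN", "NN"), ("bandgap", "NN")])
words_from_sentence = parser.parse_sentence(sentence)
words_from_list = parser.parse_sentence([("GaN", "NN"), ("bandgap", "NN")])
```

With this guard in place, the same method body works whether the caller passes a sentence object or a token list, which is the point of the fallback.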

(FYI, snowball.py was not written by me. I have been working on Snowball 2.0, which is significantly different from the previous version and is almost ready to be released. It is meant for ChemDataExtractor2.1 and later, so this error should not happen again.)

matildaminerva commented 1 year ago

Thank you, I added tokens=sentence.tokens and now everything works perfectly! I still have one additional question: what is the difference between the special and general snowball models in this repository? So far I have only been using the general one in extract.py, and I am wondering how the special patterns differ from the general ones.

QingyangDong-qd220 commented 1 year ago

The general snowball model works just like any other parser. The special snowball model was introduced as a data cleaning technique: it was trained on sentences that contain seemingly correct but actually wrong information about the data record, so that anything caught by it can be removed from the database, thus improving precision. In practice, though, its effect is almost negligible (less than 1% in my testing), presumably because regular data cleaning methods are powerful enough. The only reason it was mentioned is that I spent too much time on it. I would suggest you spend your time adding new data cleaning filters instead of training a special snowball model.
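As a rough illustration of the cleaning idea above, the sketch below drops any record that the special model also extracted. It assumes records can be compared as plain tuples; the function name, the record layout, and the example values are hypothetical, and the matching logic actually used in the repository is not stated in this thread.

```python
def remove_flagged(general_records, special_records):
    """Keep only the records that the special model did not catch.

    Both arguments are lists of hashable records, assumed here to be
    (compound, value, unit) tuples purely for illustration.
    """
    flagged = set(special_records)
    return [record for record in general_records if record not in flagged]


# Made-up example records, not taken from the actual database.
general = [("GaN", "3.4", "eV"), ("Si", "1.1", "eV"), ("ZnO", "0.5", "eV")]
special = [("ZnO", "0.5", "eV")]

cleaned = remove_flagged(general, special)
```

Since the effect was reported as under 1%, a filter like this is probably best treated as an optional last step after the regular data cleaning filters.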

QingyangDong-qd220 / BandgapDatabase1

Tokenizing #1