Extracting data behaves weirdly

QingyangDong-qd220 / BandgapDatabase1

Codes to generate a bandgap database using ChemDataExtractor.

http://chemdataextractor.org/

MIT License


Extracting data behaves weirdly #2

Open matildaminerva opened 1 year ago

matildaminerva commented 1 year ago

Dear author,

I have successfully used your code with ChemDataExtractor2 to extract bandgap values. However, I have spotted some weird behaviour whose cause I have not yet been able to find: I have a set of 600 texts, and when I run your code on all of them in one run I get 114 extracted bandgap values in total, but when I split the text set into 5 chunks of 120 texts each and run the code separately on these 5 chunks, I get over 300 values in total. The script and texts are exactly the same; the only thing that changes is the amount of text processed in one run. Has this kind of behaviour happened before, or do you have any idea what could be causing it?

Best, Matilda

QingyangDong-qd220 commented 1 year ago

Hi, thanks for pointing this out. This behavior is unexpected, but I'm not entirely sure what you mean by "600 texts". Is that 600 word tokens or 600 sentences? I believe that in this version of the Snowball parser there are constraints that limit the maximum length of one sentence, specifically no more than 300 tokens per sentence, if my memory is correct. If you send multiple sentences at once without splitting them, that limit can easily be reached, and the whole chunk will be skipped. Ideally I would like to see your code and texts to pinpoint the exact cause of this issue, but my initial guess is that you need to pass one sentence at a time, instead of cutting everything into 5 batches. Hope this helps.
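One way to check whether the sentence-length limit is in play is to flag overly long sentences before running the extraction. The following is a minimal sketch, not part of BandgapDatabase1 or ChemDataExtractor: `MAX_TOKENS` uses the 300-token figure recalled above (unverified), and `rough_sentences` and `over_limit` are hypothetical helpers that rely on a naive regex splitter and whitespace tokens, so the counts only approximate what the real tokenizer would see.

```python
import re

# Assumed limit, taken from the recollection in the comment above;
# not verified against the Snowball parser source.
MAX_TOKENS = 300


def rough_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def over_limit(text, max_tokens=MAX_TOKENS):
    """Return (token_count, sentence) pairs for sentences longer than max_tokens."""
    flagged = []
    for sentence in rough_sentences(text):
        # Whitespace tokens are only a rough proxy for the parser's tokens.
        n_tokens = len(sentence.split())
        if n_tokens > max_tokens:
            flagged.append((n_tokens, sentence))
    return flagged
```

If `over_limit` returns nothing for every snippet, the length limit is probably not what is causing records to be dropped.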

Best wishes, Qingyang

matildaminerva commented 1 year ago

Aha, this is good to know, but I am still wondering whether this is the reason, since I am splitting my text into sentences. In my case, "600 texts" means snippets of text: one snippet is 7 sentences long, and there are 600 of those 7-sentence snippets in total. I am using your code 'extract.py' from the repository BandgapDatabase1 almost without changes (here is the file I am using: extract_Matilda.txt), so the "list of articles" mentioned in the code is in my case a folder with 600 .txt files, where every file contains one 7-sentence snippet.

QingyangDong-qd220 commented 1 year ago

I have looked through your code, but I still couldn't figure out how this issue occurs. Sentence splitting was not the cause, apparently. I may have to reproduce this problem if we are to solve this issue for good. If you could send me one of your .txt files that consistently yields more data records when sent in small batches, I can have a closer look. In the meantime, if processing files in smaller chunks gives you more results, I would presume that processing one file at a time might give even better results.
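To compare runs systematically, the record totals can be tabulated for several batch sizes, down to one file per run. This is a minimal sketch under stated assumptions: `extract_batch` is a hypothetical callable (list of file paths in, list of extracted records out) standing in for whatever one run of extract.py does; it is not an existing function in BandgapDatabase1 or ChemDataExtractor.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_records(paths, extract_batch, batch_size):
    """Run `extract_batch` on each chunk of `paths` and sum the record counts.

    `extract_batch` is a hypothetical stand-in for one run of the
    extraction script: it takes a list of paths and returns a list of records.
    """
    total = 0
    for batch in chunked(paths, batch_size):
        total += len(extract_batch(batch))
    return total


def compare_batch_sizes(paths, extract_batch, sizes):
    """Return {batch_size: total_records} for each size in `sizes`."""
    return {size: count_records(paths, extract_batch, size) for size in sizes}
```

For the case described in this thread, `compare_batch_sizes(paths, extract_batch, [600, 120, 1])` would show the totals for one run, five runs, and one run per file side by side.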

matildaminerva commented 1 year ago

Yes, I could actually send my whole 600-text dataset, since this is the one with which I have been able to reproduce the problem. Would it be possible to get your email address so that I can send the texts?

QingyangDong-qd220 commented 1 year ago

Of course, please send everything to qd220@cam.ac.uk, and I will get back to you if I find anything. Meanwhile, Snowball 2.0 was just released two days ago, which also works with ChemDataExtractor2. Please give that a try. The code is under my project Snowball_2.0, and the link to the paper can be found in the manual (https://pubs.acs.org/doi/10.1021/acs.jcim.3c01281).