vtshitoyan closed this issue 5 years ago
Some questions that we had:
For parsing symbols like &mgr;, is it sufficient to assume that we can escape HTML entities and replace them w/ the unicode / UTF-8 equivalents?
Can we assume that every single \n should be replaced with a single whitespace? Or are there cases where you want to preserve newlines?
All inline reference numbers should simply be stripped away and replaced with nothing, right?
Would it be possible to provide an example of where paragraphs aren't separated (or describe in further detail)?
I know there is a python package “textacy” that can escape HTML codes and convert them to their UTF-8 characters. It can also fix some UTF encoding errors. Perhaps we can use that to post-process the text?
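A minimal sketch of that kind of post-processing with just the standard library (textacy wraps similar steps); the bracketed reference-number pattern is my assumption about what "inline reference numbers" look like, not something confirmed in the thread:

```python
import html
import re

def clean_text(raw):
    """Unescape HTML entities, collapse newline runs to single spaces,
    and strip bracketed inline reference numbers like [12] or [2-4]."""
    text = html.unescape(raw)                               # "&mu;" -> "μ"
    text = re.sub(r"\s*\n\s*", " ", text)                   # newline run -> one space
    text = re.sub(r"\s*\[\d+(?:[-,]\s*\d+)*\]", "", text)   # [1], [2-4], [1,3]
    return text.strip()

print(clean_text("The &mu;-phase forms [12]\nat high temperature."))
```

The order matters: unescaping first means the newline and reference regexes see the final character stream.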
@eddotman Hi Eddie! Sorry for the slightly late reply:
Great, thanks for the reply @OlgaGKononova -
Symbols like &pgr; are now replaced with "p" (since the encoding signifies "p in Greek", which is apparently some rarely-used XML encoding standard).
@eddotman Thank you! To stress the point about encodings: the reason I'd prefer to replace them with a one-character symbol is that when you do word embeddings or any other tasks on the text, it is easier to identify them with a simple query and exclude them.
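Entities like &pgr; and &mgr; come from the ISO Greek entity sets rather than HTML5, so `html.unescape` does not know them; a small custom table could do the one-character replacement described above. The π/μ/Δ mappings below are my reading of "p in Greek" etc., not confirmed against the ACS DTD:

```python
import html

# Assumed ISO Greek entity mappings (not in the HTML5 entity table).
ISO_GREEK = {"&pgr;": "\u03c0", "&mgr;": "\u03bc", "&Dgr;": "\u0394"}

def unescape_entities(text):
    """Replace ISO Greek entities first, then standard HTML entities."""
    for entity, char in ISO_GREEK.items():
        text = text.replace(entity, char)
    return html.unescape(text)

print(unescape_entities("&Dgr;S of the &pgr; bond"))
```

Keeping the table explicit makes it easy to extend as new unmapped entities show up in the corpus.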
I started to test the ACS parser. I created a piece of code for that, similar to the ones I created for RSC and ECS.
LimeSoup-acs/LimeSoup/test/acs_papers/db_test/ParseACS.py
This script creates some files in the db_test folder that show the paragraphs recovered by the parser next to the paragraphs taken directly from raw_test; it is a way to track down problems.
Using it, I found some problems.
Problems:
Paper http://doi.org/10.1021/acsnano.7b02500: some paragraphs are lost in the Results/Discussion section.
Papers:
http://doi.org/10.1021/es405433t
http://doi.org/10.1021/jm401352a
http://doi.org/10.1021/cm9035693
No paragraphs are recovered by the parser.
This test code can be improved to work better with ACS journals. In my experience, as you go through papers you find patterns that the parser does not yet handle.
@eddotman please let me know if you need something related to that.
Thanks for putting that together @tiagobotari
I looked at the test code - it looks like you're testing HTMLs, but the parser for ACS is currently only built to support ACS XMLs (as this is the format that ACS has been using to deliver full text articles to us). So any HTML files would fail using this parser (and/or a separate HTML parser would be needed for ACS HTMLs, but we don't have any method of obtaining ACS HTMLs as far as I know).
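A rough way to route files to the right parser could be a header sniff before handing the text to LimeSoup; the exact declarations ACS uses are an assumption here, since I only know from the thread that deliveries are XML:

```python
def looks_like_acs_xml(text):
    """Heuristic: XML deliveries typically start with an XML declaration,
    while scraped pages start with an HTML doctype or <html> tag.
    (The exact header strings are assumptions, not the confirmed ACS format.)"""
    head = text.lstrip()[:200].lower()
    if head.startswith("<?xml"):
        return True
    if head.startswith("<!doctype html") or head.startswith("<html"):
        return False
    # Fallback: treat anything without an <html> tag near the top as XML.
    return "<html" not in head
```

Files failing this check could be counted and set aside rather than passed to the XML parser.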
Can we discuss that? We can do that anytime today after lunch.
I can do between 3-4pm ET today if that works for you.
3pm is 12:00 here; can we start at 4pm? I could also do it now.
Unfortunately I have an appointment then - I can do any time between 12pm - 5pm ET tomorrow though.
I can't, but on Thursday I can anytime. Is Thursday good for you?
Yep, Thursday is good. How about 2pm ET? Can you send a zoom link in the email thread?
Just to double check - are we still good for a call in ~1hr?
Hi @eddotman, I will send the link for the call 15 min before the start. Best,
Video Call: https://zoom.us/j/399439114
Quick update post-call:
There is definitely a bug related to nesting XML paragraphs (since XMLs don't have h1/h2/etc. tags), and some paragraphs are being dropped in some edge cases. One edge case is the file xmls/101021acsnano7b02500.xml, so I will look into a fix for that and update my pull request.
Thank you very much @eddotman. Also, we identified that some of the papers in our database are in HTML and not in XML. We need to check how many there are (probably not many). If the number of HTML papers is small, we can just remove them; if not, maybe we need to download them from the internet again.
I think there are only a couple of thousand HTMLs for ACS; I had checked this before. Not worth the time.
Ok the missing paragraphs should be fixed on the latest commit now. The issue was that I was systematically missing paragraphs like the following...
[1] 2. Results
[2] This is some text...
[3] 2.1 Synthesis
[4] This is some other text
I was missing cases like line [2], since the text is in between sections/subsections. Anyway, it should be working now.
The results are now stored like the attached image. If you look at the content for a section, some of the content will be strings, but others will be nested objects if they are subsections.
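The fix described above (keeping paragraphs that sit between a section heading and its first subsection, and nesting subsections as objects) can be sketched like this with ElementTree; the `<sec>`/`<title>`/`<p>` tag names are illustrative, not necessarily the actual ACS schema:

```python
import xml.etree.ElementTree as ET

def collect_section(elem):
    """Walk a section element in document order: paragraphs directly under
    the section (including ones between subsections) become strings, and
    nested <sec> elements become sub-dicts, so nothing is skipped."""
    content = []
    for child in elem:
        if child.tag == "p":
            content.append("".join(child.itertext()).strip())
        elif child.tag == "sec":
            title = child.findtext("title", default="")
            content.append({"name": title, "content": collect_section(child)})
    return content

doc = ET.fromstring(
    "<sec><title>2. Results</title>"
    "<p>This is some text...</p>"
    "<sec><title>2.1 Synthesis</title>"
    "<p>This is some other text</p></sec></sec>"
)
print(collect_section(doc))
```

Because the loop visits children in order rather than querying only for subsections, the paragraph between "2. Results" and "2.1 Synthesis" is retained.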
Hey everyone, just wanted to check in on this and PR #23 - any thoughts on if it's ready to merge/use? Would it make sense to have a call to sync up soon? I can send out a doodle poll / email for times next week if that's generally good for people @ Berkeley.
@tiagobotari could you confirm that this is good to go? I'll then merge it with the master, re-parse the ACS and we can go from there. A call would probably be useful to follow up with fixes of Springer and Wiley parsers?
@eddotman and @vtshitoyan It seems to be working fine; we can merge it. I would like to come back to the parser after 13/11 if it is not urgent.
Sure - I don't think it's urgent. In any case, I think it makes sense to confirm that the ACS parser is working as intended before worrying about polishing the other parsers, since having the ACS parser "finished" (more or less) gives us a nice starting point for building out shared APIs / Python methods that can interface w/ both the MIT and Berkeley backends.
@eddotman I agree with you. Thank you for bringing that up.
Great! @vtshitoyan are you cool with running the ACS parser as-is and letting us know how that goes once it's done running? Thanks!
Will do.
@OlgaGKononova @tiagobotari I have re-run the ACS parser, with results stored in the Paper_Parsed collection. Could you please take a few random examples using the following filter {"publisher": "American Chemical Society (ACS)", "parser_version" : "0.2.2"}, and make sure you are happy with it. I'll close this issue after your feedback and we can move on to implementing the fixes for the other 2 parsers.
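For the spot check, something like the following aggregation pipeline could pull random matching documents from Paper_Parsed; only the two filter fields come from the thread, and the pipeline would be passed to `collection.aggregate()` with pymongo:

```python
def sample_pipeline(publisher, parser_version, n=5):
    """Build a MongoDB aggregation pipeline that filters by publisher and
    parser version, then draws n random documents via $sample."""
    return [
        {"$match": {"publisher": publisher, "parser_version": parser_version}},
        {"$sample": {"size": n}},
    ]

pipeline = sample_pipeline("American Chemical Society (ACS)", "0.2.2")
print(pipeline)
```

Usage would be e.g. `db["Paper_Parsed"].aggregate(pipeline)`, where the collection name is taken from the message above.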
Quick check - any status update on this @OlgaGKononova @tiagobotari @vtshitoyan ?
I figure that if we're reasonably happy with this parser, then we can move onto both a unified python API using both databases + fixing up the other parsers (as @vtshitoyan suggested). Cheers!
I did a quick check and the parsed paragraphs look good to me. Closing this issue.
Some feedback from Olga based on about 10 randomly chosen papers.
General:
10.1021/ja068965r:
10.1021/ja0024340: Weird symbols: "\nThe energy of this transition state lies 21.1 kcal/mol above\nthe separated species. Using a typical18 &Dgr;S⧧ of −27 cal deg-1\nmol-1,"
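A possible post-processing pass for strings like the one reported above, assuming the stray `\n` runs and unresolved `&Dgr;` are literal artifacts in the stored text (shown on a simplified version of the reported string):

```python
import re

# Assumed mapping for the unresolved ISO "Delta greek" entity.
GREEK = {"&Dgr;": "\u0394"}

def post_clean(text):
    """Replace leftover Greek entities, then collapse stray newlines."""
    for ent, ch in GREEK.items():
        text = text.replace(ent, ch)
    return re.sub(r"\s*\n\s*", " ", text).strip()

print(post_clean("\nUsing a typical &Dgr;S of -27 cal deg-1\nmol-1"))
```

This would not fix fused reference superscripts like "typical18", which likely need to be handled at parse time rather than after the fact.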