vtshitoyan closed this issue 5 years ago
Some questions that we had:
For parsing symbols like &mgr;, is it sufficient to assume that we can escape HTML entities and replace them w/ the unicode / UTF-8 equivalents?
Can we assume that every single \n should be replaced with a single whitespace? Or are there cases where you want to preserve newlines?
All inline reference numbers should simply be stripped away and replaced with nothing, right?
Would it be possible to provide an example of where paragraphs aren't separated (or describe in further detail)?
I know there is a python package “textacy” that can escape HTML codes and convert them to their UTF-8 characters. It can also fix some UTF encoding errors. Perhaps we can use that to post-process the text?
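A minimal sketch of that kind of post-processing with just the standard library (textacy wraps similar steps); the bracketed reference-number pattern is my assumption about what "inline reference numbers" look like, not something confirmed in the thread:

```python
import html
import re

def clean_text(raw):
    """Unescape HTML entities, collapse newline runs to single spaces,
    and strip bracketed inline reference numbers like [12] or [2-4]."""
    text = html.unescape(raw)                               # "&mu;" -> "μ"
    text = re.sub(r"\s*\n\s*", " ", text)                   # newline run -> one space
    text = re.sub(r"\s*\[\d+(?:[-,]\s*\d+)*\]", "", text)   # [1], [2-4], [1,3]
    return text.strip()

print(clean_text("The &mu;-phase forms [12]\nat high temperature."))
```

The order matters: unescaping first means the newline and reference regexes see the final character stream.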
@eddotman Hi Eddie! Sorry for the slightly late reply:
Great, thanks for the reply @OlgaGKononova -
Symbols like &pgr; are now replaced with "p" (since the encoding signifies "p in Greek", which is apparently some rarely-used XML encoding standard).
@eddotman Thank you! To stress the point about encodings: the reason I'd prefer to replace them with a one-character symbol is that when you do word embeddings or any other tasks on the text, it is easier to identify them with a simple query and exclude them.
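Entities like &pgr; and &mgr; come from the ISO Greek entity sets rather than HTML5, so `html.unescape` does not know them; a small custom table could do the one-character replacement described above. The π/μ/Δ mappings below are my reading of "p in Greek" etc., not confirmed against the ACS DTD:

```python
import html

# Assumed ISO Greek entity mappings (not in the HTML5 entity table).
ISO_GREEK = {"&pgr;": "\u03c0", "&mgr;": "\u03bc", "&Dgr;": "\u0394"}

def unescape_entities(text):
    """Replace ISO Greek entities first, then standard HTML entities."""
    for entity, char in ISO_GREEK.items():
        text = text.replace(entity, char)
    return html.unescape(text)

print(unescape_entities("&Dgr;S of the &pgr; bond"))
```

Keeping the table explicit makes it easy to extend as new unmapped entities show up in the corpus.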
I started to test the ACS parser. I created a piece of code for that, similar to the ones I created for RSC and ECS.
LimeSoup-acs/LimeSoup/test/acs_papers/db_test/ParseACS.py
This script creates some files in the db_test folder that show the paragraphs recovered by the parser next to the paragraphs taken directly from raw_test; it is a way to track down problems.
Using it, I found some problems.
Problems:
Paper http://doi.org/10.1021/acsnano.7b02500: some paragraphs are lost in the Results/Discussion section.
Papers:
http://doi.org/10.1021/es405433t
http://doi.org/10.1021/jm401352a
http://doi.org/10.1021/cm9035693
No paragraphs are recovered by the parser.
This test code can be improved to work better with ACS journals. In my experience, as you go through papers you find patterns that the parser does not yet handle.
@eddotman please let me know if you need something related to that.
Thanks for putting that together @tiagobotari
I looked at the test code - it looks like you're testing HTMLs, but the parser for ACS is currently only built to support ACS XMLs (as this is the format that ACS has been using to deliver full text articles to us). So any HTML files would fail using this parser (and/or a separate HTML parser would be needed for ACS HTMLs, but we don't have any method of obtaining ACS HTMLs as far as I know).
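A rough way to route files to the right parser could be a header sniff before handing the text to LimeSoup; the exact declarations ACS uses are an assumption here, since I only know from the thread that deliveries are XML:

```python
def looks_like_acs_xml(text):
    """Heuristic: XML deliveries typically start with an XML declaration,
    while scraped pages start with an HTML doctype or <html> tag.
    (The exact header strings are assumptions, not the confirmed ACS format.)"""
    head = text.lstrip()[:200].lower()
    if head.startswith("<?xml"):
        return True
    if head.startswith("<!doctype html") or head.startswith("<html"):
        return False
    # Fallback: treat anything without an <html> tag near the top as XML.
    return "<html" not in head
```

Files failing this check could be counted and set aside rather than passed to the XML parser.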
Can we discuss that? We can do that anytime today after lunch.
I can do between 3-4pm ET today if that works for you.
3pm is 12:00 here; can we start at 4pm? I could also do it now.
Unfortunately I have an appointment then - I can do any time between 12pm - 5pm ET tomorrow though.
I can't, but on Thursday I can anytime. Is Thursday good for you?
Yep, Thursday is good. How about 2pm ET? Can you send a zoom link in the email thread?
Just to double check - are we still good for a call in ~1hr?
Hi @eddotman, I will send the link for the call 15 min before the start. Best,
Video Call: https://zoom.us/j/399439114
Quick update post-call:
There is definitely a bug related to nesting XML paragraphs (since XMLs don't have h1/h2/etc. tags), and some paragraphs are being dropped in some edge cases. One edge case is the file xmls/101021acsnano7b02500.xml, so I will look into a fix for that and update my pull request.
Thank you very much @eddotman. Also, we identified that some of the papers in our database are in HTML and not in XML. We need to check how many there are (probably not many). If the number of HTML papers is small, we can just remove them; if not, maybe we need to download them from the internet again.
I think there are only a couple of thousand HTMLs for ACS; I had checked this before. Not worth the time.
Ok the missing paragraphs should be fixed on the latest commit now. The issue was that I was systematically missing paragraphs like the following...
[1] 2. Results
[2] This is some text...
[3] 2.1 Synthesis
[4] This is some other text
I was missing cases like line [2], since the text is in between sections/subsections. Anyway, it should be working now.
The results are now stored like the attached image. If you look at the content for a section, some of the content will be strings, but others will be nested objects if they are subsections.
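The fix described above (keeping paragraphs that sit between a section heading and its first subsection, and nesting subsections as objects) can be sketched like this with ElementTree; the `<sec>`/`<title>`/`<p>` tag names are illustrative, not necessarily the actual ACS schema:

```python
import xml.etree.ElementTree as ET

def collect_section(elem):
    """Walk a section element in document order: paragraphs directly under
    the section (including ones between subsections) become strings, and
    nested <sec> elements become sub-dicts, so nothing is skipped."""
    content = []
    for child in elem:
        if child.tag == "p":
            content.append("".join(child.itertext()).strip())
        elif child.tag == "sec":
            title = child.findtext("title", default="")
            content.append({"name": title, "content": collect_section(child)})
    return content

doc = ET.fromstring(
    "<sec><title>2. Results</title>"
    "<p>This is some text...</p>"
    "<sec><title>2.1 Synthesis</title>"
    "<p>This is some other text</p></sec></sec>"
)
print(collect_section(doc))
```

Because the loop visits children in order rather than querying only for subsections, the paragraph between "2. Results" and "2.1 Synthesis" is retained.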
Hey everyone, just wanted to check in on this and PR #23 - any thoughts on if it's ready to merge/use? Would it make sense to have a call to sync up soon? I can send out a doodle poll / email for times next week if that's generally good for people @ Berkeley.
@tiagobotari could you confirm that this is good to go? I'll then merge it with the master, re-parse the ACS and we can go from there. A call would probably be useful to follow up with fixes of Springer and Wiley parsers?
@eddotman and @vtshitoyan It seems to be working fine; we can merge it. I would like to come back to the parser after 13/11 if it is not urgent.
Sure - I don't think it's urgent. In any case, I think it makes sense to confirm that the ACS parser is working as intended before worrying about polishing the other parsers, since having the ACS parser "finished" (more or less) gives us a nice starting point for building out shared APIs / Python methods that can interface w/ both the MIT and Berkeley backends.
@eddotman I agree with you. Thank you for bringing that up.
Great! @vtshitoyan are you cool with running the ACS parser as-is and letting us know how that goes once it's done running? Thanks!
Will do.
@OlgaGKononova @tiagobotari I have re-run the ACS parser, with results stored in the Paper_Parsed collection. Could you please take a few random examples using the following filter {"publisher": "American Chemical Society (ACS)", "parser_version" : "0.2.2"}, and make sure you are happy with it. I'll close this issue after your feedback and we can move on to implementing the fixes for the other 2 parsers.
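For the spot check, something like the following aggregation pipeline could pull random matching documents from Paper_Parsed; only the two filter fields come from the thread, and the pipeline would be passed to `collection.aggregate()` with pymongo:

```python
def sample_pipeline(publisher, parser_version, n=5):
    """Build a MongoDB aggregation pipeline that filters by publisher and
    parser version, then draws n random documents via $sample."""
    return [
        {"$match": {"publisher": publisher, "parser_version": parser_version}},
        {"$sample": {"size": n}},
    ]

pipeline = sample_pipeline("American Chemical Society (ACS)", "0.2.2")
print(pipeline)
```

Usage would be e.g. `db["Paper_Parsed"].aggregate(pipeline)`, where the collection name is taken from the message above.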
Quick check - any status update on this @OlgaGKononova @tiagobotari @vtshitoyan ?
I figure that if we're reasonably happy with this parser, then we can move onto both a unified python API using both databases + fixing up the other parsers (as @vtshitoyan suggested). Cheers!
I did a quick check and the parsed paragraphs look good to me. Closing this issue.
Some feedback from Olga based on about 10 randomly chosen papers.
General:
10.1021/ja068965r:
10.1021/ja0024340: Weird symbols: "\nThe energy of this transition state lies 21.1 kcal/mol above\nthe separated species. Using a typical18 &Dgr;S⧧ of −27 cal deg-1\nmol-1,"
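A possible post-processing pass for strings like the one reported above, assuming the stray `\n` runs and unresolved `&Dgr;` are literal artifacts in the stored text (shown on a simplified version of the reported string):

```python
import re

# Assumed mapping for the unresolved ISO "Delta greek" entity.
GREEK = {"&Dgr;": "\u0394"}

def post_clean(text):
    """Replace leftover Greek entities, then collapse stray newlines."""
    for ent, ch in GREEK.items():
        text = text.replace(ent, ch)
    return re.sub(r"\s*\n\s*", " ", text).strip()

print(post_clean("\nUsing a typical &Dgr;S of -27 cal deg-1\nmol-1"))
```

This would not fix fused reference superscripts like "typical18", which likely need to be handled at parse time rather than after the fact.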