bgyori closed this issue 7 years ago
@bgyori, thanks. This paper doesn't appear to be part of the open access subset. Are you processing just the nxml for the abstract, or are you processing the full text? If the latter, what format are you using? If it is the raw text of the full paper and you're including the references, that may be one reason it is slow.
Just the raw text of the abstract. When open access full text is not available, we usually read the abstract as raw text using the ApiRuler.annotateText API. One thing to note about the abstract is that it has a lot of Unicode characters. Maybe that's relevant?
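To check whether the non-ASCII characters matter, it may help to list them before sending the text. The following is only a minimal sketch and is not part of Reach or any client library: the helper names `non_ascii_chars` and `describe` are hypothetical, and the sample string is made up for illustration.

```python
import unicodedata
from collections import Counter


def non_ascii_chars(text):
    """Count every character in `text` whose code point is above 127."""
    return Counter(ch for ch in text if ord(ch) > 127)


def describe(counter):
    """Yield (character, code point, Unicode name, count), most frequent first."""
    for ch, n in counter.most_common():
        yield ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n


if __name__ == "__main__":
    # Made-up sample; in practice, read the abstract file with encoding="utf-8".
    sample = "tanshinone IIA (5 \u03bcM), P \u2264 0.05"
    for row in describe(non_ascii_chars(sample)):
        print(row)
```

Running this over the attached abstract would show which characters occur and how often, which could then be compared against the normalized text (for example, `mu` in place of the Greek letter).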
@bgyori, when you get a chance, would you please attach a .txt file to this issue with the raw text that you sent to Reach?
Here it is. The text in the file is UTF-8 encoded (I also pasted the raw text in my original question above). PMID27551758_abstract.txt
It seems that this specific sentence is the problem:
Compared with H/R group , H/R+ tanshinone IIA ( 5muM ) group , H/R+ tanshinone IIA ( 50muM ) group H/R+ AG490 ( 50muM ) group and H/R+ AG490 ( 50muM ) + tanshinone IIA ( 50muM ) group had increased cell viability , decreased apoptosis rate , reduced proportions of cells into G0/G1 phase , elevated proportions of cells in S phase and G2/M phase , as well as down-regulated expressions of JAK2 , STAT3 , p53 , Bax , Caspase-3 , pJAK2 and pSTAT3 , elevated expression of Bcl-2 ( all P < 0.05 ) .
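One way to narrow a slow document down to a single sentence like this is to time each sentence separately. This is a minimal sketch under stated assumptions: `process` is a hypothetical placeholder for whatever call sends text to Reach (it is not a real Reach API), and the sentences are assumed to be split already.

```python
import time


def time_sentences(sentences, process):
    """Time `process` on each sentence and return (seconds, sentence) pairs,
    slowest first.

    `process` is a hypothetical callable standing in for the call that
    sends text to the reader; it is not a real Reach API.
    """
    results = []
    for sentence in sentences:
        start = time.perf_counter()
        process(sentence)
        results.append((time.perf_counter() - start, sentence))
    # Sort by elapsed time only, so ties never compare the sentence strings.
    return sorted(results, key=lambda pair: pair[0], reverse=True)
```

The first pair in the returned list is the sentence that took longest, which gives a small reproducible input for profiling.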
According to conversations in today's meeting, the problem here is supposedly not the speed of the parser. We'll look into this further. Thanks for narrowing it down, @marcovzla.
Note that issue #463 is probably another report of this problem.
Profiling this now...
I'm not specifically interested in this abstract, but it is one example on which REACH seems to take a very long time. I'm filing it as an issue because looking at why it's so slow might expose a more general problem. If not, feel free to close this!
https://www.ncbi.nlm.nih.gov/pubmed/27551758