Thanks @kwalcock!
We've seen this behavior before: the text produced from the PDF reader has huge sentences because of lack of punctuation. Some of the components we have depend on sentence length, so they are completely off for very long sentences.
One simple solution I suggest: insert an artificial period whenever the text contains two newlines with no text in between. This usually indicates a new paragraph, and we should treat it as such, even when no punctuation is recognized. Also, I'd like to see an example of these files that take forever to parse.
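For concreteness, a minimal sketch of that idea (the object and method names are made up for illustration, not code from Eidos or processors): it inserts a period before a blank line whenever the preceding character is not already sentence-final punctuation.

```scala
// Hypothetical helper, not part of Eidos: terminate paragraphs that end
// without sentence-final punctuation so the sentence splitter can break there.
object ParagraphBreaker {
  // A character that is not '.', '!', '?', or whitespace, followed by a blank line,
  // marks an unterminated paragraph.
  private val unterminatedParagraph = """([^.!?\s])([ \t]*\n\s*\n)""".r

  def addArtificialPeriods(text: String): String =
    unterminatedParagraph.replaceAllIn(text, "$1.$2")
}
```

Note that if this is done outside the processor, every inserted character shifts the downstream character offsets, which is the provenance concern raised further down in this thread.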
We could have the problem you mention, but this test is being performed with known, fairly well-behaved text. It's pasted below. To make the document longer, I just concatenate another copy. We are probably doing more searching and sorting than before, and perhaps I should comment that code out to see whether it's the culprit.
Food insecurity levels are extremely alarming throughout the country due to conflict, a collapsing economy, low cereal production, poor rainfall in southeastern areas, and the exhaustion of coping capacities after several years of crisis. In Eastern Equatoria - one of the drought-hit areas - the latest food security and nutrition assessments in November 2016 revealed that households experiencing poor food consumption nearly doubled from 19 to 37 percent when compared with the same time in 2015. Rainfall deficits were widely reported in the area, with 93 percent of households indicating this was a major shock during the last agricultural season, impacting pasture and water availability and local food production. Climatic forecasting indicates these areas are likely to experience depressed rainfall between March to May 2017. South Sudan's economic crisis has been driven by the rapidly depreciating value of the South Sudanese Pound (SSP), shortages of hard currency, global declines in oil prices and significant dependence on imports. Conflict, insecurity, market disruption, economic downturn and localized crop failures have caused record high food prices and hunger has spread to locations that were previously stable. Conflict and economic decline have led to violence and displacement. Violence has caused livestock to be looted, killed and disease-prone and crops destroyed, and displacement has caused late planting. These impacts on livestock and crops have resulted in livelihoods being decimated. Food insecurity is becoming more severe across a wider area and is deepening for people already made vulnerable by displacement and conflict. In 2017, food insecurity in Unity, Jonglei and parts of Greater Equatoria and Greater Bahr el Ghazal remained critical as spikes in conflict, economic collapse, and impacts of flooding reduced agricultural production. About three quarters of counties countrywide are expected to face severe food insecurity in the first quarter of 2018. Even after the harvest in late 2017, food prices remained high, and the 2018 lean season is projected to begin early and become worse than in 2017. Particularly for people living in market-dependent urban areas, economic decline has caused a reduction in access to staple food, clean water, and to a variety of foods. The rising cost of living and impact of the conflict have also undermined people's ability to access safe water. It is estimated that only 13 per cent of South Sudanese people have access to improved sanitation, while 85 per cent of the population practice open defecation and only 41 per cent have access to safe water. Families in urban centres have had to spend an increasing portion of their income to obtain clean water, while water trucking has decreased due to the cost of fuel. Borehole repairs have not been possible in areas hardest hit by conflict, including large swathes of Upper Nile, due to lack of access due to insecurity and lack of technical expertise and supplies. Due to conflict and food shortages, more than one in five people in South Sudan have been forced to flee their homes in the past 22 months, including over 1.66 million people who are currently internally displaced and nearly 646,000 people who have fled to neighbouring countries as refugees. Many have been displaced multiple times because of repeated attacks, particularly in counties such as Leer, Koch, Mayendit and Rubkona in Unity State, Fangak and Pigi County in Jonglei, and Malakal and surrounding areas in Upper Nile. 
Persistent insecurity and armed conflict have disrupted livelihood activities, affected market functionality and shrunk physical access to markets. Acute malnutrition has worsened compared to the same period in 2016 due largely to the unprecedented high levels of food insecurity, widespread fighting, displacement causing poor access to services, high morbidity, extremely poor diet (in terms of both quality and quantity), low coverage of sanitation facilities, and poor hygiene practices. While marginal improvements in levels of acute malnutrition are expected up to December 2017 due to consumption of household production, forecasts for 2018 are deeply concerning with over 1.1 million children under five expected to be acutely malnourished and 269 000 children likely to be severely malnourished.
Yes, the sentence boundaries are reasonable here... Can you please do an ablation test to understand where the most significant slowdown is? Is it parsing? In the grammars? In grounding? In the code following the grammars? Thanks!
The program has been running with a profiler since yesterday and there will be results soon. I wonder how the artificial period you suggested previously will affect provenance if it is inserted outside of the processor. Could the situation be handled by what you are working on now with the tokens, some of which are being inserted behind the scenes like ", and"? If some sort of "paragraph breaker" were in the pipeline, it could look for "not end-of-sentence punctuation"-newline-newline and create the artificial token to make sure the (potential) sentence has ended. The forthcoming interval correction would take care of provenance considerations.
In the current design, but perhaps not the new one Becky is working on, the few lines of keepCAGRelevant are a hotspot. This is because Mentions are repeatedly compared to each other there, and a very slow Mention.equals is really the culprit. It uses the hashCode, and hashCode is a def for Mentions, meaning that it is always recalculated. A simple equals can bail at the first unequal field, but hashCode can't, so it is always recomputed in full, making it very slow. I added a local hashCode cache and the execution time for keepCAGRelevant went from 170 seconds to 0.664 seconds. I'm checking whether anything else makes lots of comparisons. This may be a very good place to use a lazy val if the Mention state doesn't change and it doesn't mess up subclasses. That would mean a patch to processors, though.
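As a rough illustration of the caching described here (a standalone sketch with hypothetical fields, not the actual Mention class from processors):

```scala
// Stand-in for a Mention-like class; the point is only that the hash is computed
// once per instance instead of on every equals/distinct call.
class CachedMention(val label: String, val sentence: Int, val tokenInterval: (Int, Int)) {

  // Computed on first access and then reused. This is only safe if the fields
  // feeding the hash never change after construction.
  lazy val cachedHashCode: Int = (label, sentence, tokenInterval).##

  override def hashCode: Int = cachedHashCode

  override def equals(other: Any): Boolean = other match {
    // Comparing the cached hashes first lets most unequal pairs bail out cheaply.
    case that: CachedMention =>
      this.cachedHashCode == that.cachedHashCode &&
        this.label == that.label &&
        this.sentence == that.sentence &&
        this.tokenInterval == that.tokenInterval
    case _ => false
  }
}
```

Whether it is a lazy val or a manually managed cache, the caveat above still applies: it only works if the Mention is effectively immutable and subclasses don't change the fields that feed the hash.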
The next hot spot was State.mergeLuts, which has a line, mentions.distinct, that repeatedly goes through all the mentions and checks for equality. Since this is already in Processors, I added caching to the Mention hashCode. It looks promising so far. Playing with object identity is dangerous, so several people should look at this in the end. The longer text should at least finish.
Nice work, Keith!
[image attached to the comment above] https://user-images.githubusercontent.com/8679738/40395649-fb49c052-5ddd-11e8-8462-c847c30921dc.png
At a recent meeting I think that somebody suggested running through the PDF files that @EgoLaparra has collected. I did a quick pdftotext over the ones I found on Google Drive and see that 2016/White_Paper/v118rd550.txt is about 1100 KB. The guesstimated processing time is around 3.6 hours.
Heads up. So far we've only needed to turn in things that we've processed on our own time. Will we ever need to do something like this while some judge waits for output?
Thanks! Can you explain what these plots mean?
Some lines are just old timings from runs after different kinds of optimization were performed. We should be on the grey curve now, but it is quite an extrapolation. The x-axis units were ~4 KB texts, and I see that I even divided wrong, so here's a version with more standard units. It looks like it might approach 18,000 seconds, or 5 hours, for the largest file. I don't know about the memory requirements, though. We can't go many times higher on that. The napkin check on feasibility leads me to conclude that it will take a lot of patience to process all the files.
Thanks @kwalcock!
I wonder why the growth is super-linear... You did find some issues in the Odin code. Any other ideas? (This is not urgent)
I wonder how some things like List.distinct are implemented. It could easily be n^2. The first tests didn't go far enough up the curve to show an obvious culprit.
The largest file in the 17.5k collection of files on Google Drive is 3740 KB. This isn't going to finish in a reasonable time. There are numerous (>90) 1 MB files that will take around 4 hours each. That's not really reasonable, either. What to do, @MihaiSurdeanu?
Good catch. In Reach, we simply skipped files > 1MB. I suggest the same strategy here. Maybe we can be even more aggressive on this threshold? Say 0.5MB?
This graph attempts to illustrate the relative (file) sizes of the corpora. We began with 10 documents, did some optimization for the collection of 52, and would have a difficult time finishing 17k. Nearly 20% of those files are larger than the largest among the 52 and the largest is more than 10 times larger. One useful strategy would be to sort the files and start with the smallest.
This is not very accurate, especially on the low end, but it shows an estimate of processing time for each file and then total processing time if they are sorted from smallest to largest. Based on this, it would take several thousand hours to process all 17k documents. The smallest 20% of them might be done in around one hour. 40% of them might be processed overnight in 10 hours. (I'm not sure what Excel is doing with the out-of-place orange dot.)
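A small sketch of the two mitigations discussed above, the size threshold and smallest-first ordering (the directory path and exact cutoff are placeholders):

```scala
import java.io.File

// Select the files worth attempting: skip anything over ~0.5 MB (the threshold
// suggested above) and process the smallest files first so results arrive early.
object CorpusSelector extends App {
  val maxBytes = 512 * 1024L
  val corpusDir = new File("corpus")  // placeholder path to the text files

  val selected = corpusDir.listFiles().toSeq
    .filter(f => f.isFile && f.getName.endsWith(".txt"))
    .filter(_.length <= maxBytes)
    .sortBy(_.length)

  selected.foreach(f => println(f"${f.length}%10d  ${f.getName}"))
}
```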
To what extent might this be a solution to indexing: don't include the Indra data in advance? Instead, we index all the text files and include the file names and file sizes in the index. Someone does a search, and the answer is the names and sizes of the files that would need to be processed in order to see something interesting. Something downstream can filter out files that are too large or ones with low scores and run Eidos or Indra on whatever is thought suitable.
Someone has certainly thought of this before. We should divide up large documents into chapters of some reasonable length (e.g., 100k) and process them separately. The resulting mentions should have their sentence numbers, offsets, etc. all cleaned up afterwards. Call them chapters because they will be multiple virtual pages long and are required to end on paragraph boundaries. We would just be missing potential cross-sentence mentions. If need be, the chapter number could be added to the data structure so that we can see afterwards where cross-sentence mentions were prevented from being made. Keeping the parts small will keep us on the horizontal part of the scalability graph. Chapters might be processed in parallel.
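A rough sketch of the chapter idea (hypothetical names; it assumes the 100k figure means roughly 100k characters and that paragraphs are separated by blank lines):

```scala
// Split a long document into "chapters" of at most maxChars characters,
// always breaking on paragraph boundaries. A single paragraph longer than
// maxChars becomes its own oversized chapter rather than being cut mid-paragraph.
object Chapterizer {
  def split(text: String, maxChars: Int = 100000): Seq[String] = {
    val paragraphs = text.split("\n\\s*\n")
    val chapters = scala.collection.mutable.ArrayBuffer.empty[String]
    val current = new StringBuilder

    for (paragraph <- paragraphs) {
      if (current.nonEmpty && current.length + paragraph.length > maxChars) {
        chapters += current.toString
        current.clear()
      }
      if (current.nonEmpty) current.append("\n\n")
      current.append(paragraph)
    }
    if (current.nonEmpty) chapters += current.toString
    chapters.toSeq
  }
}
```

Mentions extracted from each chapter would then need their sentence numbers and character offsets shifted by the chapter's starting position, per the cleanup step described above.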
I like your last point (dividing the files into chapters) for the files we have to process. But for the purpose of this exercise, I would:
Just to be sure... It's Eidos that I've been running. I've been assuming that Patrick was taking the Eidos output and running Indra on it.
No, Indra calls Eidos programmatically. I don't know if they can read Eidos's output into Indra. Can you please ask Patrick if they have this functionality? This would save them some CPU cycles...
Sure, if Eidos already produced some output as JSON-LD, INDRA can just process those into Statements. The reading isn't required to be done with the involvement of INDRA (we just do it that way because then we can control it through a Python API).
I've got a test running to check the feasibility of processing all files that are less than 20,000 bytes, which should be 40% of the collection. It will produce a large file that might be difficult to transfer. Can someone specify which ontologies and which vectors?
@bsharpataz ?
I'll go ahead and use all ontologies and glove.
That's what I would have said, but FWIW, if we see diffs among ontologies, you may want to check with the MESH ontology, which is far larger than any of ours.
It did better than expected and got through all text files with sizes less than 50,000 bytes, which is the smallest 60% of the files. A zip file has been prepared and can be found at GoogleDrive/WorldModelers/resources/corpora/FAO/UN_eidos_jsonld.zip. It's over 1 GB.
As discussed at a recent Monday morning meeting, some of our Apps are expecting input files to have one paragraph per line. Each paragraph is treated as a separate document and processed. This works well with the files that were converted from PDFs by hand, like the 8 or 10 we have been using for testing. With the full set of 52, which were automatically converted, however, this doesn't work well. The PDF converter seems to have output one line per PDF line and not attempted to make paragraphs. These lines are usually not even complete sentences. The way to handle this is to treat the entire file as a document and let the Eidos system extract the sentences. This isn't working well; many files never run to completion.

I made a small and informal test program that uses a text of approximately 4 KB from CAG.fulltext, and I am timing linearly increasing multiples of that text. It doesn't look good. A 16 KB file is taking around 30 minutes, and the largest of the 52 test files is still 10 times larger.
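For reference, the informal test was along these lines (EidosSystem and extractFromText are assumed to be the public entry points; the seed file name is a placeholder):

```scala
import org.clulab.wm.eidos.EidosSystem

// Time linearly increasing multiples of a ~4 KB seed text taken from CAG.fulltext.
object TimingTest extends App {
  val reader = new EidosSystem()
  val base = scala.io.Source.fromFile("cag_seed.txt").mkString  // placeholder file name

  for (multiple <- 1 to 8) {
    val text = base * multiple
    val start = System.nanoTime()
    reader.extractFromText(text)
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"${text.length}%8d chars: $seconds%8.1f s")
  }
}
```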