covidgraph / graph-processing_fragmentize_text

Create Fragment nodes from full text data (publications/patents)

Text Fragger eats a lot of CPU and hangs #3

motey closed this issue 4 years ago

motey commented 4 years ago
[TEXT_FRAGGER]: checking dependencies: ['CORD19', 'LENS_PATENT_DATA']
Run Datasource container 'covidgraph/graph-processing_fragmentize_text'...
Pull image 'covidgraph/graph-processing_fragmentize_text'...
... pull forced, removing old image
...image 'covidgraph/graph-processing_fragmentize_text' pulled.
'covidgraph/graph-processing_fragmentize_text:latest' using image 'sha256:07e082dd8134a9a9ae4c44ee9f0ba10e6bbc4877978b3acfb932d9a556fcca93'
  envs: {'ENV': 'DEV', 'GC_NEO4J_URL': 'bolt://db-dev.covidgraph.org:7687', 'GC_NEO4J_USER': 'neo4j', 'GC_NEO4J_PASSWORD': 'CureCovid46'}
[TEXT_FRAGGER]: 2020-07-06T19:41:40.002324515Z DEBUG:__main__:bolt://db-dev.covidgraph.org:7687

[TEXT_FRAGGER]: 2020-07-06T19:41:40.002363085Z DEBUG:__main__:neo4j

[TEXT_FRAGGER]: 2020-07-06T19:41:40.002377293Z DEBUG:__main__:CureCovid46

[TEXT_FRAGGER]: 2020-07-06T19:41:40.144500659Z DEBUG:__main__:<Graph database=<Database uri='bolt://db-dev.covidgraph.org:7687' secure=False user_agent='py2neo/4.3.0 neobolt/1.7.17 Python/3.8.2-final-0 (linux)'> name='data'>

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309409067Z DEBUG:__main__:Create fragments for BodyText and Abstract

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309428061Z DEBUG:__main__:Create query for label BodyText, text property text

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309492017Z DEBUG:__main__:CALL apoc.periodic.iterate(

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309502278Z "MATCH (text_node:BodyText) WHERE NOT text_node:CollectionHub AND NOT (text_node)-[:HAS_FRAGMENT]-() RETURN text_node",

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309507140Z "WITH text_node,split(text_node.text, '. ') AS frags

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309511389Z WHERE size(frags) > 0

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309515570Z WITH text_node,frags,range(0,size(frags)-1) AS r

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309519453Z WITH text_node,frags,r

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309523262Z FOREACH ( entry in r | CREATE (f:Fragment:FromBodyText) 

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309527462Z                 SET f.text = frags[entry], f.sequence = entry, f.kind = labels(text_node)[0]

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309531606Z                 MERGE (text_node)-[:HAS_FRAGMENT]->(f) )",

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309535660Z {batchSize: 100, iterateList: true, parallel: false}

[TEXT_FRAGGER]: 2020-07-06T19:41:40.309539600Z )

[TEXT_FRAGGER]: 2020-07-06T19:41:41.139365804Z DEBUG:__main__:Create query to link fragments from BodyText

[TEXT_FRAGGER]: 2020-07-06T19:41:41.139416499Z DEBUG:__main__:CALL apoc.periodic.iterate(

[TEXT_FRAGGER]: 2020-07-06T19:41:41.139437430Z     "MATCH (f:Fragment:FromBodyText) WHERE f.sequence > 0 RETURN f",

[TEXT_FRAGGER]: 2020-07-06T19:41:41.139445729Z     "MATCH (f)<--(n)-->(f2:Fragment:FromBodyText)

[TEXT_FRAGGER]: 2020-07-06T19:41:41.139452988Z     WHERE f2.sequence = f.sequence - 1

[TEXT_FRAGGER]: 2020-07-06T19:41:41.139460188Z     MERGE (f2)-[:NEXT]->(f)",

[TEXT_FRAGGER]: 2020-07-06T19:41:41.139466585Z     {batchSize: 50, iterateList: true, parallel: true}

[TEXT_FRAGGER]: 2020-07-06T19:41:41.139472452Z )

[TEXT_FRAGGER]: 2020-07-06T19:41:41.549710862Z DEBUG:__main__:Create query for label Abstract, text property text

[TEXT_FRAGGER]: 2020-07-06T19:41:41.549795579Z DEBUG:__main__:CALL apoc.periodic.iterate(

[TEXT_FRAGGER]: 2020-07-06T19:41:41.549824752Z "MATCH (text_node:Abstract) WHERE NOT text_node:CollectionHub AND NOT (text_node)-[:HAS_FRAGMENT]-() RETURN text_node",

[TEXT_FRAGGER]: 2020-07-06T19:41:41.549835252Z "WITH text_node,split(text_node.text, '. ') AS frags

[TEXT_FRAGGER]: 2020-07-06T19:41:41.549843283Z WHERE size(frags) > 0

[TEXT_FRAGGER]: 2020-07-06T19:41:41.549851129Z WITH text_node,frags,range(0,size(frags)-1) AS r

[TEXT_FRAGGER]: 2020-07-06T19:41:41.549858433Z WITH text_node,frags,r

[TEXT_FRAGGER]: 2020-07-06T19:41:41.549865743Z FOREACH ( entry in r | CREATE (f:Fragment:FromAbstract) 

[TEXT_FRAGGER]: 2020-07-06T19:41:41.549888743Z                 SET f.text = frags[entry], f.sequence = entry, f.kind = labels(text_node)[0]

[TEXT_FRAGGER]: 2020-07-06T19:41:41.549896803Z                 MERGE (text_node)-[:HAS_FRAGMENT]->(f) )",

[TEXT_FRAGGER]: 2020-07-06T19:41:41.549904660Z {batchSize: 100, iterateList: true, parallel: false}

[TEXT_FRAGGER]: 2020-07-06T19:41:41.549912070Z )

[TEXT_FRAGGER]: 2020-07-06T19:43:17.539913056Z DEBUG:__main__:Create query to link fragments from Abstract

[TEXT_FRAGGER]: 2020-07-06T19:43:17.539960171Z DEBUG:__main__:CALL apoc.periodic.iterate(

[TEXT_FRAGGER]: 2020-07-06T19:43:17.539974422Z     "MATCH (f:Fragment:FromAbstract) WHERE f.sequence > 0 RETURN f",

[TEXT_FRAGGER]: 2020-07-06T19:43:17.539986609Z     "MATCH (f)<--(n)-->(f2:Fragment:FromAbstract)

[TEXT_FRAGGER]: 2020-07-06T19:43:17.539997435Z     WHERE f2.sequence = f.sequence - 1

[TEXT_FRAGGER]: 2020-07-06T19:43:17.540008007Z     MERGE (f2)-[:NEXT]->(f)",

[TEXT_FRAGGER]: 2020-07-06T19:43:17.540018425Z     {batchSize: 50, iterateList: true, parallel: true}

[TEXT_FRAGGER]: 2020-07-06T19:43:17.540028400Z )

[TEXT_FRAGGER]: 2020-07-06T19:43:57.299137449Z DEBUG:__main__:Create query for label PatentDescription, text property text

[TEXT_FRAGGER]: 2020-07-06T19:43:57.299164642Z DEBUG:__main__:CALL apoc.periodic.iterate(

[TEXT_FRAGGER]: 2020-07-06T19:43:57.299172234Z "MATCH (text_node:PatentDescription) WHERE NOT text_node:CollectionHub AND NOT (text_node)-[:HAS_FRAGMENT]-() RETURN text_node",

[TEXT_FRAGGER]: 2020-07-06T19:43:57.299180979Z "WITH text_node,split(text_node.text, '. ') AS frags

[TEXT_FRAGGER]: 2020-07-06T19:43:57.299187800Z WHERE size(frags) > 0

[TEXT_FRAGGER]: 2020-07-06T19:43:57.299193047Z WITH text_node,frags,range(0,size(frags)-1) AS r

[TEXT_FRAGGER]: 2020-07-06T19:43:57.299197773Z WITH text_node,frags,r

[TEXT_FRAGGER]: 2020-07-06T19:43:57.299202278Z FOREACH ( entry in r | CREATE (f:Fragment:FromPatentDescription) 

[TEXT_FRAGGER]: 2020-07-06T19:43:57.299206998Z                 SET f.text = frags[entry], f.sequence = entry, f.kind = labels(text_node)[0]

[TEXT_FRAGGER]: 2020-07-06T19:43:57.299211691Z                 MERGE (text_node)-[:HAS_FRAGMENT]->(f) )",

[TEXT_FRAGGER]: 2020-07-06T19:43:57.299216611Z {batchSize: 100, iterateList: true, parallel: false}

[TEXT_FRAGGER]: 2020-07-06T19:43:57.299221189Z )

[TEXT_FRAGGER]: 2020-07-06T20:12:39.855256427Z DEBUG:__main__:Create query to link fragments from PatentDescription

[TEXT_FRAGGER]: 2020-07-06T20:12:39.855296145Z DEBUG:__main__:CALL apoc.periodic.iterate(

[TEXT_FRAGGER]: 2020-07-06T20:12:39.855305372Z     "MATCH (f:Fragment:FromPatentDescription) WHERE f.sequence > 0 RETURN f",

[TEXT_FRAGGER]: 2020-07-06T20:12:39.855313057Z     "MATCH (f)<--(n)-->(f2:Fragment:FromPatentDescription)

[TEXT_FRAGGER]: 2020-07-06T20:12:39.855320748Z     WHERE f2.sequence = f.sequence - 1

[TEXT_FRAGGER]: 2020-07-06T20:12:39.855327430Z     MERGE (f2)-[:NEXT]->(f)",

[TEXT_FRAGGER]: 2020-07-06T20:12:39.855334777Z     {batchSize: 50, iterateList: true, parallel: true}

[TEXT_FRAGGER]: 2020-07-06T20:12:39.855341766Z )

This is the output of the last run of the text fragger. It hangs at 20:12, eats up all 8 CPU cores on the DB VM, and runs "forever".

mpreusse commented 4 years ago

The text fragger hangs at the point where it creates NEXT relationships between :Fragment nodes for :PatentDescription. Those texts are much longer than all the other texts (publications), so each one is split into many more fragments. A quick fix is to decrease the batchSize so that fewer :Fragment nodes are matched in one transaction.
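
A minimal sketch of that quick fix, assuming the linking query is built as a string in Python as the log suggests. The helper `build_link_query` and the `LINK_BATCH_SIZES` table are hypothetical names, and the value 5 for :PatentDescription is only an illustrative guess, not a tuned number. The Cypher itself is copied from the log above; only `batchSize` varies per label.

```python
# Hypothetical per-label batch sizes for the NEXT-linking step.
# 50 is the value seen in the log; 5 is an untested, illustrative guess.
LINK_BATCH_SIZES = {"BodyText": 50, "Abstract": 50, "PatentDescription": 5}


def build_link_query(label, batch_size, parallel=True):
    """Return the apoc.periodic.iterate call that links consecutive fragments.

    The query text mirrors the one printed in the log; only batchSize
    (and optionally parallel) is made configurable.
    """
    fragment = f"Fragment:From{label}"
    return (
        "CALL apoc.periodic.iterate(\n"
        f'    "MATCH (f:{fragment}) WHERE f.sequence > 0 RETURN f",\n'
        f'    "MATCH (f)<--(n)-->(f2:{fragment})\n'
        "    WHERE f2.sequence = f.sequence - 1\n"
        '    MERGE (f2)-[:NEXT]->(f)",\n'
        f"    {{batchSize: {batch_size}, iterateList: true, "
        f"parallel: {str(parallel).lower()}}}\n"
        ")"
    )


if __name__ == "__main__":
    # Print the query for the label that caused the hang.
    label = "PatentDescription"
    print(build_link_query(label, LINK_BATCH_SIZES[label]))
```

The resulting string could then be passed to whatever call the fragger already uses to run its queries; that part is left out because the log does not show it.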