ESBigeard / paper_graph

Dev/tools repo for a project that mines scientific papers to construct graphs

Paper Graph

The tool used to convert PDF articles into TEI XML files is GROBID.
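
To give an idea of what the downstream scripts work with, here is a minimal sketch of reading a TEI XML file with the Python standard library. The TEI namespace is the standard one; the function name `read_tei` and the tiny sample document are made up for illustration and are not taken from this repo.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace, used by the TEI XML files
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def read_tei(xml_text):
    """Return (title, paragraphs) from a TEI XML string (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    # Collapse runs of whitespace inside each body paragraph
    paragraphs = [
        " ".join("".join(p.itertext()).split())
        for p in root.iterfind(".//tei:text/tei:body//tei:p", TEI_NS)
    ]
    return title, paragraphs


# Hand-written sample standing in for a real converted article
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>A sample paper</title></titleStmt></fileDesc></teiHeader>
  <text><body><div><head>Introduction</head><p>First   paragraph.</p><p>Second one.</p></div></body></text>
</TEI>"""

title, paragraphs = read_tei(SAMPLE)
print(title)
print(paragraphs)
```

Real files contain much more (authors, references, section heads), so the XPath expressions would likely need adjusting.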

Working pipeline

From canceropole PDF articles to the website showing the graph

 +--------------------+
 |                    |
 |   PDF articles     |
 |                    |
 +---------+----------+
           |
 +---------v----------+
 |                    |
 |     GROBID         |
 |                    |
 +---------+----------+
           |
           |        +----------------------------------+          +------------------------------------+
           |        |  generate_html_article_pages.py  |          | html pages with                    |
           |    +--->                                  +--------->+ the text of the articles           |
           |    |   +----------------------------------+          | and important sentences in yellow  |
           |    |                                                 |                                    |
           |    |                                                 +------------------------------------+
           |    |
 +---------v----+-----+          +--------------------------------------+
 |                    |          | utilsperso.edif_idf()                |
 |      TEI XML files +--------->+ (check the bottom of utilsperso.py   |
 |                    |          | there's a few lines that allow       |
 +--------+-----------+          | standalone launching of              |
          |                      | edit_idf())                          |
+---------v-----------+          |                                      |
|                     |          +-----------------+--------------------+
| generate_gephi_csv.py                            |
|                     |                            |
+----------+----------+            +---------------v----------------+
           |                       | idf.pickle                     |
 +---------v----------+            | I've added an idf file in the  |
 |  nodes.csv         |            | git for convenience, but       |
 |  edges.csv         <------------+ a new one should be            |
 |                    |            | generated for each corpus      |
 +---------+----------+            |                                |
           |                       +--------------------------------+
 +---------v-----------+
 | aman's script       |
 | adds coordinates for|
 | similarity view     |
 | and similar nodes   |
 |                     |
 +----------+----------+
            |
 +----------v-------------------------+
 |  convert_id_to_tile.py             |
 |  Aman's script gives similar nodes |
 |  as ID. this converts to           |
 |  node label                        |
 |                                    |
 +---------+--------------------------+
           |
     +-----v-----+
     |   Gephi   |
     +-----+-----+
           |
    +------v------+
    |GEXF XML file|
    +------+------+
           |
+----------v-----------------------+
| this javascript website          |
|https://github.com/raphv/gexf-js  |
| with small changes               |
+----------------------------------+
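
As the diagram says, a new idf.pickle should be generated for each corpus. The sketch below shows one common way to compute and pickle an IDF table. It is not the code in utilsperso.py: the tokenization, the log(N / df) formula and the function names `compute_idf`, `save_idf` and `load_idf` are all assumptions made for illustration.

```python
import math
import pickle
import re
from collections import Counter


def compute_idf(documents):
    """Map each token to log(N / df), df = number of documents containing it."""
    doc_freq = Counter()
    for text in documents:
        # Count each token once per document (assumed tokenization: \w+ on lowercased text)
        doc_freq.update(set(re.findall(r"\w+", text.lower())))
    n_docs = len(documents)
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


def save_idf(idf, path):
    """Pickle the IDF dictionary to `path`."""
    with open(path, "wb") as handle:
        pickle.dump(idf, handle)


def load_idf(path):
    """Load a pickled IDF dictionary from `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


corpus = ["tumor cells grow", "tumor growth", "cells divide"]
idf = compute_idf(corpus)
print(sorted(idf.items()))
```

Tokens that appear in every document get an IDF of 0 with this formula, which is why the table depends on the corpus it was built from.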

For the paper, from several corpora (GSM, DBLP, ACL anthology) to .dat files

generate_aman_features.extract_acm()

glove: /home/sam/work/glove

Main scripts and useful stuff

bibliography

generate_gephi_csv.py: main script for canceropole. Takes a folder of TEI XML files generated by GROBID and outputs nodes.csv and edges.csv, ready for Gephi.
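
For reference, here is a minimal sketch of what nodes.csv and edges.csv can look like for Gephi's spreadsheet import, which recognizes `Id`/`Label` columns for nodes and `Source`/`Target`/`Type`/`Weight` columns for edges. The function `write_gephi_csv` and the sample data are hypothetical; the actual columns written by the script may differ.

```python
import csv
import io


def write_gephi_csv(nodes, edges, nodes_file, edges_file):
    """nodes: {id: label}; edges: iterable of (source_id, target_id, weight)."""
    node_writer = csv.writer(nodes_file)
    node_writer.writerow(["Id", "Label"])
    for node_id, label in nodes.items():
        node_writer.writerow([node_id, label])

    edge_writer = csv.writer(edges_file)
    edge_writer.writerow(["Source", "Target", "Type", "Weight"])
    for source, target, weight in edges:
        # Assumption: similarity edges between papers are undirected
        edge_writer.writerow([source, target, "Undirected", weight])


# In-memory files keep the example self-contained; use open(...) for real output
nodes_out, edges_out = io.StringIO(), io.StringIO()
write_gephi_csv(
    {"p1": "Paper one", "p2": "Paper two"},
    [("p1", "p2", 0.8)],
    nodes_out,
    edges_out,
)
print(nodes_out.getvalue())
print(edges_out.getvalue())
```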

utilsperso.py: necessary to make anything else run

generate_html_article_pages.py: creates an HTML page for each article, with the main sentences highlighted in yellow
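
A rough sketch of the highlighting idea, assuming the important sentences have already been selected by some other step. How the real script picks them is not shown here; `render_article` and its arguments are made up for illustration.

```python
import html


def render_article(title, sentences, important):
    """Build an HTML page; sentences whose index is in `important` get a yellow background."""
    parts = []
    for index, sentence in enumerate(sentences):
        text = html.escape(sentence)  # escape <, > and & coming from the article text
        if index in important:
            text = f'<span style="background-color: yellow">{text}</span>'
        parts.append(text)
    body = " ".join(parts)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body><h1>{html.escape(title)}</h1><p>{body}</p></body></html>\n"
    )


page = render_article(
    "A sample paper",
    ["Background text.", "Key result: p < 0.05.", "Closing remark."],
    important={1},
)
print(page)
```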

convert_id_to_tile.py: for the most similar nodes added by Aman's script, replaces the ID of each node with its label
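
The ID-to-label replacement can be sketched as a simple lookup. This assumes a nodes CSV with `Id` and `Label` columns and a plain list of similar-node IDs; the real file layout and the helper names `load_labels` and `ids_to_labels` are assumptions, not the script's actual code.

```python
import csv
import io


def load_labels(nodes_file):
    """Read an Id -> Label mapping from a nodes CSV with 'Id' and 'Label' columns."""
    return {row["Id"]: row["Label"] for row in csv.DictReader(nodes_file)}


def ids_to_labels(similar_ids, labels):
    """Replace each known ID by its label; unknown IDs are kept unchanged."""
    return [labels.get(node_id, node_id) for node_id in similar_ids]


# In-memory sample standing in for a real nodes.csv
nodes_csv = io.StringIO("Id,Label\np1,Paper one\np2,Paper two\n")
labels = load_labels(nodes_csv)
converted = ids_to_labels(["p2", "p9"], labels)
print(converted)
```

Keeping unknown IDs as-is makes missing nodes easy to spot in the output instead of failing halfway through.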

generate_aman_features.py: for the paper, generates the .dat files that Aman uses to run the experiments